diff --git a/.agents/llama-cpp-localai-paged-backend.md b/.agents/llama-cpp-localai-paged-backend.md new file mode 100644 index 000000000000..a5aa30f01aaa --- /dev/null +++ b/.agents/llama-cpp-localai-paged-backend.md @@ -0,0 +1,143 @@ +# llama-cpp-localai-paged Backend (paged attention + Blackwell NVFP4 decode) + +`llama-cpp-localai-paged` is LocalAI's **CUDA-only** paged-attention variant of the +llama.cpp backend. It targets high-concurrency decode for the Qwen3.6 hybrid +gated-DeltaNet (SSM) models on Blackwell (GB10 / DGX Spark). It reuses the stock +`llama-cpp` backend's sources and applies a vendored patch series on top at build +time. It is **not** a fork: a source-only `*.patch` stack plus one canonical doc. + +**Canonical reference:** `backend/cpp/llama-cpp-localai-paged/README.md` +(architecture, the patch series 0001-0030, benchmarks, dev notes, generality, +pin/canary policy). Read it for any technical detail; this guide is the maintenance +how-to. + +## Where things live + +- `backend/cpp/llama-cpp-localai-paged/Makefile` - the thin wrapper. It copies the + stock `backend/cpp/llama-cpp/` build infra into a build dir, clones llama.cpp at + this backend's **own** pin (`LLAMA_VERSION`), applies the paged series via the + `apply-paged-patches` define (strict `git apply`), then builds `grpc-server`. +- `backend/cpp/llama-cpp-localai-paged/patches/paged/` - the source-only `.patch` + series (0001-0030), nothing else. +- `backend/cpp/llama-cpp-localai-paged/README.md` - the canonical doc. The + operational docs (`PAGED_BITEXACT_NOTE.md`, `UPSTREAM_LAYER2_SCOPE.md`) and + dev artifacts live in + `backend/cpp/llama-cpp-localai-paged/docs/`. +- `backend/Dockerfile.llama-cpp-localai-paged`, `.docker/llama-cpp-localai-paged-compile.sh` + - the CUDA build entry points. +- `backend/cpp/llama-cpp/` - the **stock** backend, pure upstream. It carries no + paged patches. 
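The "source-only `.patch` series, nothing else" property listed above can be checked mechanically. What follows is a minimal sketch, not existing repo tooling: it assumes every patch is `git format-patch` output, so each touched file appears on a `diff --git a/PATH b/PATH` header line. The function name `check_source_only` and the choice to flag only `*.md` paths are hypothetical, picked for illustration.

```shell
# Hypothetical helper (an assumption, not part of the backend's tooling):
# fail if any *.patch in a directory touches a Markdown file.
# Assumes `git format-patch` output, where each touched file is introduced by
# a "diff --git a/PATH b/PATH" header line.
check_source_only() {
    local dir="$1" rc=0 p bad
    for p in "$dir"/*.patch; do
        [ -e "$p" ] || continue   # empty directory: the glob stayed literal
        # Extract the b/ side of every diff header, keep only *.md paths.
        bad=$(sed -n 's|^diff --git a/.* b/\(.*\)$|\1|p' "$p" | grep '\.md$' || true)
        if [ -n "$bad" ]; then
            echo "not source-only: $(basename "$p") touches: $bad"
            rc=1
        fi
    done
    return "$rc"
}
```

Under these assumptions, running `check_source_only backend/cpp/llama-cpp-localai-paged/patches/paged` before committing a regenerated series would catch a stray dev-doc hunk. It complements the strict `git apply` in the build; it does not replace it.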
+ +## Invariants (do not break these) + +- **Stock stays pure.** The paged patches live ONLY in this backend. Never add a + `patches/paged/` dir or `LLAMA_PAGED` logic to `backend/cpp/llama-cpp/`. +- **CUDA-only.** Ship cublas/cuda targets only. Off-CUDA the fusions are gated off + (patch 0030) and NVFP4 falls back to dequant, so the backend is neutral-to- + slightly-negative there - non-CUDA users use the stock `llama-cpp`. Do not add + cpu/vulkan/sycl/metal rows for this backend in `.github/backend-matrix.yml`. + (Those builds also fail to link `grpc-server` on darwin/arm64 against upstream + `stream_*` server symbols - another reason it is CUDA-only.) +- **Source-only patches.** A `.patch` may touch only llama.cpp source - never a + dev doc or `*.md`. Strict `git apply` on a clean checkout must reach exit 0. (A + stray `SSM_DECODE_FIX_RESULTS.md` hunk in patch 0019 once broke the CI build.) +- **Bit-exact by default.** Every shipped patch is byte-identical to the f32 + baseline. (The one opt-in precision trade, `ssm_bf16_tau` / patch 0026, was + DROPPED: it went flat once the decode fusions landed - forcing all gated-DeltaNet + heads to bf16 gave 780.6 vs 780.0 t/s, zero benefit - so the series is now + bit-exact end to end. Do not reintroduce a per-head SSM-precision lever; see the + rejected-levers note in the backend README section 5.) + +## Fork-first workflow (MANDATORY) + +The fork **`mudler/llama.cpp` branch `localai-paged`** is the CANONICAL source +of truth for ALL paged-backend kernel and patch work. The vendored +`patches/paged/*.patch` series is a **derivative**: the fork is the source, the +series is a generated mirror of it. + +**Always update the fork FIRST, in this exact order:** + +1. **Commit the change on the `localai-paged` branch and push it.** Every + kernel or patch change lands as a fork commit first. +2. 
**Then regenerate the LocalAI series from the fork** via `git format-patch` + (one patch per fork commit, source-only) into + `backend/cpp/llama-cpp-localai-paged/patches/paged/`, so the series stays a + **1:1, drift-free mirror** of the branch. + +Hard rules, no exceptions: + +- **NEVER edit the `patches/paged/*.patch` files directly.** They are generated + output, not source. +- **NEVER add a patch to the series that has no corresponding fork-branch + commit.** Every `.patch` must be the `git format-patch` of a real commit on + `localai-paged`. +- The fork branch is **where the build and the per-path bit-exact md5 gate + actually run**, so it is the **only** place a change is truly validated. A + patch living only in the LocalAI series has never been built or gated. + +Verify the mirror by tree hash: applying the full on-disk series on the pin +must reproduce the fork branch tree byte-for-byte. (The patch maintenance +detail is in `backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md`; +the hard-gate is section 2.5 of `docs/PARITY_HANDOFF.md`.) + +## Maintaining the pin against new llama.cpp + +The pin (`LLAMA_VERSION` in the wrapper Makefile) is advanced ONLY by the manual +pin-sync. It is deliberately **excluded from the nightly auto-bumper** +(`bump_deps.yaml`): a naive bump would shift the tree out from under the patches +and break `git apply` at build time. + +1. **The canary tells you when to sync.** `.github/workflows/llama-cpp-paged-canary.yml` + runs weekly: it applies + builds the series against the latest upstream tip and + goes **red** when upstream drifts past the patches. Canary red -> run a pin-sync. +2. 
**The pin-sync** (recorded in the README section 7 and git history): rebase the series onto the new + tip (resolve conflicts; re-export **source-only** with a pathspec like + `-- src/ ggml/ common/ include/ tools/ tests/ cmake/`), rebuild on a CUDA box, + pass the bit-exact gate on **every** path + `test-backend-ops`, **and confirm + the full grpc-server build/link is green on CI**, then bump `LLAMA_VERSION`. + +**Hard constraint: keep the pin == the stock `llama-cpp` pin.** `grpc-server.cpp` +is shared with the stock backend and tracks the stock pin. A paged pin that +diverges PAST an upstream server-API refactor breaks the grpc-server LINK even +when the patches are byte-for-byte bit-exact - the bit-exact gate alone does NOT +catch it. The `c299a92c` bump did exactly this (patches applied + greedy-md5 +bit-exact, but `grpc-server.cpp` failed to link with undefined `stream_*` server +helpers the refactor pulled into its headers), so it was reverted to `9d5d882d`. +A pin bump is shippable only once the full CI grpc-server build is green, which in +practice means moving in lockstep with the stock pin (or vendoring a +pin-matched grpc-server.cpp, which we deliberately do not, to keep stock pure). + +## The bit-exact gate (run for every change) + +- greedy md5: `llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1 next patch number (gaps 0005/0027 are intentional). Update + the README's patch table and dev notes - keep the README the single doc; do not + scatter `*_RESULTS.md` files. +- Record rejected/flat levers in the README too (they stop the next person from + re-running dead ends). + +## Follow-ups (Metal / SYCL / Vulkan) + +The decode fusions are implemented for **CUDA + CPU only**. The base +gated-DeltaNet + SSM_CONV ops already exist upstream on Metal, SYCL, and Vulkan, +so the models **run** there via the non-fused path - what is missing is the +fusion speedup. 
Porting it (strictly mirroring the CUDA kernels, since we have no +Metal/SYCL/Vulkan hardware to test on here) is scoped in `docs/UPSTREAM_LAYER2_SCOPE.md` +(recommended order: Metal, then SYCL, then Vulkan; ops-first upstream PR, then one +PR per backend, each gated by `test-backend-ops` on the target hardware). The +methodology for that work is in [.agents/vllm-parity-methodology.md](vllm-parity-methodology.md). diff --git a/.agents/vllm-parity-methodology.md b/.agents/vllm-parity-methodology.md new file mode 100644 index 000000000000..f58218f84f4e --- /dev/null +++ b/.agents/vllm-parity-methodology.md @@ -0,0 +1,101 @@ +# Methodology: Closing the vLLM Decode-Throughput Gap in llama.cpp + +This is the playbook that took the paged backend +([.agents/llama-cpp-localai-paged-backend.md](llama-cpp-localai-paged-backend.md)) +from ~38% of vLLM decode to **parity-to-ahead on dense** (and a proven, honest +ceiling on MoE) on GB10. Use it for any "make llama.cpp match or beat engine X on +accelerator Y" effort. The *levers* are model- and hardware-specific; the +*discipline* below is not. The worked example, with all numbers, is the paged +backend README. + +## The core loop + +1. **Establish a bit-exact baseline and gate FIRST.** Record the greedy md5 (per + path) and an f32 reference. Every optimization must stay byte-identical to it - + or ship as an explicit, default-off precision opt-in. This is what lets you + optimize aggressively without silently regressing quality. Gate two ways: + greedy md5, and `test-backend-ops` against the CPU oracle. + +2. **Profile - do not assume.** nsys the steady-state decode step, broken down per + *kernel* AND per *memcpy*. Find the dominant cost. "It's the GEMM" was wrong + here: on hybrid gated-DeltaNet models the bottleneck was the recurrent-state + **plumbing** (state memcpy + gathers, ~67% of the step), not the weight GEMM. 
+ Also sanity-check GPU-busy %: an early "low utilization" reading was a profiling + window artifact (decode was 96-99% GPU-busy), not real idle. + +3. **Ground-truth BOTH engines.** Decompose *your* decode step AND the + competitor's, side by side, per bucket, and compute the per-bucket delta. This + tells you WHERE the gap actually is - not where you would guess. It overturned + premises here: e.g. vLLM does NOT run the GDN/attn projections as NVFP4 (it + keeps them bf16, same as us); the MoE expert GEMM was a llama *win*, not the gap. + +4. **Per-lever discipline.** For each candidate: implement -> bit-exact gate -> + same-harness A/B bench. Use a runtime env-toggle (flag off vs on) ONLY for + levers that are actually runtime-gated; a lever **compiled into** the binary + (e.g. the SSM decode fusions here) is NOT isolated by a runtime flag, so measure + it build-vs-build. The full-patchset "stock" baseline likewise needs a + **separately-built unpatched binary at the same pin** - toggling the runtime + flag on the patched binary does not reproduce stock (it measures only the gated + part; here that was ~neutral, which is exactly how this gotcha hides). Bank only + what lifts AND gates. **Record every rejected or flat lever with the reason** - + over time this is the most valuable part: it stops the next person re-running + dead ends. + +5. **Name the structural floor.** Prove the bit-exact ceiling exhaustively (every + lever measured, not assumed). What remains is physical - the memory-bandwidth + floor, the irreducible serial-SSM host loop (sampling can't start until logits + land). Name it; do not claim more than you measured. + +## Hard rules learned + +- **Apples-to-apples, or label it.** Stock-vs-patched on the SAME harness + (`llama-batched-bench`) is exact - lead with it. But "stock" must be a + separately-built unpatched binary at the SAME pin, NOT the patched binary with + the runtime flag off (compiled-in wins survive the toggle). 
Cross-engine "% of vLLM" + (batched-bench vs vLLM server+client) is *indicative*; always caveat the harness + and config (context length alone shifted the MoE figure 76% <-> 86%). +- **Re-measure a "win" after later levers land - it may evaporate.** bf16 SSM + state (the `ssm_bf16_tau` lever) benched +12% early and failed the f32 KL gate + (vLLM keeps f32 too), so it was kept default-off opt-in. Once the decode fusions + (recurrent-state gather-fusion + block-table cache) landed, a clean re-measure + forcing ALL gated-DeltaNet heads to bf16 (`tau=100000`) went **flat** - 780.6 vs + 780.0 t/s. The "+12%" was subsumed by the fusions: the lever bought nothing, so + it was **dropped** (precision trade + bug surface + extra CUDA template-instantiation + compile cost, zero benefit). A win measured before the rest of the series is not a + win after it. +- **Reject the obvious-but-wrong, with evidence.** A faster kernel that is off the + critical path benches FLAT (the freed time becomes idle). Quantizing the bf16 + projections to NVFP4 cost ~6% PPL - and vLLM keeps them bf16 for the same reason. + Always measure before believing; a plausible mechanism is not a result. +- **The gate can be per-path.** Paged vs non-paged attention legitimately produces + different (equivalent) FP-reduction orders; validate the difference is benign + (KLD to f32) and then gate each path against its own reference. + +## Orchestration (multi-agent) + +- **One GPU profiler/bencher at a time** (the GPU-contention rule). Parallel + design/analysis/read agents are fine; concurrent GPU benches pollute each other's + numbers. +- **Adversarial verify.** Before banking a finding, spawn skeptics prompted to + *refute* it; majority-refute kills it. Prevents plausible-but-wrong results. +- **Anti-punt.** Use foreground, blocking ssh loops with short benches and a + progress-file checkpoint. Agents that background work and "wait for the monitor + event" stall - forbid that pattern. 
+- **GPU coexistence.** On a shared host, stop the user's deployments for a clean + benchmark window (with their OK) and ALWAYS restore them (wrap the bench so a + failure cannot strand them). + +## What generalizes (and what doesn't) + +The *speedups* may be hardware-specific (here: CUDA/Blackwell - the SSM fusions, +NVFP4 FP4-MMA, the occupancy tune), which is why other accelerators did not +benefit. But the *findings* often generalize and are worth upstreaming: the +"decode is plumbing-bound, not GEMM-bound" insight and the bit-exact, CPU-mirrored +fusion ops help any backend running these models. Separate "ship our tuned backend" +from "upstream the portable op" - they are different deliverables. + +## The closing record + +Write up the result HONESTLY: the shipped wins, the rejected levers (with reasons), +the structural ceiling, and the cross-backend / cross-quant generality. Negative +results are as valuable as wins. The paged backend README is the template. diff --git a/.docker/llama-cpp-localai-paged-compile.sh b/.docker/llama-cpp-localai-paged-compile.sh new file mode 100755 index 000000000000..8254ad691570 --- /dev/null +++ b/.docker/llama-cpp-localai-paged-compile.sh @@ -0,0 +1,39 @@ +#!/usr/bin/env bash +# Shared compile logic for backend/Dockerfile.llama-cpp-localai-paged. +# Sourced (via bind mount) from both builder-fromsource and builder-prebuilt stages. 
+ +set -euxo pipefail + +export CCACHE_DIR=/root/.ccache +ccache --max-size=5G || true +ccache -z || true + +export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache" + +if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then + CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}" + export CMAKE_ARGS="${CMAKE_ARGS} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}" + echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}" + rm -rf /LocalAI/backend/cpp/llama-cpp-localai-paged-*-build +fi + +cd /LocalAI/backend/cpp/llama-cpp-localai-paged + +if [ -z "${BUILD_TYPE:-}" ]; then + # Pure CPU image: one ggml CPU_ALL_VARIANTS build replaces the per-microarch binaries. + # arm64: the armv9.2 SME variants need gcc-14 (gcc-13 rejects +sme). + if [ "${TARGETARCH}" = "arm64" ]; then + apt-get update -qq && apt-get install -y -qq gcc-14 g++-14 + export CC=gcc-14 CXX=g++-14 + fi + make llama-cpp-localai-paged-cpu-all +else + # GPU build (cublas/hipblas/sycl/vulkan/...): single fallback CPU build, the accelerator + # does the compute. Keeps the GPU compile from also building the CPU variant matrix and + # avoids the gcc-14 apt step on GPU base images such as nvidia l4t. + make llama-cpp-localai-paged-fallback +fi +make llama-cpp-localai-paged-grpc +make llama-cpp-localai-paged-rpc-server + +ccache -s || true diff --git a/.github/backend-matrix.yml b/.github/backend-matrix.yml index a497a72c1e56..94b12717a683 100644 --- a/.github/backend-matrix.yml +++ b/.github/backend-matrix.yml @@ -5177,6 +5177,39 @@ include: dockerfile: "./backend/Dockerfile.golang" context: "./" ubuntu-version: '2404' + # llama-cpp-localai-paged: the LocalAI paged-attention llama.cpp variant. Each + # row mirrors the corresponding llama-cpp row with backend/dockerfile/tag-suffix + # swapped; builder-base-image is left UNCHANGED so these reuse the same + # base-grpc-* prebuilt bases (same gRPC + same toolchain), needing no new + # base-images.yml variant. 
+ - build-type: 'cublas' + cuda-major-version: "13" + cuda-minor-version: "0" + platforms: 'linux/amd64' + tag-latest: 'auto' + tag-suffix: '-gpu-nvidia-cuda-13-llama-cpp-localai-paged' + builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-amd64' + runs-on: 'bigger-runner' + base-image: "ubuntu:24.04" + skip-drivers: 'false' + backend: "llama-cpp-localai-paged" + dockerfile: "./backend/Dockerfile.llama-cpp-localai-paged" + context: "./" + ubuntu-version: '2404' + - build-type: 'cublas' + cuda-major-version: "13" + cuda-minor-version: "0" + platforms: 'linux/arm64' + skip-drivers: 'false' + tag-latest: 'auto' + tag-suffix: '-nvidia-l4t-cuda-13-arm64-llama-cpp-localai-paged' + builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-arm64' + base-image: "ubuntu:24.04" + runs-on: 'ubuntu-24.04-arm' + ubuntu-version: '2404' + backend: "llama-cpp-localai-paged" + dockerfile: "./backend/Dockerfile.llama-cpp-localai-paged" + context: "./" # Darwin matrix (consumed by backend-jobs-darwin). includeDarwin: diff --git a/.github/scripts/paged-canary-apply.sh b/.github/scripts/paged-canary-apply.sh new file mode 100755 index 000000000000..5311f05410a2 --- /dev/null +++ b/.github/scripts/paged-canary-apply.sh @@ -0,0 +1,77 @@ +#!/usr/bin/env bash +# +# paged-canary-apply.sh - apply the vendored paged-attention patch series +# (backend/cpp/llama-cpp-localai-paged/patches/paged/0001-0030) to a llama.cpp checkout, the +# same way the build does, but tolerating the ONE known-benign pre-existing +# quirk in the series. Used by the early-warning canary +# (.github/workflows/llama-cpp-paged-canary.yml) so it only goes red on a REAL +# upstream break, never on that quirk. +# +# Usage: paged-canary-apply.sh CHECKOUT PATCHES +# PATCHES is normally backend/cpp/llama-cpp-localai-paged/patches (it holds the +# top-level base series 0*.patch, currently empty, and the paged/ subseries). +# +# Exit 0 = the whole series applied -> patches still fit upstream. 
+# Exit !=0 = a patch failed to apply = the red signal: an upstream change moved +# the tree out from under the patches, so it is time to run a PIN_SYNC. +# +# Apply method MIRRORS backend/cpp/llama-cpp/Makefile's `llama.cpp` target: +# plain `git apply --verbose`, which natively tolerates @@ line-number offsets +# but NOT context-line changes. Matching the build's method is the point - the +# canary's apply result is exactly what the real build's apply would do. +# +# The ONLY tolerance, and it is path-scoped (not a blanket `|| true`): patch +# 0019 carries a stray *modify* hunk against the dev-only doc +# SSM_DECODE_FIX_RESULTS.md, a file that exists only on the DGX dev tree and is +# absent from any clean upstream checkout. `git apply` is atomic, so that single +# missing-file hunk rejects the whole patch - and because 0021/0022/0026/0028 +# build on 0019's code, the rejection cascades to them too. This is a +# PRE-EXISTING shipped-series defect, present identically on every pin, NOT an +# upstream break (see backend/cpp/llama-cpp-localai-paged/README.md section 7, +# "Pin + maintenance policy"). We exclude ONLY that dev-doc path and still +# apply 0019's real code hunks atomically, so a genuine code-hunk break in 0019 +# still fails the canary. prepare.sh tolerates the same hunk via +# `patch ... || true`; this mirrors that tolerance precisely. + +set -euo pipefail + +CHECKOUT="${1:?usage: paged-canary-apply.sh CHECKOUT PATCHES}" +PATCHES="${2:?usage: paged-canary-apply.sh CHECKOUT PATCHES}" + +# The lone tolerated dev-doc, and the only patch allowed to carry it. +DEVDOC_GLOB='*SSM_DECODE_FIX_RESULTS.md' +DEVDOC_PATCH='0019-qwen35-ssm-decode-fused-gather.patch' + +# Resolve to absolute paths so the apply works after we cd into the checkout. +PATCHES="$(cd "$PATCHES" && pwd)" +cd "$CHECKOUT" + +shopt -s nullglob + +apply_one() { + local p="$1"; shift + echo "paged-canary: applying $(basename "$p")" + if ! 
git apply --verbose "$@" "$p"; then + echo "::error::paged patch no longer applies to the upstream llama.cpp tip: $(basename "$p")" + echo "::error::upstream drifted past the vendored paged series - run a PIN_SYNC (see backend/cpp/llama-cpp-localai-paged/README.md section 7, Pin + maintenance policy), do NOT bump the pin blindly" + exit 1 + fi +} + +# Base series first (parity with the build: patches/0*.patch before +# patches/paged/0*.patch). Currently empty; nullglob makes this a no-op. +for p in "$PATCHES"/0*.patch; do + apply_one "$p" +done + +# Paged series, in order. +for p in "$PATCHES"/paged/0*.patch; do + if [ "$(basename "$p")" = "$DEVDOC_PATCH" ]; then + # Apply 0019's real code hunks; exclude ONLY the benign dev-doc hunk. + apply_one "$p" --exclude="$DEVDOC_GLOB" + else + apply_one "$p" + fi +done + +echo "paged-canary: the full paged patch series applied cleanly to the upstream tip" diff --git a/.github/workflows/backend_build_darwin.yml b/.github/workflows/backend_build_darwin.yml index c0ded5b85a9f..36b8b393fe67 100644 --- a/.github/workflows/backend_build_darwin.yml +++ b/.github/workflows/backend_build_darwin.yml @@ -169,14 +169,14 @@ jobs: # invalidates cleanly; restore-keys fall back to the latest entry for the # same pin so unchanged TUs stay warm even when the cache is fresh. - name: Compute llama.cpp version - if: inputs.backend == 'llama-cpp' + if: inputs.backend == 'llama-cpp' || inputs.backend == 'llama-cpp-localai-paged' id: llama-version run: | version=$(grep '^LLAMA_VERSION' backend/cpp/llama-cpp/Makefile | head -1 | cut -d= -f2 | cut -d'?' 
-f1 | tr -d ' ') echo "version=${version}" >> "$GITHUB_OUTPUT" - name: Restore ccache - if: inputs.backend == 'llama-cpp' + if: inputs.backend == 'llama-cpp' || inputs.backend == 'llama-cpp-localai-paged' id: ccache-cache uses: actions/cache/restore@v4 with: @@ -186,7 +186,7 @@ jobs: ccache-llama-${{ runner.arch }}-${{ steps.llama-version.outputs.version }}- - name: Configure ccache - if: inputs.backend == 'llama-cpp' + if: inputs.backend == 'llama-cpp' || inputs.backend == 'llama-cpp-localai-paged' run: | mkdir -p "$HOME/Library/Caches/ccache" ccache -M 2G @@ -251,9 +251,14 @@ jobs: BACKEND=${{ inputs.backend }} BUILD_TYPE=${{ inputs.build-type }} USE_PIP=${{ inputs.use-pip }} make build-darwin-${{ inputs.lang }}-backend - name: ccache stats - if: inputs.backend == 'llama-cpp' + if: inputs.backend == 'llama-cpp' || inputs.backend == 'llama-cpp-localai-paged' run: ccache -s + # Only stock llama-cpp persists the ccache: both backends share the same + # ccache-llama--- key, so the paged job restores from + # the shared prefix (warm) but must NOT also save under the identical key in + # the same run (it would collide). The shared upstream TUs stay warm via the + # stock save; the paged-only patched TUs are a small recompile. - name: Save ccache if: inputs.backend == 'llama-cpp' && github.event_name != 'pull_request' uses: actions/cache/save@v4 diff --git a/.github/workflows/bump_deps.yaml b/.github/workflows/bump_deps.yaml index afbe55b0b648..90861a4024d1 100644 --- a/.github/workflows/bump_deps.yaml +++ b/.github/workflows/bump_deps.yaml @@ -9,6 +9,23 @@ jobs: strategy: fail-fast: false matrix: + # NOTE: there is intentionally NO entry for the llama-cpp-localai-paged + # backend. It carries a vendored paged-attention patch series + # (backend/cpp/llama-cpp-localai-paged/patches/paged/) hand-verified bit-exact against + # ONE specific llama.cpp tip; a naive nightly bump would move the tip out + # from under the patches and break `git apply` at build time. 
Its pin is + # therefore decoupled (its own LLAMA_VERSION in + # backend/cpp/llama-cpp-localai-paged/Makefile) and advanced ONLY by the + # manual PIN_SYNC process. Do not add it here. (turboquant CAN be + # auto-bumped below because its fork branch carries the patches.) + # + # Excluding it from the auto-bumper removed the early warning of upstream + # drift; that signal is restored separately by the dedicated canary + # .github/workflows/llama-cpp-paged-canary.yml, which weekly applies + + # compiles the paged series against the latest llama.cpp tip and goes red + # when upstream breaks it (prompting a PIN_SYNC). The canary is + # signal-only - it never opens a bump PR and never moves the pin - so + # this dep-bump workflow and its PRs stay green regardless. include: - repository: "ggml-org/llama.cpp" variable: "LLAMA_VERSION" diff --git a/.github/workflows/llama-cpp-paged-canary.yml b/.github/workflows/llama-cpp-paged-canary.yml new file mode 100644 index 000000000000..b79db5441768 --- /dev/null +++ b/.github/workflows/llama-cpp-paged-canary.yml @@ -0,0 +1,179 @@ +name: 'llama.cpp paged patches: upstream canary' + +# EARLY-WARNING CANARY for the vendored paged-attention patch series +# (backend/cpp/llama-cpp-localai-paged/patches/paged/0001-0030). +# +# WHY THIS EXISTS +# The paged backend (backend/cpp/llama-cpp-localai-paged) pins its OWN verified +# llama.cpp tip (LLAMA_VERSION in backend/cpp/llama-cpp-localai-paged/Makefile) +# and is intentionally EXCLUDED from the nightly auto-bumper +# (.github/workflows/bump_deps.yaml), so a naive upstream bump can never silently +# break the shipped build. The cost of that safety: nobody finds out when +# upstream DRIFTS past the patches. This canary restores that signal WITHOUT +# touching the shipped pin - weekly it tries the patch series + a real compile +# against the LATEST llama.cpp master tip and goes red the moment upstream breaks +# the patches. 
+# +# RED HERE means: time to run a PIN_SYNC (rebase the patches onto the new tip, +# pass the bit-exact gate on the GPU, re-export the .patch files, THEN advance +# the pin in backend/cpp/llama-cpp-localai-paged/Makefile). See the backend README +# section 7 (Pin + maintenance policy): +# backend/cpp/llama-cpp-localai-paged/README.md. +# +# SIGNAL-ONLY: this workflow moves no pinned version, ships nothing, and is fully +# decoupled from bump_deps - so the main dep-bump PR stays green regardless. A +# green run means "the paged series still applies and compiles on upstream HEAD"; +# a red run means "upstream moved - schedule a pin-sync". + +on: + schedule: + # Weekly (Mondays 06:00 UTC), mirroring the weekly DEPS_REFRESH / bump_deps + # cadence. Offset from bump_deps' nightly 20:00 so the two never pile up. + - cron: '0 6 * * 1' + workflow_dispatch: + +permissions: + contents: read + +concurrency: + group: llama-cpp-paged-canary + cancel-in-progress: false + +env: + # Upstream source of truth - the same repo/branch bump_deps tracks for the + # stock llama-cpp pin. + LLAMA_UPSTREAM: 'https://github.com/ggml-org/llama.cpp' + +jobs: + apply-check: + # Cheap, fast, toolchain-free early warning: does the series still APPLY to + # the latest upstream tip? A patch no longer applying is by far the most + # common way upstream breaks a vendored series, so this runs first, is + # reliable on a free runner, and feeds the resolved tip to the compile job. 
+ if: github.repository == 'mudler/LocalAI' + runs-on: ubuntu-latest + timeout-minutes: 20 + outputs: + tip: ${{ steps.resolve.outputs.tip }} + steps: + - name: Checkout LocalAI + uses: actions/checkout@v7 + + - name: Resolve latest llama.cpp master tip + id: resolve + run: | + tip="$(git ls-remote "$LLAMA_UPSTREAM" refs/heads/master | cut -f1)" + if [ -z "$tip" ]; then + echo "::error::could not resolve llama.cpp master tip from $LLAMA_UPSTREAM" + exit 1 + fi + pin="$(grep -m1 'LLAMA_VERSION?=' backend/cpp/llama-cpp-localai-paged/Makefile | cut -d= -f2)" + echo "latest llama.cpp master tip: $tip" + echo "shipped paged pin: $pin" + echo "tip=$tip" >> "$GITHUB_OUTPUT" + { + echo "## llama.cpp paged canary" + echo "" + echo "- upstream master tip: \`$tip\`" + echo "- shipped paged pin: \`$pin\`" + } >> "$GITHUB_STEP_SUMMARY" + + - name: Checkout llama.cpp at latest tip (shallow) + run: | + mkdir -p /tmp/llama.cpp + cd /tmp/llama.cpp + git init -q + git remote add origin "$LLAMA_UPSTREAM" + git fetch -q --depth 1 origin "${{ steps.resolve.outputs.tip }}" + git checkout -q FETCH_HEAD + git log --oneline -1 + + - name: Apply paged patch series (build's git-apply method) + run: | + bash .github/scripts/paged-canary-apply.sh \ + /tmp/llama.cpp \ + "$PWD/backend/cpp/llama-cpp-localai-paged/patches" + echo "- apply: full paged series applies to the upstream tip :white_check_mark:" >> "$GITHUB_STEP_SUMMARY" + + compile: + # Proves the patches still COMPILE against the latest tip, using the SAME + # toolchain + build target the shipped paged backend uses (the + # base-grpc-cuda-12 builder base + the Makefile `grpc-server` cublas target), + # so a failure means upstream drift, not toolchain noise. CUDA is compiled + # (nvcc; no GPU required) because most of the paged series is CUDA kernels. + # Runs only if the apply check passed, on the exact tip it validated. 
+ # + # If a full CUDA compile on the hosted runner ever proves too heavy/flaky, + # switch `runs-on` to 'bigger-runner' (the runner class the real paged CUDA + # build uses), or drop to a CPU build (BUILD_TYPE='') which still compiles + # all host + CPU paged code, leaving CUDA-kernel coverage to the apply check + # plus the manual PIN_SYNC GPU gate. + needs: apply-check + if: github.repository == 'mudler/LocalAI' + runs-on: ubuntu-latest + timeout-minutes: 180 + steps: + - name: Checkout LocalAI + uses: actions/checkout@v7 + + - name: Free disk space + uses: ./.github/actions/free-disk-space + with: + mode: hosted + + - name: Login to Quay.io + uses: docker/login-action@v4 + with: + registry: quay.io + username: ${{ secrets.LOCALAI_REGISTRY_USERNAME }} + password: ${{ secrets.LOCALAI_REGISTRY_PASSWORD }} + + - name: Compile paged backend against latest tip (cublas) + env: + TIP: ${{ needs.apply-check.outputs.tip }} + BUILDER_BASE_IMAGE: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-12-amd64' + run: | + docker run --rm \ + -v "$PWD":/LocalAI -w /LocalAI \ + -e TIP -e LLAMA_UPSTREAM \ + "$BUILDER_BASE_IMAGE" bash -euxo pipefail -c ' + # Mirror the Dockerfile: gRPC lives at /opt/grpc in the base image; + # copy it to the prefix CMake find_package expects. + cp -a /opt/grpc/. /usr/local/ + + # Pre-populate the llama.cpp checkout at the latest tip with the + # paged series applied via the tolerant canary apply. Because + # backend/cpp/llama-cpp/llama.cpp now exists, the stock Makefile's + # llama.cpp target (clone + base-patch apply) is skipped and the + # now patch-free prepare.sh only copies the grpc-server sources - + # so we drive the REAL grpc-server build path on top of our paged + # apply. 
The stock llama-cpp backend no longer carries the paged + # series (it lives in backend/cpp/llama-cpp-localai-paged/patches/ + # paged); we build it here in the stock dir only because that is + # where the shared build infra (Makefile / grpc-server.cpp / + # CMakeLists.txt / prepare.sh) lives. + cd backend/cpp/llama-cpp/ + mkdir -p llama.cpp + cd llama.cpp + git init -q + git remote add origin "$LLAMA_UPSTREAM" + git fetch -q --depth 1 origin "$TIP" + git checkout -q FETCH_HEAD + cd /LocalAI + bash .github/scripts/paged-canary-apply.sh \ + backend/cpp/llama-cpp/llama.cpp \ + "$PWD/backend/cpp/llama-cpp-localai-paged/patches" + + # Cheapest real CUDA build that proves the patches compile: one + # CUDA arch, cublas. CMAKE_ARGS is passed via the environment (not + # as a make arg) so the Makefile += flags are still appended, + # exactly like .docker/llama-cpp-localai-paged-compile.sh. The paged + # series is already applied to the checkout above, so the stock + # build just compiles the patched tree. + cd backend/cpp/llama-cpp/ + BUILD_TYPE=cublas \ + CMAKE_ARGS="-DCMAKE_CUDA_ARCHITECTURES=80" \ + make grpc-server + test -x grpc-server + ' + echo "- compile: paged series builds (cublas) against the upstream tip :white_check_mark:" >> "$GITHUB_STEP_SUMMARY" diff --git a/.gitignore b/.gitignore index 91582c006bf3..567843acb1e8 100644 --- a/.gitignore +++ b/.gitignore @@ -9,6 +9,15 @@ prepare-sources /backend/cpp/llama-cpp/llama.cpp /backend/cpp/llama-* !backend/cpp/llama-cpp +# llama-cpp-localai-paged is a tracked source dir (a thin wrapper Makefile over +# backend/cpp/llama-cpp). Re-include it like llama-cpp above; its sibling +# *-build dirs are still ignored by the /backend/cpp/llama-* rule, and its +# in-dir build artifacts (binaries, package output, collected ggml .so set) are +# re-ignored just below. 
+!backend/cpp/llama-cpp-localai-paged +/backend/cpp/llama-cpp-localai-paged/llama-cpp-localai-paged-* +/backend/cpp/llama-cpp-localai-paged/package +/backend/cpp/llama-cpp-localai-paged/ggml-shared-libs /backends /backend-images /result.yaml diff --git a/AGENTS.md b/AGENTS.md index 1095ef5316e3..dd2d59f5ddcc 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -23,6 +23,8 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants] | [.agents/adding-backends.md](.agents/adding-backends.md) | Adding a new backend (Python, Go, or C++) — full step-by-step checklist, including importer integration (the `/import-model` dropdown is server-driven from `GET /backends/known`) | | [.agents/coding-style.md](.agents/coding-style.md) | Code style, editorconfig, logging, documentation conventions | | [.agents/llama-cpp-backend.md](.agents/llama-cpp-backend.md) | Working on the llama.cpp backend — architecture, updating, tool call parsing | +| [.agents/llama-cpp-localai-paged-backend.md](.agents/llama-cpp-localai-paged-backend.md) | Working on the CUDA-only paged-attention llama.cpp variant (Qwen3.6 hybrid-SSM / Blackwell NVFP4 decode) - patchset scope, the bit-exact gate, the manual pin-sync + weekly canary, CUDA-only invariants, stock-stays-pure, Metal/SYCL/Vulkan follow-up scope | +| [.agents/vllm-parity-methodology.md](.agents/vllm-parity-methodology.md) | The methodology for closing the vLLM decode-throughput gap in llama.cpp - bit-exact gating, profile-don't-assume, both-engine ground-truth, per-lever A/B discipline, recording rejected levers, multi-agent GPU orchestration | | [.agents/vllm-backend.md](.agents/vllm-backend.md) | Working on the vLLM / vLLM-omni backends — native parsers, ChatDelta, CPU build, libnuma packaging, backend hooks | | [.agents/sglang-backend.md](.agents/sglang-backend.md) | Working on the SGLang backend — `engine_args` validation against ServerArgs, speculative-decoding (EAGLE/EAGLE3/DFLASH/MTP) recipes, parser handling | | 
[.agents/ds4-backend.md](.agents/ds4-backend.md) | Working on the ds4 backend - DSML state machine, thinking modes, KV cache, Metal+CUDA matrix | @@ -37,6 +39,7 @@ LocalAI follows the Linux kernel project's [guidelines for AI coding assistants] - **Git hooks & coverage gates**: Run `make install-hooks` once per clone so the pre-commit lint + coverage gates run. **Never bypass them with `git commit --no-verify`, and never lower a coverage baseline or widen a gate's tolerance to turn a red gate green** — the coverage ratchet only moves up. If a change drops coverage, add tests to raise it (e.g. render-smoke specs). See [.agents/building-and-testing.md](.agents/building-and-testing.md). - **Logging**: Use `github.com/mudler/xlog` (same API as slog) +- **Paged llama.cpp backend**: `llama-cpp-localai-paged` is a CUDA-only variant that owns its own patch series + its own pinned llama.cpp (manual pin-sync, weekly canary); the stock `llama-cpp` backend stays patch-free. Read [.agents/llama-cpp-localai-paged-backend.md](.agents/llama-cpp-localai-paged-backend.md) before touching either, and [.agents/vllm-parity-methodology.md](.agents/vllm-parity-methodology.md) for the decode-parity methodology behind it. 
- **Go style**: Prefer `any` over `interface{}` - **Comments**: Explain *why*, not *what* - **Docs**: Update `docs/content/` when adding features or changing config diff --git a/Makefile b/Makefile index 2a8edc3fc01e..0bf77a668da6 100644 --- a/Makefile +++ b/Makefile @@ -1,5 +1,5 @@ # Disable parallel execution for backend builds -.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/omnivoice-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio backends/supertonic backends/depth-anything-cpp backends/privacy-filter backends/privacy-filter-darwin +.NOTPARALLEL: backends/diffusers backends/llama-cpp backends/turboquant backends/outetts backends/piper backends/stablediffusion-ggml backends/whisper backends/crispasr backends/parakeet-cpp backends/faster-whisper backends/silero-vad backends/local-store backends/huggingface backends/rfdetr backends/rfdetr-cpp backends/insightface backends/speaker-recognition 
backends/kitten-tts backends/kokoro backends/chatterbox backends/llama-cpp-darwin backends/neutts build-darwin-python-backend build-darwin-go-backend backends/mlx backends/diffuser-darwin backends/mlx-vlm backends/mlx-audio backends/mlx-distributed backends/stablediffusion-ggml-darwin backends/vllm backends/vllm-omni backends/sglang backends/moonshine backends/pocket-tts backends/qwen-tts backends/faster-qwen3-tts backends/qwen-asr backends/nemo backends/voxcpm backends/whisperx backends/ace-step backends/acestep-cpp backends/fish-speech backends/voxtral backends/opus backends/trl backends/llama-cpp-quantization backends/kokoros backends/sam3-cpp backends/qwen3-tts-cpp backends/omnivoice-cpp backends/vibevoice-cpp backends/localvqe backends/tinygrad backends/sherpa-onnx backends/ds4 backends/ds4-darwin backends/liquid-audio backends/supertonic backends/depth-anything-cpp backends/privacy-filter backends/privacy-filter-darwin backends/llama-cpp-localai-paged GOCMD=go GOTEST=$(GOCMD) test @@ -671,6 +671,15 @@ test-extra-backend-llama-cpp: docker-build-llama-cpp test-extra-backend-ik-llama-cpp: docker-build-ik-llama-cpp BACKEND_IMAGE=local-ai-backend:ik-llama-cpp $(MAKE) test-extra-backend +## llama-cpp-localai-paged: the LocalAI paged-attention llama.cpp variant. Same +## GGUF surface as stock llama-cpp (the paged engine is runtime-gated by the +## LLAMA_KV_PAGED env the grpc-server option hooks set), so the standard +## llama-cpp capability set is what we exercise here. +test-extra-backend-llama-cpp-localai-paged: docker-build-llama-cpp-localai-paged + BACKEND_IMAGE=local-ai-backend:llama-cpp-localai-paged \ + BACKEND_TEST_CAPS=health,load,predict,stream,logprobs,logit_bias \ + $(MAKE) test-extra-backend + ## turboquant: exercises the llama.cpp-fork backend with the fork's ## *TurboQuant-specific* KV-cache types (turbo3 for both K and V). 
turbo3 ## is what makes this backend distinct from stock llama-cpp — picking q8_0 @@ -1181,6 +1190,10 @@ BACKEND_IK_LLAMA_CPP = ik-llama-cpp|ik-llama-cpp|.|false|false # turboquant is a llama.cpp fork with TurboQuant KV-cache quantization. # Reuses backend/cpp/llama-cpp grpc-server sources via a thin wrapper Makefile. BACKEND_TURBOQUANT = turboquant|turboquant|.|false|false +# llama-cpp-localai-paged = stock llama.cpp grpc-server + the LocalAI paged-attention +# patch series (vendored in this wrapper backend). Reuses backend/cpp/llama-cpp sources via a thin +# wrapper Makefile (same upstream pin as stock llama-cpp; no fork, no patch-grpc-server). +BACKEND_LLAMA_CPP_LOCALAI_PAGED = llama-cpp-localai-paged|llama-cpp-localai-paged|.|false|false # ds4 is antirez/ds4, a DeepSeek V4 Flash-specific inference engine. # Single-model; hardware-only validation lives at tests/e2e-backends/ # (BACKEND_BINARY mode); see docs/superpowers/plans/2026-05-11-ds4-backend.md. @@ -1282,6 +1295,7 @@ endef $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP))) $(eval $(call generate-docker-build-target,$(BACKEND_IK_LLAMA_CPP))) $(eval $(call generate-docker-build-target,$(BACKEND_TURBOQUANT))) +$(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP_LOCALAI_PAGED))) $(eval $(call generate-docker-build-target,$(BACKEND_DS4))) $(eval $(call generate-docker-build-target,$(BACKEND_PRIVACY_FILTER))) $(eval $(call generate-docker-build-target,$(BACKEND_PIPER))) @@ -1345,7 +1359,7 @@ $(eval $(call generate-docker-build-target,$(BACKEND_SUPERTONIC))) docker-save-%: backend-images docker save local-ai-backend:$* -o backend-images/$*.tar -docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr 
docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-omnivoice-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy docker-build-supertonic docker-build-depth-anything-cpp docker-build-privacy-filter +docker-build-backends: docker-build-llama-cpp docker-build-ik-llama-cpp docker-build-turboquant docker-build-llama-cpp-localai-paged docker-build-ds4 docker-build-rerankers docker-build-vllm docker-build-vllm-omni docker-build-sglang docker-build-transformers docker-build-outetts docker-build-diffusers docker-build-kokoro docker-build-faster-whisper docker-build-crispasr docker-build-coqui docker-build-chatterbox docker-build-vibevoice docker-build-liquid-audio docker-build-moonshine docker-build-pocket-tts docker-build-qwen-tts docker-build-fish-speech docker-build-faster-qwen3-tts docker-build-qwen-asr docker-build-nemo docker-build-voxcpm docker-build-whisperx docker-build-ace-step docker-build-acestep-cpp docker-build-voxtral docker-build-mlx-distributed docker-build-trl docker-build-llama-cpp-quantization docker-build-tinygrad docker-build-kokoros docker-build-sam3-cpp docker-build-rfdetr-cpp docker-build-qwen3-tts-cpp docker-build-omnivoice-cpp docker-build-vibevoice-cpp docker-build-localvqe docker-build-insightface docker-build-speaker-recognition docker-build-sherpa-onnx docker-build-cloud-proxy docker-build-supertonic 
docker-build-depth-anything-cpp docker-build-privacy-filter ######################################################## ### Mock Backend for E2E Tests diff --git a/backend/Dockerfile.llama-cpp-localai-paged b/backend/Dockerfile.llama-cpp-localai-paged new file mode 100644 index 000000000000..03dc913bf31c --- /dev/null +++ b/backend/Dockerfile.llama-cpp-localai-paged @@ -0,0 +1,163 @@ +ARG BASE_IMAGE=ubuntu:24.04 +# BUILDER_BASE_IMAGE defaults to BASE_IMAGE so the Dockerfile parses even +# when no prebuilt base is supplied. The builder-prebuilt stage is only +# entered when BUILDER_TARGET=builder-prebuilt, so a "wrong" fallback +# content here is harmless — BuildKit prunes the unreferenced builder. +ARG BUILDER_BASE_IMAGE=${BASE_IMAGE} +# BUILDER_TARGET selects which builder stage the final scratch image copies +# package output from. Declared at global scope (before any FROM) so it's +# usable in `FROM ${BUILDER_TARGET}` below. Default keeps local +# `make backends/llama-cpp-localai-paged` on the from-source path. +ARG BUILDER_TARGET=builder-fromsource +ARG APT_MIRROR="" +ARG APT_PORTS_MIRROR="" + + +# ============================================================================ +# Stage: builder-fromsource — self-contained build path. +# Runs .docker/install-base-deps.sh (apt deps + cmake + protoc + gRPC + +# conditional CUDA/ROCm/Vulkan), copies /opt/grpc to /usr/local, then +# compiles the variant. Used when BUILDER_TARGET=builder-fromsource (the +# default; local `make backends/llama-cpp-localai-paged`). +# +# The install script is the same one that backend/Dockerfile.base-grpc-builder +# runs, so the result is bit-equivalent to the prebuilt-base path +# (builder-prebuilt below). 
+# ============================================================================ +FROM ${BASE_IMAGE} AS builder-fromsource +ARG BUILD_TYPE +ARG CUDA_MAJOR_VERSION +ARG CUDA_MINOR_VERSION +ARG CMAKE_FROM_SOURCE=false +# CUDA Toolkit 13.x compatibility: CMake 3.31.9+ fixes toolchain detection/arch table issues +ARG CMAKE_VERSION=3.31.10 +ARG GRPC_VERSION=v1.65.0 +ARG GRPC_MAKEFLAGS="-j4 -Otarget" +ARG SKIP_DRIVERS=false +ARG TARGETARCH +ARG TARGETVARIANT +ARG GO_VERSION=1.25.4 +ARG UBUNTU_VERSION=2404 +ARG APT_MIRROR +ARG APT_PORTS_MIRROR +ARG AMDGPU_TARGETS="" +ARG BACKEND=rerankers +# CUDA target archs, e.g. --build-arg CUDA_DOCKER_ARCH='75;86;89;120' +ARG CUDA_DOCKER_ARCH +ARG CMAKE_ARGS + +ENV BUILD_TYPE=${BUILD_TYPE} \ + CUDA_MAJOR_VERSION=${CUDA_MAJOR_VERSION} \ + CUDA_MINOR_VERSION=${CUDA_MINOR_VERSION} \ + CMAKE_FROM_SOURCE=${CMAKE_FROM_SOURCE} \ + CMAKE_VERSION=${CMAKE_VERSION} \ + GRPC_VERSION=${GRPC_VERSION} \ + GRPC_MAKEFLAGS=${GRPC_MAKEFLAGS} \ + SKIP_DRIVERS=${SKIP_DRIVERS} \ + TARGETARCH=${TARGETARCH} \ + UBUNTU_VERSION=${UBUNTU_VERSION} \ + APT_MIRROR=${APT_MIRROR} \ + APT_PORTS_MIRROR=${APT_PORTS_MIRROR} \ + AMDGPU_TARGETS=${AMDGPU_TARGETS} \ + CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH} \ + CMAKE_ARGS=${CMAKE_ARGS} \ + DEBIAN_FRONTEND=noninteractive + +# CUDA on PATH (no-op when CUDA isn't installed) +ENV PATH=/usr/local/cuda/bin:${PATH} +# HipBLAS / ROCm on PATH (no-op when ROCm isn't installed) +ENV PATH=/opt/rocm/bin:${PATH} + +WORKDIR /build + +# Install everything via the shared script — the same one that +# backend/Dockerfile.base-grpc-builder runs, so the prebuilt CI base and +# this from-source path are bit-equivalent. 
+RUN --mount=type=bind,source=.docker/install-base-deps.sh,target=/usr/local/sbin/install-base-deps \ + --mount=type=bind,source=.docker/apt-mirror.sh,target=/usr/local/sbin/apt-mirror \ + bash /usr/local/sbin/install-base-deps + +# Mirror builder-prebuilt: copy gRPC from /opt/grpc to /usr/local so +# CMake's find_package finds it at the canonical prefix the Makefile expects. +RUN cp -a /opt/grpc/. /usr/local/ + +COPY . /LocalAI + +# BuildKit cache mount for ccache. See Dockerfile.llama-cpp (commit 9228e5b4) +# for rationale. llama-cpp-localai-paged is the SAME upstream llama.cpp with +# the LocalAI paged patch series applied; it reuses backend/cpp/llama-cpp +# source via a thin wrapper Makefile, so MOST TUs are content-identical to the +# stock llama-cpp build. Sharing a cache id with llama-cpp could give +# cross-variant hits — but for now keep them separate (mirroring turboquant) so +# a regression in one doesn't poison the other. Revisit sharing after measuring +# the actual hit rate. +# +# The compile body is shared with builder-prebuilt via .docker/llama-cpp-localai-paged-compile.sh. +RUN --mount=type=bind,source=.docker/llama-cpp-localai-paged-compile.sh,target=/usr/local/sbin/compile.sh \ + --mount=type=cache,target=/root/.ccache,id=llama-cpp-localai-paged-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \ + bash /usr/local/sbin/compile.sh + + +# Copy libraries using a script to handle architecture differences +RUN make -BC /LocalAI/backend/cpp/llama-cpp-localai-paged package + + +# ============================================================================ +# Stage: builder-prebuilt — uses the pre-built base from +# quay.io/go-skynet/ci-cache:base-grpc-* (built by .github/workflows/base-images.yml). +# That image already has gRPC at /opt/grpc + apt deps + CUDA/ROCm/Vulkan +# pre-installed, so we just copy gRPC to /usr/local and compile. Used when +# BUILDER_TARGET=builder-prebuilt (CI when the matrix entry sets +# builder-base-image). 
llama-cpp-localai-paged reuses the SAME base-grpc-* tags +# as the stock llama-cpp backend (same gRPC + same toolchain), so no new +# base-images.yml variant is required. +# ============================================================================ +FROM ${BUILDER_BASE_IMAGE} AS builder-prebuilt + +ARG BUILD_TYPE +ENV BUILD_TYPE=${BUILD_TYPE} +ARG CUDA_DOCKER_ARCH +ENV CUDA_DOCKER_ARCH=${CUDA_DOCKER_ARCH} +ARG CMAKE_ARGS +ENV CMAKE_ARGS=${CMAKE_ARGS} +# AMDGPU_TARGETS must be forwarded into the env here too — backend/cpp/llama-cpp/Makefile +# (which the llama-cpp-localai-paged Makefile reuses via a sibling build dir) errors out +# when the var is empty on a hipblas build, and the prebuilt path is what CI exercises most +# of the time. The builder-fromsource stage above already does this; mirror it here. +ARG AMDGPU_TARGETS +ENV AMDGPU_TARGETS=${AMDGPU_TARGETS} +ARG TARGETARCH +ARG TARGETVARIANT + +# The base-grpc-* image installs gRPC to /opt/grpc but doesn't copy it to +# /usr/local. Mirror what the from-source path does so the compile step +# can find gRPC at the canonical prefix the Makefile expects. +RUN cp -a /opt/grpc/. /usr/local/ + +COPY . /LocalAI + +RUN --mount=type=bind,source=.docker/llama-cpp-localai-paged-compile.sh,target=/usr/local/sbin/compile.sh \ + --mount=type=cache,target=/root/.ccache,id=llama-cpp-localai-paged-ccache-${TARGETARCH}-${BUILD_TYPE},sharing=locked \ + bash /usr/local/sbin/compile.sh + +RUN make -BC /LocalAI/backend/cpp/llama-cpp-localai-paged package + + +# ============================================================================ +# Final stage — copies package output from one of the two builders. +# BUILDER_TARGET selects which one. BuildKit prunes the unreferenced builder. +# +# BuildKit doesn't support variable expansion in `COPY --from=` directly, +# so we resolve the ARG by aliasing the chosen builder to a fixed stage +# name via `FROM ${BUILDER_TARGET} AS builder` and then COPY --from=builder. 
+# BUILDER_TARGET itself is declared as a global ARG at the top of this +# file (required for use in FROM), so we just re-import it into this +# stage's scope before the FROM directive. +# ============================================================================ +FROM ${BUILDER_TARGET} AS builder + +FROM scratch + + +# Copy all available binaries (the build process only creates the appropriate ones for the target architecture) +COPY --from=builder /LocalAI/backend/cpp/llama-cpp-localai-paged/package/. ./ diff --git a/backend/cpp/llama-cpp-localai-paged/Makefile b/backend/cpp/llama-cpp-localai-paged/Makefile new file mode 100644 index 000000000000..02449d902a49 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/Makefile @@ -0,0 +1,157 @@ + +# llama-cpp-localai-paged is LocalAI's paged-attention llama.cpp variant. It +# builds upstream llama.cpp with the LocalAI paged-attention patch series +# (patches/paged/, vendored in THIS backend) applied on top. It reuses +# backend/cpp/llama-cpp's grpc-server.cpp / CMakeLists.txt / prepare.sh / Makefile +# sources verbatim via a thin wrapper - the stock llama-cpp backend is pure +# upstream and carries NONE of the paged patches; this backend OWNS them. +# +# Pin handling (mirrors the turboquant wrapper, the precedent this is modelled +# on): the paged patch series is hand-verified bit-exact against ONE specific +# llama.cpp tip and re-exported by the manual PIN_SYNC process +# (README section 7 + .agents/llama-cpp-localai-paged-backend.md). A naive +# pin bump would move the tip out from +# under the patches and break `git apply` at build time, so this backend OWNS +# its pin (LLAMA_VERSION below) instead of inheriting the auto-bumped stock pin +# from backend/cpp/llama-cpp/Makefile. The override is forced into every copied +# build via `LLAMA_VERSION=$(LLAMA_VERSION)`. There is deliberately NO +# bump_deps.yaml entry for it: it is advanced ONLY by PIN_SYNC, never nightly. 
+# (turboquant CAN auto-bump because its fork branch carries the patches; the +# paged series is vendored as .patch files here, so it cannot.) +# +# - NO patch-grpc-server.sh and NO apply-patches.sh: the shared grpc-server.cpp +# already carries the (runtime-gated) paged option hooks, and the paged patch +# series (patches/paged/) is applied by THIS Makefile's own apply step onto +# the freshly cloned tree, using the same strict `git apply` method the stock +# build uses for base patches. The stock llama-cpp Makefile applies only its +# own (currently empty) base patches/ series, never the paged one. + +# Manually pin-synced llama.cpp tip the paged patch series is verified against. +# Decoupled from the auto-bumped stock pin in backend/cpp/llama-cpp/Makefile so +# the nightly llama.cpp bump cannot silently break the vendored paged patches. +# Advance ONLY via the PIN_SYNC process (rebase patches + bit-exact gate + +# re-export), then update this value. See: +# README section 7 + .agents/llama-cpp-localai-paged-backend.md +# +# This pin = the manual, verified sync. The signal telling you WHEN to do the +# next sync is the early-warning canary +# (.github/workflows/llama-cpp-paged-canary.yml): weekly it applies + compiles +# this patch series against the latest upstream llama.cpp tip and goes red the +# moment upstream drifts past the patches. Canary red -> run a PIN_SYNC, then +# bump this value. The canary never touches this pin; it is signal-only. +# +# HARD CONSTRAINT: keep this == the stock llama-cpp pin (backend/cpp/llama-cpp/ +# Makefile). grpc-server.cpp is SHARED with the stock backend and tracks the +# stock pin; a paged pin that diverges PAST an upstream server-API refactor +# breaks the grpc-server LINK even when the patches are byte-for-byte bit-exact. 
+# The c299a92c bump did exactly this: patches applied + greedy-md5 bit-exact, but +# grpc-server.cpp failed to link with undefined references to stream_* server +# helpers that the refactor pulled into the headers grpc-server.cpp includes. +# Therefore a PIN_SYNC must pass the FULL grpc-server build/link on CI, not only +# the bit-exact gate. See README section 7 + .agents/llama-cpp-localai-paged-backend.md. +LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd + +CMAKE_ARGS?= +BUILD_TYPE?= +NATIVE?=false +ONEAPI_VARS?=/opt/intel/oneapi/setvars.sh +TARGET?=--target grpc-server +JOBS?=$(shell nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 1) +ARCH?=$(shell uname -m) + +CURRENT_MAKEFILE_DIR := $(dir $(abspath $(lastword $(MAKEFILE_LIST)))) +LLAMA_CPP_DIR := $(CURRENT_MAKEFILE_DIR)/../llama-cpp +# OUR vendored paged-attention patch series. Owned by this backend; the stock +# llama-cpp backend no longer carries it. Applied onto each freshly cloned +# llama.cpp tree by apply-paged-patches below (strict git apply). +PAGED_PATCHES_DIR := $(CURRENT_MAKEFILE_DIR)/patches/paged + +GREEN := \033[0;32m +RESET := \033[0m + +# Apply OUR vendored paged-attention patch series (patches/paged/0*.patch) onto a +# freshly cloned llama.cpp tree ($(1)) using the SAME strict git-apply method the +# stock build uses for its base patches (backend/cpp/llama-cpp/Makefile `llama.cpp` +# target). Strict: any patch that no longer applies aborts the build (exit 1) - +# that is the signal to run a PIN_SYNC, never to bump the pin blindly. The series +# is owned by THIS backend, not by the now-pure stock llama-cpp backend. +define apply-paged-patches + cd $(1) && \ + for p in $(PAGED_PATCHES_DIR)/0*.patch; do \ + [ -e "$$p" ] || continue; \ + echo "applying llama.cpp PAGED patch: $$p"; \ + git apply --verbose "$$p" || { echo "paged patch failed: $$p"; exit 1; }; \ + done +endef + +# Each flavor target: +# 1. 
copies backend/cpp/llama-cpp/ (grpc-server.cpp + prepare.sh + +# CMakeLists.txt + Makefile) into a sibling +# llama-cpp-localai-paged-<flavor>-build directory; +# 2. clones OUR pinned upstream llama.cpp into that copy via the copy's own +# `llama.cpp` target (which applies the stock base patches/ series, normally +# empty), then applies THIS backend's paged patch series (patches/paged/) +# onto the cloned tree with strict `git apply` (apply-paged-patches); +# 3. runs the copy's `grpc-server` target and copies the produced binary up as +# llama-cpp-localai-paged-<flavor>. +# We clone+patch only the *copy*, never the original under backend/cpp/llama-cpp/, +# so the stock llama-cpp build stays untouched and patch-free. +define paged-build + rm -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build + cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build + $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build purge + $(info $(GREEN)I llama-cpp-localai-paged build info:$(1)$(RESET)) + LLAMA_VERSION=$(LLAMA_VERSION) $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build llama.cpp + $(call apply-paged-patches,$(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build/llama.cpp) + CMAKE_ARGS="$(CMAKE_ARGS) $(2)" TARGET="$(3)" LLAMA_VERSION=$(LLAMA_VERSION) \ + $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build grpc-server + cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-$(1)-build/grpc-server llama-cpp-localai-paged-$(1) +endef + +llama-cpp-localai-paged-avx2: + $(call paged-build,avx2,-DGGML_AVX=on -DGGML_AVX2=on -DGGML_AVX512=off -DGGML_FMA=on -DGGML_F16C=on,--target grpc-server) + +llama-cpp-localai-paged-avx512: + $(call paged-build,avx512,-DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=on -DGGML_FMA=on -DGGML_F16C=on,--target grpc-server) + +llama-cpp-localai-paged-avx: + $(call paged-build,avx,-DGGML_AVX=on -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off
-DGGML_BMI2=off,--target grpc-server) + +llama-cpp-localai-paged-fallback: + $(call paged-build,fallback,-DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server) + +# Single-build CPU backend via ggml CPU_ALL_VARIANTS (mirrors llama-cpp-cpu-all). +# Reuses backend/cpp/llama-cpp's CMakeLists.txt (hw_grpc_proto STATIC) and +# Makefile (SHARED_LIBS make-var + EXTRA_CMAKE_ARGS), so this passes the same +# overrides through to the copied build: SHARED_LIBS=ON, the DL flags, and +# --target ggml (which pulls in the per-microarch libggml-cpu-*.so via ggml's +# add_dependencies). The .so set is collected for package.sh to bundle into +# package/lib. +llama-cpp-localai-paged-cpu-all: + rm -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build + cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build + $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build purge + $(info $(GREEN)I llama-cpp-localai-paged build info:cpu-all-variants$(RESET)) + LLAMA_VERSION=$(LLAMA_VERSION) $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build llama.cpp + $(call apply-paged-patches,$(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build/llama.cpp) + SHARED_LIBS=ON EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" TARGET="--target grpc-server --target ggml" LLAMA_VERSION=$(LLAMA_VERSION) \ + $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build grpc-server + cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build/grpc-server llama-cpp-localai-paged-cpu-all + rm -rf ggml-shared-libs && mkdir -p ggml-shared-libs + find $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-cpu-all-build/llama.cpp/build \( -name '*.so*' -o -name '*.dylib' \) -exec cp -av {} ggml-shared-libs/ \; + @echo "Collected ggml shared backends:" && ls -la ggml-shared-libs/ + +llama-cpp-localai-paged-grpc: + $(call 
paged-build,grpc,-DGGML_RPC=ON -DGGML_AVX=off -DGGML_AVX2=off -DGGML_AVX512=off -DGGML_FMA=off -DGGML_F16C=off -DGGML_BMI2=off,--target grpc-server --target ggml-rpc-server) + +llama-cpp-localai-paged-rpc-server: llama-cpp-localai-paged-grpc + cp -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-grpc-build/llama.cpp/build/bin/ggml-rpc-server llama-cpp-localai-paged-rpc-server + +package: + bash package.sh + +purge: + rm -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-localai-paged-*-build + rm -rf llama-cpp-localai-paged-* package + +clean: purge diff --git a/backend/cpp/llama-cpp-localai-paged/README.md b/backend/cpp/llama-cpp-localai-paged/README.md new file mode 100644 index 000000000000..9a4b81215b47 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/README.md @@ -0,0 +1,699 @@ +# LocalAI paged-attention llama.cpp patch series + +This backend vendors the patch series (in `patches/paged/`) that turns stock +llama.cpp into LocalAI's paged-attention variant (`llama-cpp-localai-paged`). The +patches are applied on top of a pinned upstream llama.cpp at build time; nothing +here is a fork - it is a source-only `*.patch` stack plus this canonical doc. + +> One-file rule: this README is the canonical reference for the patch series. The +> only other docs are operational, kept in `docs/`, and linked below: +> - [`PAGED_BITEXACT_NOTE.md`](docs/PAGED_BITEXACT_NOTE.md) - the per-path bit-exactness gate (the canonical paged-MoE md5 reference). +> - [`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md) - the design-of-record for shipping this as its own backend + the NVFP4 gallery items. +> - [`VLLM_PARITY_FINAL.md`](docs/VLLM_PARITY_FINAL.md) - the definitive, closed record of the GB10 vLLM-parity investigation: full benchmark, every lever + verdict, the structural floors, and the parity verdict (summarized in section 9 below). Read this before reopening any parity work. 
+> - [`EXECUTION_REARCH_SCOPE.md`](docs/EXECUTION_REARCH_SCOPE.md) - the reopened scope: ports vLLM's execution *architecture* (bf16-resident stream, expert-major fused MoE region, persistent-CTA GEMM, token-budget scheduler, blocked-solve GDN) into the fork additively, on the thesis that same-silicon 2-3x is software-architecture-conditional, not a hardware floor. Phased (P1-P6), each with a falsifiable P0 kill-gate. Read this to pick up parity work after `VLLM_PARITY_FINAL.md`. + +--- + +## 1. What it is + +`llama-cpp-localai-paged` is the LocalAI paged-attention llama.cpp backend: a +vendored patch series over upstream llama.cpp that adds + +- a **paged KV cache** (vLLM-style block manager: on-demand fixed-size blocks, + free pool, ref-counted blocks) with a **block-table flash-attention** read so + the attention kernels index physical cells instead of a contiguous buffer; +- **cross-request prefix sharing** - concurrent requests that share a long + prefix physically reuse one committed copy of the prefix blocks and prefill + only their divergent suffix; +- a **decode-first prefill scheduler** - a dynamic per-step prefill-token budget + decoupled from `n_batch`, so a long prefill never freezes co-batched decode; +- **GB10 / Blackwell NVFP4 decode optimizations** for the Qwen3.6 hybrid + gated-DeltaNet (SSM) models, where the recurrent-state plumbing - not the FP4 + GEMM - dominates the decode step. + +It is **pinned to llama.cpp `0ed235ea2c17a19fc8238668653946721ed136fd`** (kept == the stock `llama-cpp` backend's +pin) and advanced only by a manual, bit-exact-gated pin-sync process (see +section 7, "Pin + maintenance policy"), decoupled from the nightly auto-bumper. The pin must stay aligned with the stock pin because +`grpc-server.cpp` is shared; an earlier bump to `c299a92c` was bit-exact but broke +the grpc-server link and was reverted to the then-current stock pin. 
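
As a rough illustration of the block-manager idea in the feature list above (on-demand fixed-size blocks, a free pool, ref-counted blocks reused across requests), the following Python sketch may help. It is not the vendored C++ manager: the class `ToyBlockPool`, its methods `alloc` / `share` / `release`, and the pool size are all hypothetical names chosen for this example.

```python
from collections import deque


class ToyBlockPool:
    """Toy ref-counted pool of fixed-size KV blocks (hypothetical API)."""

    def __init__(self, n_blocks: int) -> None:
        self.free = deque(range(n_blocks))  # free pool of physical block ids
        self.refcount = [0] * n_blocks      # number of sequences holding each block

    def alloc(self) -> int:
        """Pop a block on demand; raises IndexError when the pool is empty."""
        block = self.free.popleft()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> int:
        """Let another sequence reuse an already-committed block."""
        assert self.refcount[block] > 0, "cannot share a free block"
        self.refcount[block] += 1
        return block

    def release(self, block: int) -> None:
        """Drop one reference; the block returns to the pool at zero."""
        assert self.refcount[block] > 0, "double release"
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            self.free.append(block)


pool = ToyBlockPool(n_blocks=4)
prefix = [pool.alloc(), pool.alloc()]                     # request A commits a 2-block prefix
seq_a = prefix + [pool.alloc()]                           # A's private suffix block
seq_b = [pool.share(b) for b in prefix] + [pool.alloc()]  # B reuses the prefix blocks

for b in seq_a:                                           # A finishes first
    pool.release(b)
free_after_a = len(pool.free)                             # only A's suffix block is back
```

The point of the ref-count in this sketch is that the shared prefix blocks survive request A finishing, because request B still holds them; only A's private suffix block returns to the free pool.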
+ +The build gate is `LLAMA_PAGED` (default on in this tree); the paged engine is +enabled per-model at runtime via the gallery `options:` knobs (`paged_kv:true`, +`max_batch_tokens:`, `kv_unified:false`, ...). Against unpatched llama.cpp the +runtime hooks are inert, so a single `grpc-server.cpp` is shared between the +clean and the paged build. + +--- + +## 2. Architecture + +The decode step on these models breaks into three cost centers; the patch series +attacks each one. + +**Paged KV manager + block-table flash-attn.** A host-side `PagedKVManager` +(`FreeBlockQueue` / `BlockPool` / chained-hash content cache) hands out +fixed-size KV blocks on demand and reclaims them per-sequence (ref-counted, with +copy-on-write for shared prefixes). The attention path reads through a **block +table** - an `I32 [n_view, n_stream]` position-ordered physical-cell index passed +as `src[5]` of `ggml_flash_attn_ext` - so the CUDA fattn vec/tile kernels and the +CPU reference map logical KV index `j` to physical cell `block_table[seq*ne11+j]` +and read K/V in place. Token-position ordering keeps the flash-attn online-softmax +reduction order identical to stock. A null block table is the stock contiguous +read, byte-identical. + +**The gated-DeltaNet (GDN / SSM) decode path.** The Qwen3.6 hybrid models are 48 +gated-DeltaNet (linear-attention / SSM) layers + 16 full-attention layers. On +GB10 the recurrent-state plumbing, not the weight GEMM, is the dominant decode +cost. The series fuses that plumbing to mirror vLLM's +`fused_recurrent_gated_delta_rule`: the recurrent state is read from and written +to its cache slot in place (no copy-back, no `get_rows` materialization), the +conv state is updated in place, the output projection is reshaped to route to the +tensor-core MMQ GEMM, and the recurrence kernel is occupancy-retuned - all +bit-exact (md5-gateable) against the f32 baseline. 
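
The block-table read described above can be sketched in a few lines of Python. This is an assumption-laden toy, not the CUDA or CPU kernel: `paged_read`, the flat `cells` list, and the example placement are hypothetical, and the only detail taken from the text is the indexing rule `block_table[seq*ne11+j]` with entries kept in token-position order.

```python
def paged_read(cells, block_table, seq, ne11):
    """Return one stream's KV entries in logical order via the block table.

    cells       : flat list standing in for the physical KV cells
    block_table : flat list, one row of ne11 physical cell ids per stream
    ne11        : per-stream view length (number of logical KV positions)
    """
    return [cells[block_table[seq * ne11 + j]] for j in range(ne11)]


# Two streams, three logical positions each, placed at scattered physical cells.
ne11 = 3
cells = [None] * 8
placement = {0: [5, 0, 6], 1: [2, 7, 1]}  # stream -> physical cell per position
for seq, phys in placement.items():
    for j, cell in enumerate(phys):
        cells[cell] = f"kv[s{seq},p{j}]"  # what a contiguous layout would hold at j

# Position-ordered table: row `seq` lists that stream's cells for j = 0..ne11-1.
block_table = [cell for seq in sorted(placement) for cell in placement[seq]]

stream0 = paged_read(cells, block_table, seq=0, ne11=ne11)
stream1 = paged_read(cells, block_table, seq=1, ne11=ne11)
```

Because each table row is ordered by token position, the reader visits entries in the same order a contiguous layout would, which is the property the text relies on for an unchanged reduction order.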
+ +**NVFP4 native FP4-MMA on Blackwell.** The NVFP4 dense/expert weight GEMM uses +Blackwell's native FP4-MMA. The series removes a redundant activation-requantize +in the MoE broadcast projections (bit-exact byte copy of identical blocks) and +keeps CUDA graphs on for the grouped-MMQ MoE decode step. These are the only +NVFP4-specific optimizations; on non-Blackwell hardware the FP4 path falls back +to dequant. + +**The prefill/decode scheduler.** `update_slots()` already emits one unified +mixed prefill+decode batch per step. The scheduler patches change only the *count* +of prefill tokens admitted per step: decode tokens are claimed first +(decode-first), then a dynamic budget `max(n_ubatch, T - D)` (where `D` is the +live decode load and `T` is `LLAMA_MAX_BATCH_TOKENS`) admits prefill, auto- +shrinking as decode load rises. Pure scheduler policy, byte-identical when off, +orthogonal to the paged allocator. + +--- + +## 3. Patch series (0001-0063) + +Source-only patches, with intentional numbering gaps (e.g. 0005, 0027). The +decode-serving graph-reuse levers are 0040-0041. "Bit-exact" = greedy md5 / +`test-backend-ops` byte-identical to the relevant baseline; the gate methodology +is in section 5. + +### Paged-KV core (0001-0012) + +| # | What it does | Bit-exact | +|---|---|---| +| 0001 | Vendor the host-side paged KV block manager (`FreeBlockQueue`, `BlockPool`, `PagedKVManager`, chained-hash prefix cache). Pure C++17, nothing uses it yet. | n/a (no behavior) | +| 0002 | Place each sequence at permuted, non-contiguous block positions in `find_slot` (proves attention is invariant to physical KV placement). | yes (token-identical) | +| 0003 | Gather K/V/mask down to each stream's non-empty cells before `build_attn_mha`, position-sorted so the FA reduction order matches stock. | yes | +| 0004 | Drive paged placement through the vendored manager: blocks popped on demand, returned on seq end. Core kv-cache struct untouched. 
| yes (stock path byte-identical) | +| 0006 | Host-side cross-request prefix caching: hash prefix blocks, reuse matching physical blocks (ref-count++), COW-privatise before a divergent write. | yes (default off) | +| 0007 | Wire the prefix cache into the engine so a new sequence physically shares cached prefix blocks and skips recomputing the shared prefix. | yes (verified byte-identical) | +| 0008 | Wire cross-request prefix share into the llama-server continuous-batch loop so concurrent shared-prefix requests prefill only the suffix (36x fewer prefill tokens at K=32). | within CUDA batch-shape non-determinism band | +| 0009 | Replace the per-step gather with an **in-kernel paged read** (block table as `src[5]`); the K/V `get_rows` is gone. Decode step at batch32 691->696ms (was 1279ms gathered). | yes on CPU/batch1; GPU batch>1 within vec-vs-mma band | +| 0010 | Graft the block-table read into the tile kernel; add a dispatch guard so a present block table routes ONLY to vec/tile (never the mma/wmma kernels that ignore it). | yes (CPU byte-identical; vec route) | +| 0011 | Route the GQA-grouped F16 decode to the **tile kernel** (native head-group reuse) by default; vec for everything else. Paged decode to within 1.8% of stock. | vs stock-mma: different-kernel rounding; bit-exact vs vec | +| 0012 | Defensive `GGML_ASSERT(n_view % 64 == 0)` so a future pad/tile change can't silently reintroduce a past-end KV leak on the tile route. | yes (additive assert) | + +### Decode-first scheduler (0013, 0016) + +| # | What it does | Bit-exact | +|---|---|---| +| 0013 | `LLAMA_PREFILL_BUDGET`: a static per-step prefill-token budget decoupled from `n_batch` (vLLM `--max-num-batched-tokens` analogue). Flattens the decode ITL spike a long prefill inflicts (8.5x smaller worst freeze). | yes (off/short = byte-identical; == `-b` chunking) | +| 0016 | Supersede 0013 with a **dynamic decode-first** budget: `max(n_ubatch, T-D)`, auto-shrinking as decode load `D` rises. 
Policy-only inside `update_slots()`, zero libllama changes. | yes (default-off byte-identical) | + +(0014/0015 are the MoE token-tile levers: 0014 adds `LLAMA_MOE_MMQ_X` (opt-in +high-batch decode micro-opt, +4.8% on Qwen3-Coder-30B), 0015 makes it a +default-on, density-aware auto-select that is prefill-safe by construction. Both +bit-exact. 0017 is the dense FP4-GEMM occupancy-tune track: bit-exact gate green, +but every cheap occupancy lever regressed on GB10, so nothing is enabled - it +ships as the parity gate + default-off instrumentation only.) + +### Decode-serving graph reuse (0040, 0041) + +These two close the **continuous-serving** decode gap (distinct from the static +batched-bench decode kernel, which is already at vLLM parity - see +[`docs/DECODE_SERVING_SCOPE.md`](docs/DECODE_SERVING_SCOPE.md)). In serving the +host rebuilt the ggml graph on **every** decode step (layer-A graph reuse was 0%), +so the GPU idled while the host rebuilt - the host-bound -39% the static bench +hides. + +| # | What it does | Bit-exact | +|---|---|---| +| 0040 | **S1 paged decode-graph reuse** - the paged decode inputs (`input_block_table` / `input_gather_idxs`) never overrode `can_reuse` (defaults to false), so any graph carrying a paged input could never be reused. Add a correct `can_reuse` keyed on the (256-bucketed) block-table dims + a live-mctx refresh from the owning attn input. `LLAMA_PAGED_NO_GRAPH_REUSE=1` forces the pre-S1 path. | yes (md5 byte-identical reuse on/off; dense `5951a5b4`, paged-MoE `8cb0ce23`) | +| 0041 | **S3 decode-shape-stable scheduling** - keep co-batched prefill OUT of decode steps so the pure-decode batch shape stays reuse-stable (S1 makes a pure-decode step reusable; S3 makes the scheduler emit them). Pure `update_slots()` policy on top of 0016; prefill admitted on a bounded cadence (`LLAMA_PAGED_PREFILL_PERIOD`, default 8). 
**Default OFF** (opt-in via `LLAMA_PAGED_DECODE_STABLE=1`): a measured end-to-end A/B proved default-on is a serving mistake - deferring prefill admission on the period-8 cadence gives **2.5x worse TTFT** (60s vs 24s at N=256) and **20-29% lower end-to-end throughput**, with no end-to-end win at any concurrency; its apparent `decode_agg` gain was a metric artifact (faster per-step decode bought by starving prefill). Default prefers prompt prefill admission for good TTFT; opt in only for decode-dominated, low-arrival traffic where TTFT is not a concern. | yes (byte-identical on/off; per-stream independent in serving) | + +Measured (GB10, MoE Qwen3.6-35B-A3B-NVFP4, 128-client staggered streaming load): +graph reuse **0% -> 72.2%**, host window `hostproc` **15.98 -> 6.31 ms/step**, +decode **4.05 -> 5.52 tok/s/seq median (4.24 -> 5.96 mean, at vLLM's ~5.9 +sustained)**. S1 is necessary but **not** sufficient alone (13.8% reuse - prefill +co-batching churns the shape nearly every step); S3 is the multiplier of that +per-step decode metric. **But those are per-step decode numbers, not an end-to-end +serving win**: a later end-to-end A/B showed S3-default-on regresses real serving +(2.5x worse TTFT, 20-29% lower end-to-end throughput, no win at any concurrency), +because the period-8 cadence defers prefill admission. So **only S1 (0040) ships +default-on; S3 (0041) now defaults OFF and is opt-in** (`LLAMA_PAGED_DECODE_STABLE=1`, +for decode-dominated low-arrival traffic). The static batched-bench A/B isolates the S1 +mechanism: paged decode reuse 0% -> 95.5% (throughput flat there, since the static +regime is GPU-bound). **S2 (double-buffer `set_inputs`) was dropped**: the Phase-0 +profile put `set_inputs` at ~0.05 ms/step (the cost is the rebuild, not the input +copy), so it has nothing to recover. The remaining ~28% serving rebuilds are +request-boundary D/seq-set churn + the prefill-cadence steps. 
A **padded/fixed-slot +decode shape** to capture them was then implemented and GPU-tested (2026-06-28) and +**REJECTED** - it is bit-exact/inert but regresses serving throughput at every +concurrency, because this serving decode is GPU-compute-bound (baseline reuse 0% ~= +S1+S3 reuse 72% on aggregate tok/s), so the dummy-row compute it adds costs more +than the reuse it recovers. Full record + numbers in `docs/DECODE_SERVING_SCOPE.md` +("Padded-shape lever - rejected"). + +### Prefill fusions (0042, 0044) + +CUDA-family graph fusions of the pre-norm residual chain and the gated-DeltaNet +output norm: separate `rms_norm` / `mul` / `add` / `silu` launches collapse into +one kernel so the intermediate never round-trips to HBM. Bit-exact (the fused +kernel reproduces the unfused FP order; float multiply is commutative). Each is +env-gated default-ON (`LLAMA_FUSE_*=0` for a clean single-build A/B that reverts +to the byte- and kernel-identical unfused path). + +| # | What it does | Bit-exact / effect | +|---|---|---| +| 0042 | **Fused residual-add + RMS norm + weight multiply** (`rms_norm_pre_add_mul_f32`) - the pre-norm residual `h = x + sub_out; n = rms_norm(h) * w` ran as a `k_bin_bcast` ADD feeding the fused rms_norm+mul; the residual ADD has a second consumer (the skip add) so it can't pass the single-use `ggml_can_fuse`. Recognized via `ggml_can_fuse_subgraph` (ADD + final MUL both outputs), folded into one launch that publishes `h` and emits `scale * h * w`. Gate `LLAMA_FUSE_ADD_RMSNORM`. | yes (dense `5951a5b4`, MoE `8cb0ce23`); dense S_PP +0.5% | +| 0044 | **Fused gated RMSNorm + SiLU gate multiply** (`rms_norm_gate_mul_f32`) - the gated-DeltaNet output norm `(rms_norm(x) * w) * silu(z)` (qwen35 / qwen35moe `build_norm_gated`) ran as rms_norm_mul + silu_mul, two launches with the normalized intermediate crossing HBM. 
The gate z-projection (a MUL_MAT) is scheduled between the weight MUL and the SILU, so the chain is not naturally consecutive; `build_norm_gated` emits the gate multiply as `mul(silu(z), normalized)` (commutative, bit-exact) so the graph lays out the consecutive subgraph `{ SILU, RMS_NORM, MUL, MUL }` that `ggml_cuda_can_fuse` folds into one `scale * x * w * silu(z)` launch. Gate `LLAMA_FUSE_GATE_RMSNORM`. Profile (dense npp512): 672 (rms_norm_mul + silu_mul) -> 336 fused launches. | yes (dense `5951a5b4`, MoE `8cb0ce23`, paged + non-paged; `test-backend-ops` 12979/12979); S_PP dense +1.1% (~+10 us/tok), MoE +0.9% | + +### SSM (gated-DeltaNet) decode levers (0018-0022, 0028) + +These are the dominant decode levers on the Qwen3.6 hybrid models. All bit-exact. + +| # | What it does | Effect (dense q36-27b / MoE q36-35b-a3b @npl128) | +|---|---|---| +| 0018 | **In-place SSM state write-back** - the recurrence writes its final state directly into the cache slot, removing the ~225MB/copy D2D memcpy (18.9% of decode time). | dense +23.5% / MoE +18.9% | +| 0019 | **Fused recurrent-state gather** - the op reads each sequence's prior state directly from `cache[ids[seq]]` (no `get_rows` materialization); race-free in-place + ids read. | dense +37.8% / MoE +35.3% | +| 0020 | **o_proj MMVQ->MMQ reshape** - collapse the GDN output to 2D so the output projection routes to the M=128 tensor-core MMQ GEMM (was a batch<=8 MMVQ GEMV). The single biggest decode-parity lever. | dense +31.7% (->85.9% of vLLM) / MoE +23.3% | +| 0021 | **Conv-state in-place fusion** - one `ggml_ssm_conv_update_inplace` op replaces the 4-op conv chain (transpose+concat+conv+silu+ring-cpy), writing the shifted ring state in place. | dense +3.2% / MoE +3.5% | +| 0022 | **GDN recurrence occupancy/coalescing retune** - column-folding (NUM_WARPS/COLS_PER_WARP) raises memory-level parallelism on the bandwidth-bound B=128 recurrence kernel; per-column f32 FMA order unchanged. 73.4%->84.6% of GB10 peak BW. 
| dense +11.1% / MoE +8.3% | +| 0028 | **Recurrent conv-tap gather fusion** - the last `k_get_rows` in the GDN decode path (the conv-state tap gather) becomes an indexed in-kernel read. | dense ~377 t/s / MoE ~784 t/s | + +### MoE NVFP4 quant (0023, 0025, 0043) + +| # | What it does | Bit-exact | +|---|---|---| +| 0023 | **NVFP4 activation-quantize de-dup** - the broadcast up/gate projections re-quantize the same token activation once per expert; quantize the unique token activations once and byte-copy them into the expert-gathered layout. The only NVFP4-specific patch. | yes (byte-identical) | +| 0025 | **MoE decode re-graph** - keep CUDA graphs on for the grouped-MMQ MoE decode step (the upstream guard disables graphs conservatively; the grouped path has no host sync). Was env-gated `LLAMA_MOE_FORCE_GRAPHS`; now ON by default via 0043. | yes (graph replay re-issues identical kernels) | +| 0043 | **MoE decode graph default-on (D1)** - flip 0025 to ON by default: capture/replay the full-step decode CUDA graph (incl. the grouped-MMQ MoE dispatch) instead of re-issuing every kernel each step. Guard is `should_use_mmq()` (FALSE for the large-M NVFP4 prefill of 0034, so prefill keeps graphs disabled - its per-expert host-loop genuinely syncs). `LLAMA_MOE_NO_FORCE_GRAPHS=1` forces the conservative pre-0025 disable for A/B. D1 profiling: the per-expert host-loop (the only device->host MoE-routing readback) is never hit on the NVFP4 grouped path (sync count identical graphs on/off); steady decode is ~99% GPU-busy, so the cost removed is per-step host kernel RE-ISSUE, not a sync. | yes (md5 byte-identical default/off/forced; paged-MoE `8cb0ce23`, dense `5951a5b4`) | + +### Pool reclaim, block-table cache, backend gate + +| # | What it does | Bit-exact | +|---|---|---| +| 0024 | **Paged-pool burst-reclaim** - truncate trailing blocks on partial-tail `seq_rm`, defrag the free queue when idle, release blocks on slot completion. 
Fixes the long-server burst-degradation bug (post-burst prefill collapse 488->44 t/s, restored to 532). Host-side accounting only. | yes | +| 0029 | **Block-table within-step host cache** - the block table is fixed for the whole step; cache it on first build and memcpy it for the other full-attention layers (get_block_table -87%/-91%). | yes, per path (paged-MoE ref `8cb0ce23`) | +| 0030 | **Fused-op backend gate** - the fused GDN / discriminated SSM_CONV ops are CUDA-family + CPU only; force them off on any non-CUDA compute backend so a Vulkan/SYCL/Metal build can't silently run the wrong plain-conv kernel. | yes on CUDA (byte-identical pre-0030); safety gate elsewhere | +| 0031 | **Chunked parallel-scan GDN prefill kernel** (upstream TODO) - FLA-style chunked gated-delta-rule for prefill (non-KDA / f32 / final-state): intra-chunk delta rule solved in parallel (UT-transform + forward subst), inter-chunk recurrence over n_tokens/C steps. The scalar-serial form (`GDN_TC=0`) was bit-exact-benign but not faster than the tuned sequential scan at the GB10-forced C=16 (see section 5); **superseded for paged by the tensor-core M5 path of 0047**. | NEW per-path (`test-backend-ops` 91/91, <=1e-7 NMSE vs CPU ref) | +| 0047 | **GDN M5 tensor-core chunked-scan prefill, f32-only re-port, default-ON under paged KV** - the f32/tf32 tensor-core forms of 0031's scan (KK/QK Gram = M2, KS/QS state-boundary 3xtf32 = M3, P*U output = M4, full form-T solve + state-update mma = M5), single build, runtime-selected by `GDN_TC`. Ships **M5 default-on when `LLAMA_KV_PAGED` is set** (`GDN_TC=5` + `GDN_CHUNK_MIN=64`, both env-overridable; OFF/`INT_MAX` when not paged). `GDN_CHUNK_MIN` is the per-call engage threshold and stays > 1 so decode (1 tok/call) keeps the sequential recurrence (at 1 it swallows decode and drops S_TG ~25%); 64 tuned from a {1,32,64,128,256} sweep. 
The bf16/hybrid dev-tree machinery (STATE_BF16/HYBRID, the dropped 0026 ssm_bf16_tau) and the bf16 CONFIG-C (M8) plus register-resident M6/M7 variants are NOT part of this f32-only series. MoE prefill S_PP +3.5% @npp512 (3x A/B), +17.7% @npp2048; decode S_TG unchanged. | NEW per-path, benign (`test-backend-ops` GATED_DELTA_NET 46/46 default AND force-M5, incl. multi-chunk/tail-chunk/multi-seq; greedy md5 default-on == M5-forced == canonical on the gate prompt: paged-MoE `8cb0ce23`, dense `5951a5b4`; long MoE prompt = one benign greedy flip vs sequential, dense byte-identical) | +| 0046 | **GDN prefill geometry gated by scan length** - patch 0022's `(NUM_WARPS=16, COLS_PER_WARP=8)` column-fold of the GDN sequential-recurrence dispatch (`case 128`) is a decode win but was applied UNCONDITIONALLY, so it also hit dense prefill (~-6% vs stock): on a long sequential scan the launch `grid.z` collapses from `S_v/4 = 32` to `S_v/(16*8) = 1` and the SMs starve (profiled: `gated_delta_net` +54% GPU time = the whole dense-prefill regression). Gate the geometry by per-call scan length: long scans (prefill, `n_tokens >= GDN_PREFILL_NTOK`, default 256) take stock's high-grid.z `(4,1)` geometry; short scans (decode) keep the `(16,8)` retune. Recovers dense prefill +7.2% back to stock parity, keeps the decode win. `GDN_PREFILL_NTOK` tunes the crossover; an explicit `GDN_NW`/`GDN_CPW` sweep still overrides (gate yields when either is set), so the one-build %peak A/B harness is unchanged. | yes (patch 0022 proved every `{NW,CPW}` variant byte-identical, so switching geometry by scan length cannot move the md5) | + +### Speculative / MTP investigation (0054-0063) + +| # | What it does | Bit-exact / effect | +|---|---|---| +| 0054 | **Disable backend sampling for MTP drafts** - forces server MTP draft generation through the target-side sampler acceptance path instead of letting the draft backend sample independently. This was required for the Phase 14 rollback/prefix safety gate. 
| yes for canonical non-MTP gates; Phase 14 MTP normalized greedy-prefix gate passed | +| 0055 | **Trace speculative batch shapes** - adds default-off `LLAMA_SPEC_SHAPE_TRACE=1` server logs around `server_slot::handle_last_sampled_token()`, reporting normal decode rows and MTP verification `K + 1` rows (`draft`, `outputs`, `spec_i_first`, `spec_i_last`). This is instrumentation only for Phase 18 shape-entropy measurement before any scheduler experiment. | yes (env unset is silent; DGX gates after patch: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`) | +| 0056 | **Trace MoE MMQ batch shapes** - adds default-off `LLAMA_MOE_MMQ_SHAPE_TRACE=` logs from the grouped-MMQ host selector, reporting routed assignment count, estimated active experts, density, selected `mmq_x`, `mmq_y`, and stream-k. This is evidence-only instrumentation for sizing structural grouped-MMQ work after Phase 28 rejected launch-bounds/row-tile knobs. | yes (env unset and trace-enabled gates both green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; trace cap verified with 4 lines) | +| 0057 | **Trace MoE MMQ launch shapes** - extends `LLAMA_MOE_MMQ_SHAPE_TRACE=` with bounded `[LLAMA_MOE_MMQ_LAUNCH]` lines from `launch_mul_mat_q`, recording actual `ntiles_dst`, `stream_k_blocks`, tile efficiency, `fixup`, `ntx/nty/ntzw`, and compiled `mmq_x/mmq_y`. This is evidence-only instrumentation to distinguish real stream-k/fixup overhead from small-M kernel-shape cost. | yes (default-off, trace-enabled, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; Phase 31 n128 trace showed decode and prefill `fixup=0`, `stream_k_blocks == ntiles_dst`) | +| 0058 | **Trace MoE small-M MMQ candidates** - adds `LLAMA_MOE_MMQ_SMALL_M_TRACE=` and a host-only classifier for decode-like low-density grouped-MMQ shapes (`ncols_max <= 128`, density `<=4`, `mmq_x_best <=64`). 
It only counts candidate calls for the next structural tile-policy A/B; no numeric branch is added. | yes (default-off, trace-enabled, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; Phase 32 n128 trace found 4096 candidates, mostly `mmq_x_best=64/48`) | +| 0059 | **Gate MoE small-M MMQ tile policy** - adds default-off `LLAMA_MOE_SMALL_M_TILE=` to cap only classified small-M MoE grouped-MMQ calls. This was used to A/B vLLM-like smaller M blocks without changing default inference. | yes (default-off, tile16, tile8, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; Phase 33 rejected tile16 and tile8 as slower) | +| 0060 | **Trace MoE MMID dispatch routes** - adds default-off `LLAMA_MOE_MMID_ROUTE_TRACE=` around `MUL_MAT_ID` dispatch, classifying each call as `mmvq`, `mmvf`, grouped `mmq`, `mmf`, or host-sync `fallback`. This is evidence-only instrumentation to resolve whether serving hits the per-expert host-sync fallback. | yes (default-off, trace-enabled, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID` `806/806`; Phase 34 n128 trace found `mmq=2776`, `mmvq=1320`, `host_sync=0/4096`) | +| 0061 | **Trace regular MUL_MAT dispatch routes** - adds default-off `LLAMA_MUL_MAT_ROUTE_TRACE=` around regular `MUL_MAT`, classifying projection-heavy calls as `vec_f`, `mat_f`, `vec_q`, `mmq`, `batched_cublas`, `op_*`, `fp4_prefill`, or `fwht`. This is evidence-only instrumentation for the `bf16-proj` serving bucket. | yes (default-off, trace-enabled, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` `806/806`; Phase 35 n128 trace found BF16 routes `mat_f=2485`, `op_cublas=1330`) | +| 0062 | **Trace cuBLAS subroutes** - adds default-off `LLAMA_CUBLAS_ROUTE_TRACE=` around the generic cuBLAS `MUL_MAT` path, classifying calls as `nvfp4_bf16_tc`, `bf16_tc`, `f16_tc_32f`, `f16_tc_16f`, or `sgemm`. 
This is evidence-only instrumentation for the Phase 35 `op_cublas` bucket. | yes (default-off, trace-enabled, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` `806/806`; Phase 36 n128 trace found `bf16_tc=5681`, `sgemm=2511`) | +| 0063 | **Trace cuBLAS tensor names** - extends `LLAMA_CUBLAS_ROUTE_TRACE=` with `src0`, `src1`, and `dst` names so the `sgemm` bucket can be tied back to graph nodes. | yes (default-off, trace-enabled, and post-serving gates green: MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` `806/806`; Phase 37 n128 trace identified `sgemm` as `ffn_gate_inp* -> ffn_moe_logits/shared_expert_gate`) | + +> **Dropped: patch 0026 (hybrid per-head bf16 SSM state, `ssm_bf16_tau`).** Once +> the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache) +> landed, the bf16-SSM lever bought nothing: a clean re-measurement forcing **all** +> gated-DeltaNet heads to bf16 (`tau=100000`) gives **flat** decode (780.6 vs +> 780.0 t/s) - the mode engages but adds zero throughput because it is subsumed by +> the fusions. It was a precision trade (not bit-exact) plus extra bug surface and +> CUDA template-instantiation compile cost with no benefit, so it was removed. See +> section 5 ("rejected / flat levers") for the full record. + +--- + +## 4. Benchmarks + +Hardware: **GB10 / DGX Spark** (CUDA 13, sm_121). Models: dense +**Qwen3.6-27B-NVFP4** and MoE **Qwen3.6-35B-A3B-NVFP4**. Metric: `decode_agg` +S_TG (t/s) from `llama-batched-bench`, `-fa on -ngl 99`, `npp 128 / ntg 128`, +swept over serving width `npl` in {8, 32, 64, 128}. Plots: +[`qwen36_decode_overview.png`](docs/qwen36_decode_overview.png) (both models), +[`qwen36_dense_decode_vs_npl.png`](docs/qwen36_dense_decode_vs_npl.png), +[`qwen36_moe_decode_vs_npl.png`](docs/qwen36_moe_decode_vs_npl.png); raw data +[`final_benchmark.csv`](docs/final_benchmark.csv). 
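
The sweep described above can be written down as a small command builder. This is a sketch, not the project's harness: the helper names are hypothetical, the dashed `-npp` / `-ntg` / `-npl` spellings are assumed to follow upstream `llama-batched-bench` usage, and any flags the text does not state (context size, batch size) are left out rather than guessed. The environment toggles are the ones this section names for the patched column.

```python
import shlex


def batched_bench_argv(model_path, widths=(8, 32, 64, 128), npp=128, ntg=128):
    """argv for one sweep: -fa on -ngl 99, npp 128 / ntg 128, npl in widths."""
    return [
        "llama-batched-bench",
        "-m", model_path,            # placeholder path, supplied by the caller
        "-fa", "on",
        "-ngl", "99",
        "-npp", str(npp),
        "-ntg", str(ntg),
        "-npl", ",".join(str(w) for w in widths),
    ]


def paged_env(moe=False):
    """Environment toggles the text lists for the patched (paged) runs."""
    env = {"LLAMA_KV_PAGED": "1"}
    if moe:
        env["LLAMA_MOE_FORCE_GRAPHS"] = "1"
    return env


def shell_line(model_path, paged=False, moe=False):
    """Render the run as one copy-pasteable shell line."""
    env = paged_env(moe) if paged else {}
    prefix = [f"{key}={value}" for key, value in env.items()]
    args = [shlex.quote(arg) for arg in batched_bench_argv(model_path)]
    return " ".join(prefix + args)
```

Calling `shell_line("model.gguf")` gives the unprefixed command; note that, per the text, only a separately built unpatched binary reproduces the stock column, so dropping the environment prefix on the patched binary is not a stock run.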
+ +![NVFP4 decode throughput vs concurrency on GB10: llama.cpp standard vs vLLM vs LocalAI's llama.cpp patches](docs/qwen36_decode_overview.png) + +> The plot above also shows a third "bf16-tau" llama curve. That was the opt-in +> `ssm_bf16_tau` lever (patch 0026), since **dropped** - a clean re-measurement +> showed it flat once the decode fusions landed (see section 5). The numbers below +> use only **stock** vs **patched** vs **vLLM**. + +> **What was re-measured (2026-06-27).** The two llama columns - **stock** and +> **patched** - were re-measured this session on one consistent +> `llama-batched-bench` harness. The **vLLM** column is the **prior-session +> reference** (kept as-is, *not* re-run this session). Per-run peak +> VRAM was *not* re-captured: the GB10's unified Grace-Blackwell LPDDR5x reports +> `[N/A]` to `nvidia-smi --query-gpu=memory.used` and the bench does not print it +> (the memory-advantage note below is the prior-session finding). + +### (a) + (b) Patched vs stock vs vLLM + +The **stock** column is a separate, unpatched llama.cpp built at this backend's +**exact pin (`9d5d882d`)**; the **patched** column is +the paged binary, env/flag-toggled (`LLAMA_KV_PAGED=1`, plus +`LLAMA_MOE_FORCE_GRAPHS=1` for MoE). Both +run on the **same harness**, so "x over stock" is an apples-to-apples measure of +the patch series. (Note: the patch series' dominant SSM decode fusions are +compiled in, not env-gated - toggling `LLAMA_KV_PAGED` alone on the *patched* +binary does **not** reproduce stock; only the separately-built unpatched +`9d5d882d` binary does.) The **vLLM** column is a **different harness** (vLLM +server + client continuous batching) and a **prior-session reference**, so the +cross-engine "% of vLLM" is **indicative, not apples-to-apples**. 
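
The derived figures in the tables that follow ("patched x over stock" and the "% of vLLM" lines) are plain ratios of the measured decode t/s. The sketch below recomputes them from the table values; the dictionary layout and function names are illustrative, and the numbers are copied from this section rather than re-measured.

```python
# npl -> (stock, patched, vLLM prior-session) decode t/s, copied from the tables
DENSE = {8: (68.3, 85.3, 70.4), 32: (119.9, 211.9, 211.8),
         64: (142.8, 305.2, 309.1), 128: (155.1, 382.1, 418.8)}
MOE = {8: (186.7, 230.3, 256.5), 32: (267.4, 466.4, 500.8),
       64: (320.5, 622.4, 686.1), 128: (347.2, 784.3, 882.2)}


def speedup_over_stock(table):
    """patched / stock per width, rounded as in the tables."""
    return [round(patched / stock, 2) for stock, patched, _ in table.values()]


def percent_of_vllm(table):
    """patched as a percentage of the prior-session vLLM figure."""
    return [round(100 * patched / vllm) for _, patched, vllm in table.values()]
```

Keeping the two ratios separate mirrors the caveat in the text: the first is a same-harness comparison, the second crosses harnesses and is only indicative.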
+ +**Dense Qwen3.6-27B-NVFP4** (decode t/s): + +| npl | stock | patched | vLLM (prior) | patched x over stock | +|----:|------:|--------:|-------------:|---------------------:| +| 8 | 68.3 | 85.3 | 70.4 | 1.25x | +| 32 | 119.9 | 211.9 | 211.8 | 1.77x | +| 64 | 142.8 | 305.2 | 309.1 | 2.14x | +| 128 | 155.1 | 382.1 | 418.8 | 2.46x | + +Dense **patched** is ahead of or level with vLLM up to npl 64 and trails it at +npl 128 (121 / 100 / 99 / 91% of vLLM across the widths). + +**MoE Qwen3.6-35B-A3B-NVFP4** (decode t/s): + +| npl | stock | patched | vLLM (prior) | patched x over stock | +|----:|------:|--------:|-------------:|---------------------:| +| 8 | 186.7 | 230.3 | 256.5 | 1.23x | +| 32 | 267.4 | 466.4 | 500.8 | 1.74x | +| 64 | 320.5 | 622.4 | 686.1 | 1.94x | +| 128 | 347.2 | 784.3 | 882.2 | 2.26x | + +MoE **patched** is 90 / 93 / 91 / 89% of vLLM. + +**Caveat on the vLLM column.** It is a **different harness** and a +**prior-session** measurement (not re-run this session), so the cross-engine "% of +vLLM" is **indicative, not apples-to-apples**. Memory (prior session): llama uses +**1.5-3x lower** memory than vLLM. + +**Takeaway.** Re-measured this session, the patch series gives up to **2.46x +(dense) / 2.26x (MoE)** over true-stock `9d5d882d` on the same harness (close to, +slightly below, the prior 2.59x / 2.33x - llama was re-measured, vLLM kept). +Dense is ahead of or level with vLLM up to npl 64 and at ~91% at npl 128; MoE +**patched** sits at ~89-93% of the prior-session vLLM. The residual MoE gap is +structural (see section 5). + +### (c) Apple Silicon (M4, 16GB Metal) - does the patchset help here? + +Short answer: **no - the wins are CUDA/Blackwell-specific.** Two facts first: the +24GB NVFP4 GGUF doesn't fit a 16GB M4 (SSD paging), and on Metal `supports_op` +**excludes NVFP4** from `MUL_MAT`/`MUL_MAT_ID`/`GET_ROWS` (FP4 matmuls fall back to +CPU - no Apple FP4-MMA). So NVFP4 Qwen3.6 is not a Mac fit; a Metal-native Q4_K is. 
+ +Measured **stock vs patched** (same pin `c299a92c`, both built `-DGGML_METAL=ON`; +the 28-patch series **compiles clean on Metal** - the CUDA code is `#if`-guarded), +on **Qwen3-8B Q4_K_M** (a dense GQA model that fits 16GB and exercises the *live* +Metal features; no Qwen3.6 hybrid GGUF fits 16GB, and the GDN fusions gate off on +Metal anyway), `llama-bench` pp512/tg128 t/s: + +| config | pp512 | tg128 | +|---|---:|---:| +| stock | 226.7 | 20.4 | +| patched, paged **off** | 226.7 | 20.3 (= stock) | +| patched, paged **on** | 222.6 | 19.8 (~0.97x) | + +Concurrency (`batched-bench`) scales identically to stock (S_TG ~20 -> ~137 at +npl32, from llama.cpp's existing batching). **Verdict: neutral-to-slightly-negative +on Metal.** Patched-paged-off equals stock; turning paged on is ~0-3% slower +decode / ~2-8% slower prefill, because the in-kernel block-table flash-attn read +that *recovers* the gather cost is CUDA-only (`fattn-*.cuh`) - on Metal the paged +path falls back to a host-side gather, pure overhead over stock's contiguous read. +Everything Blackwell-specific (NVFP4, GDN fusions via 0030, occupancy) is inert. +So **on Apple Silicon, prefer the stock `llama-cpp` backend.** + +**Vulkan / SYCL** (source analysis): the gated-DeltaNet and SSM_CONV ops DO have +upstream kernels on Vulkan and SYCL (as on Metal), so the Qwen3.6 hybrids RUN on +all three via the non-fused path. The patchset's fusions are gated off there +(0030), so the outcome is the same neutral-to-slightly-negative as Metal - not +"won't run". This backend therefore ships **CUDA-only** (where the fusions are +live + verified); non-CUDA users should use the stock `llama-cpp` backend. See +[`UPSTREAM_LAYER2_SCOPE.md`](docs/UPSTREAM_LAYER2_SCOPE.md) for what native non-CUDA +fused kernels would take. + +--- + +## 5. 
Dev notes - what we learned + +**Bit-exact methodology.** Every bit-exact patch is gated two ways: (1) a greedy +md5 gate - `llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France +is" -n 48 --temp 0 --seed 1 | md5sum`, paged paths prefixed with +`LLAMA_KV_PAGED=1` (+ `LLAMA_MOE_FORCE_GRAPHS=1` for paged MoE), on the default +chat-template path; and (2) `test-backend-ops` (CUDA0 vs CPU oracle) for every +touched op (`SSM_CONV*`, `GATED_DELTA_NET`, `MUL_MAT`, `MUL_MAT_ID`). +For DGX work, `paged-inference-gates.sh` runs the canonical MoE/dense transcript +md5 checks and selected `test-backend-ops` filters, and refuses to start while +docker, `local-ai-worker`, GPU compute processes, or a non-free GPU lock are +present. + +For direct `llama-server` MTP serving A/B work, use +`paged-mtp-serving-bench.sh`. It runs the same pre/post inference gates, compares +baseline vs `--spec-type draft-mtp`, and captures the h2h client summaries plus +MTP acceptance lines. Phase 15 rejected current MTP serving on GB10 despite +passing safety gates; do not enable it by default. + +**The gate is per-path** (see [`PAGED_BITEXACT_NOTE.md`](docs/PAGED_BITEXACT_NOTE.md)). +Dense is bit-exact across paged/non-paged (`5951a5b4`). The **paged MoE** md5 +(`8cb0ce23`) does **not** byte-match the **non-paged MoE** md5 (`07db32c2`); this +is a benign FP-accumulation-order difference of the paged attention reduction, +**KL-validated** against the f16 reference: KLD(paged||f16) 0.13600 <= +KLD(nonpaged||f16) 0.13660, PPL within +/-0.29, ~zero probability bias - two +equivalent FP-reorderings of the same quantized model, not a regression. Future +paged-MoE regressions therefore compare to `8cb0ce23`, not `07db32c2`. + +**MoE-parity conclusion** (the residual gap is structural). 
The two heaviest MoE +decode kernels - the GDN-SSM recurrence and the NVFP4-expert GEMM - are llama +**wins** after this series (the recurrence runs at 102.6% of vLLM's bandwidth; +the GEMM ties vLLM at the LPDDR5x BW floor). The residual gap is **bf16-projection +bandwidth + the host scheduling loop**, both at the LPDDR5x floor - not a kernel +llama is losing. The MoE GEMM kernel is *not* where the gap lives. + +**Rejected / flat levers** (recorded so they are not re-tried): + +- **Lever 2 - graph/stream coverage: FLAT.** Bit-exact graph coverage was + exhausted by 0025; more graph/stream overlap is a no-op or small regression on + this model. +- **D1 premise "static decode is host-sync-bound on the MoE-routing readback": + REFUTED.** The hypothesis was that the dominant decode cost is the device->host + readback of MoE routing before launching the per-expert GEMMs (mul_mat_id's + per-expert host-loop fallback). Profiling (GB10, q36-35b-a3b-nvfp4, batched-bench + npl128) shows the opposite: on NVFP4 the grouped stream-k MMQ id-path is what + runs (routing stays device-side), so the host-loop fallback is **never hit** - + `cudaStreamSynchronize` count is *identical* with CUDA graphs on vs off (1457 + either way; only the kernel-launch count changes, ~100k vs ~229k). Steady-decode + GPU-busy is **~99%** (1% idle), i.e. static decode is GPU-bound, not idle waiting + on a sync. The one actionable residual the profile surfaced - per-step host + kernel **re-issue** when the step is not graph-captured - shipped as 0043 + (default-on full-step decode graph), worth +2.6% (npl128) to +5-13% (npl32). The + larger continuous-serving host cost is the graph **rebuild** (0040/0041), and the + irreducible floor is the per-step logits-D2H-before-sampling serial point - none + of which is the MoE-routing readback. 
+- **Lever 3 - act-quant fusion: FLAT.** The W4A4 act-quant tax is removable only + by W4A16 (a precision change, rejected) or a structural kernel rewrite; no + further bit-exact lever clears it. 0023 already banks the de-dup. +- **Lever 4 - NVFP4 the bf16 GDN/attn projections: REJECTED (KL-gate fail).** + Quantizing the projections to NVFP4 costs ~+6% PPL; vLLM deliberately keeps the + same bf16 projections. No-ship. +- **W4A16-Marlin MoE GEMM: REJECTED.** It would be a precision upgrade nobody + needs bought with a ~5% slower kernel; both kernels are already at the BW floor. + (The "the win was NVFP4-dense-quant, not the Marlin kernel" dense verdict + carries over to MoE.) +- **Chunked parallel-scan GDN prefill (patch 0031): the scalar-serial form was + FLAT-to-SLOWER at C=16 - the tensor-core M5 form (patch 0047) is the win, + now DEFAULT-ON under paged KV.** 0031 implements the upstream "faster pre-fill" + TODO - the FLA-style chunked gated-delta-rule (intra-chunk delta rule solved in + parallel via the UT-transform + forward substitution, inter-chunk recurrence + over n_tokens/C steps), math validated equivalent (numpy f32 NMSE ~1e-13; + `test-backend-ops` within the 1e-7 NMSE gate, a NEW per-path result). **But + GB10's 99KB dynamic-smem opt-in forces C=16** (the 128x128 f32 state alone is + 64KB of the all-shared layout); the scalar-serial scan (`GDN_TC=0`) was then + pinned to 1 block/SM with serial per-thread dk-reductions and measured **~761 + t/s chunked vs ~971 t/s sequential (~22% slower)**, grid-starved at low n_seqs. + The lesson held: **at this head dim the win needs tensor cores, not just + chunking.** Patch 0047 builds those tensor-core forms (KK/QK Gram = M2, KS/QS + state-boundary 3xtf32 = M3, P*U output = M4, full form-T solve + state-update + mma = M5, all `GDN_TC`-selected in one build) and ships **M5** as the default + when `LLAMA_KV_PAGED` is set. 
It is an f32/tf32-only re-port: the bf16/hybrid + dev-tree machinery (from the dropped 0026 ssm_bf16_tau) and the bf16 CONFIG-C + (M8) plus register-resident M6/M7 variants are NOT part of this series. M5 is the + variant that beats the (already 84.7%-of-peak) sequential scan while staying on + the bit-exact gate: MoE prefill S_PP **+3.5% @npp512 (3x interleaved A/B), +17.7% + @npp2048**; decode S_TG unchanged (the tuned `GDN_CHUNK_MIN=64` engage threshold + is > 1, so the 1-tok decode steps never enter the chunked path - at + `GDN_CHUNK_MIN=1` the chunked path swallows decode and collapses S_TG ~25%, the + reason the threshold is the lever). Bit-exactness is per-path benign: + `test-backend-ops` GATED_DELTA_NET is **94/94** vs CPU with M5 forced (incl. + multi-chunk n_tokens up to 256); the greedy md5 default-on == M5-forced == + canonical on the short gate prompt (paged-MoE `8cb0ce23`, dense `5951a5b4`); on + a long MoE prompt (where the default fires M5 at >=64 tokens) M5 and the + sequential path agree word-for-word until **one** benign greedy token-flip + ("the User:" vs "the User's Request:"), the dense model not flipping at all - + the textbook reduction-order flip greedy amplifies, NMSE-validated. The chunk + geometry stays env-selectable (`GDN_TC`/`GDN_CHUNK_C`/`GDN_DV_TILE`) for further + tuning; M5 is the shipped default because it wins without losing the canonical gate. +- **GDN occupancy retune (patch 0022) was a decode win but an UNCONDITIONAL + dense-prefill regression - now gated by scan length (patch 0046).** Patch + 0022's `(NUM_WARPS=16, COLS_PER_WARP=8)` column-fold of the GDN + sequential-recurrence dispatch (`case 128`) raises per-warp memory-level + parallelism on the short, wide DECODE scans (small `n_tokens`, large + `n_seqs`) - the measured +11.1% dense decode win. 
Applied unconditionally it + also hit the dense PREFILL path, where the scan is long and narrow: the launch + `grid.z` collapses from `S_v/4 = 32` to `S_v/(16*8) = 1`, the SMs starve, and + profiling attributed the whole ~-6% dense-prefill regression vs stock to + `gated_delta_net` (+54% GPU time at the (16,8) geometry). Patch 0046 gates the + geometry by per-call scan length: long scans (prefill, + `n_tokens >= GDN_PREFILL_NTOK`, default 256) take stock's high-grid.z `(4,1)` + geometry; short scans (decode) keep the `(16,8)` retune. That recovers dense + prefill +7.2% back to stock parity while keeping the decode win, and it is + bit-exact: patch 0022 already proved every selectable `{NUM_WARPS, + COLS_PER_WARP}` variant is byte-identical (the sweep cannot change the md5), so + switching geometry by scan length cannot move the greedy output. The explicit + `GDN_NW`/`GDN_CPW` one-build %peak sweep still overrides (the gate yields when + either is set), so the A/B harness is unchanged. + +**Opt-in bf16-SSM fast mode - DROPPED (was patch 0026, `ssm_bf16_tau`).** The +design premise - that bf16 KL error concentrates in long-memory heads and can be +removed by keeping them f32 - was already shaky: the error scales with the bf16 +head *count* and saturates (~0.06 MeanKLD / ~91% same-top-p) far below any useful +byte saving. The lever was then **removed entirely** once the decode fusions +(0028 recurrent-state gather-fusion + 0029 block-table cache) landed: a clean +re-measurement that forced **all** gated-DeltaNet heads to bf16 (`tau=100000`, +the most aggressive setting) gave **flat** decode throughput - **780.6 vs 780.0 +t/s**. The mode engages but buys **zero** speed; the earlier "+12%" was subsumed +by the fusions. So bf16-tau was a precision trade (not bit-exact) plus extra bug +surface and CUDA template-instantiation compile cost with **no** offsetting +benefit, and patch 0026 was dropped from the series. 
Lesson recorded so it is not +re-tried: do not reintroduce a per-head SSM-precision lever - the bandwidth it +targeted is already recovered by the gather-fusion + block-table cache. + +--- + +## 6. Architecture and quant generality + +(From the arch-generality and quant-generality audits.) + +- **15 of 16 optimizations are quant-AGNOSTIC.** Only **0023** (NVFP4 + activation-quantize de-dup) is NVFP4-specific. The SSM/paged/MMQ optimizations + help **any quant** of these models (the GDN recurrence, conv, gather and + o_proj-MMQ levers operate on the f32 recurrent state and the routing layout, + not on the weight dtype). +- **Arch-safe to build everywhere.** NVFP4 use is Blackwell-gated and falls back + to dequant on other hardware; the GB10-tuned occupancy params (0022) are + perf-only and env-selectable (`GDN_NW` / `GDN_CPW`), so they never change + correctness on other GPUs. Patch 0030 makes the fused-op emission CUDA-family + + CPU only, so a non-CUDA paged build routes to the safe upstream non-fused path. + +- **What generalizes beyond this backend (upstream candidates).** The *speedups* + are CUDA/Blackwell-specific (which is why Metal/Vulkan don't benefit - section + 4c), but several *findings and ops* are portable and worth upstreaming: + - The headline is hardware-independent: on hybrid gated-DeltaNet models, decode + is bottlenecked by the recurrent-state **plumbing** (memcpy + gathers, ~67% of + the step), not the weight GEMM. The fusions for it (in-place state 0018, gather + 0019/0028, conv 0021) are bit-exact and already have CPU reference kernels, so + they would speed up Qwen3.6 / Qwen3-Next / any hybrid-SSM decode on **every** + backend once the ggml ops gain the respective (Metal/Vulkan) kernels - the + highest-value upstream contribution. + - The o_proj GEMV->MMQ reshape (0020) is a model-graph fix (batch the projection + to hit the GEMM path) - arch-agnostic in principle, trivial to upstream. 
+ - The paged KV + cross-request prefix sharing + decode-first scheduler align with + llama.cpp's own in-progress KV / chunked-prefill work and could inform it. + - The per-path bit-exact md5 gate + the weekly upstream-drift canary is a reusable + maintenance pattern for any vendored-patch backend. + +--- + +## 7. Pin + maintenance policy + +- **Canonical source = the fork branch `mudler/llama.cpp:localai-paged`.** The + vendored `patches/paged/*.patch` files are now generated (one `git format-patch` + per commit) from that branch, which is the pin commit plus the paged patch + commits in order, so there is no more hand-export drift between the dev tree and + the shipped series. +- **Pinned to llama.cpp `0ed235ea2c17a19fc8238668653946721ed136fd`** (kept == the stock `llama-cpp` pin). The pin + is advanced **only** by the manual pin-sync process (this section): + rebase the source-only patch series onto the new tip, rebuild on GPU, pass the + bit-exact gate on every path (dense + MoE, paged + non-paged) plus + `test-backend-ops`, **and confirm the full grpc-server build links on CI**. +- **The pin must track the stock pin.** `grpc-server.cpp` is shared with the stock + backend and tracks the stock pin, so a paged pin that diverges past an upstream + server-API refactor breaks the grpc-server LINK even when the patches are + bit-exact. A bump to `c299a92c` (23 commits ahead of stock) was greedy-md5 + bit-exact but failed to link (undefined `stream_*` server helpers introduced by + the refactor), and was reverted to the then-current stock pin. The bit-exact gate alone does not + catch this; only the full CI grpc-server build does. +- **Decoupled from the nightly auto-bumper.** There is deliberately **no** + `bump_deps.yaml` entry for this backend - a naive `LLAMA_VERSION` bump could + silently shift the tree out from under the patches. 
+- **Weekly canary.** [`.github/workflows/llama-cpp-paged-canary.yml`](../../../.github/workflows/llama-cpp-paged-canary.yml) + (via [`.github/scripts/paged-canary-apply.sh`](../../../.github/scripts/paged-canary-apply.sh)) + tries the patch series against the latest upstream tip with the build's own + strict `git apply`. **Red = upstream drifted past the series -> run a + PIN_SYNC** (do not bump the pin blindly), following the policy in this section. + +--- + +## 8. Models + +> **Build coverage: CUDA-only.** This backend ships only the CUDA/cublas build +> targets (cuda-12, cuda-13, and the nvidia-l4t arm64 cuda-12/cuda-13 Jetson +> rows). There are no cpu / vulkan / sycl / hipblas / metal-darwin builds: the +> patchset's wins are CUDA/Blackwell-specific (section 4c), so off-CUDA the +> backend is neutral-to-negative and non-CUDA users should run the stock +> `llama-cpp` backend instead. The `backend/index.yaml` meta-backend resolves +> `default`/`nvidia` to a CUDA variant accordingly. + +The benchmarked NVFP4 GGUFs are published and wired into the LocalAI gallery: + +| Gallery entry | Weights (HuggingFace) | Notes | +|---|---|---| +| `qwen3.6-27b-nvfp4-paged` | [`mudler/Qwen3.6-27B-NVFP4-GGUF`](https://huggingface.co/mudler/Qwen3.6-27B-NVFP4-GGUF) | Dense, native Blackwell NVFP4 (FP4-MMA). | +| `qwen3.6-35b-a3b-nvfp4-paged` | [`mudler/Qwen3.6-35B-A3B-NVFP4-GGUF`](https://huggingface.co/mudler/Qwen3.6-35B-A3B-NVFP4-GGUF) | MoE (256 experts, top-8), `file_type MOSTLY_NVFP4`. | + +Both gallery entries set `backend: llama-cpp-localai-paged` and the paged serving config +(`paged_kv:true`, `max_batch_tokens`, `kv_unified:false`, `parallel`, +`flash_attention:on`, `context_size`). They are bit-exact. The full +backend-split + gallery plan is in +[`LOCALAI_LLAMACPP_BACKEND_PLAN.md`](docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md). + +--- + +## 9. 
vLLM parity - final state (CLOSED) + +> 2026-07-01 follow-up: the investigation was reopened for MTP safety, +> MTP-serving, graph-shape tracing, and a current-stack serving snapshot. Phases +> 14-20 are recorded in +> [`docs/GB10_PARITY_PHASE0_RESULTS.md`](docs/GB10_PARITY_PHASE0_RESULTS.md) and +> [`docs/PARITY_HANDOFF.md`](docs/PARITY_HANDOFF.md). They did not change the +> GB10 conclusion: MTP/scheduler shortcuts are rejected, and the latest clean +> stack remains below vLLM serving parity. + +The multi-week GB10 (DGX Spark, sm_121) vLLM-parity investigation is **closed**. +The standing, never-re-litigate record - full benchmark, every lever and verdict, +the structural floors, the parity verdict - is +[`docs/VLLM_PARITY_FINAL.md`](docs/VLLM_PARITY_FINAL.md). Summary: + +- **Where we are (GB10, Qwen3.6 NVFP4, vs vLLM 0.23.0).** Decode: dense is + **ahead of vLLM at low concurrency (116.7% at N=8)** and both models are + bandwidth-floored at **~56-68% of vLLM at high concurrency**. Prefill is + **~36% (MoE) / ~43% (dense)** of vLLM. Memory: **1.5-3x lower** than vLLM + (NVFP4-resident; vLLM's peak is a fixed ~109-112 GB 0.85-util reservation, + paged grows with KV from ~50 GB). Output is bit-exact per-path + (`5951a5b4` dense, `8cb0ce23` paged-MoE). +- **Why the residual is a hardware ceiling, not missing work.** Decode kernels + are already **5.4x more GPU-efficient per token** than vLLM's; the gap is the + **LPDDR5x ~273 GB/s** floor. The prefill GEMM is **FP4-MMQ-optimal** (every + alternative - 0033 dequant->cuBLAS, 0034 native FP4-MMA, 0035/Marlin W4A16, + offline-repack and vLLM-verbatim Marlin - was rejected; bf16 TC peak is ~half + FP4 peak, and vLLM itself runs a bf16-Marlin fallback on sm_121). The GDN + chunked scan is at the tractable tensor-core win (**M5 tf32**, patch 0047); + its residual is the **O(C^2) intra-chunk solve + serial recurrence** (occupancy + and dtype proven not the bound: BV -1%, bf16-C64 -18.75%). 
The serving host + loop is **closed** (~0-1% of the wall; padded-decode built + rejected). +- **Shipped, bit-exact wins.** FP4-MMQ GEMM, M5 tensor-core GDN prefill (0047), + fused residual+RMSNorm (0042), fused GatedRMSNorm+SiLU (0044), GDN-prefill + geometry gate (0046), the SSM decode-fusion stack (0018-0022/0028, up to + 2.46x/2.26x over stock), decode-graph reuse (0040/0043), the memory advantage, + and low-N decode lead. +- **The path to parity is different hardware.** Datacenter Blackwell (HBM, + native tcgen05/CUTLASS FP4) lifts the bandwidth floor and **restores exactly + the vLLM advantages that lose on GB10** (FLA blocked-solve GDN, Marlin/CUTLASS + grouped FP4, HBM-tuned full-cudagraph decode). Re-run the methodology on new + silicon; do not reopen the GB10 levers. + +Latest current-stack MoE serving snapshot (`PTOK=128`, `GEN=64`, current clean +DGX mirror `f2521ab12`, artifact +`/home/mudler/bench/phase26_audited_snapshot/20260701_053650`). This run +includes `hardware.txt` and `gate_summary.tsv`; all pre/post gate rows are +`ok`: + +| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg | +|---|------------------|-----------------|-------------------|-----------|----------|----------------| +| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% | +| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% | +| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% | + +Use `paged-current-serving-snapshot.sh` for future current-stack GB10 serving +snapshots. It targets the clean `~/llama-phase6-source` mirror, checks +docker/`local-ai-worker`/GPU-idle state, uses the owner-file lock, runs pre/post +inference gates, writes `hardware.txt`, emits `gate_summary.tsv`, and emits +paged/vLLM ratios. +`hardware.txt` records the GPU identity and hardware class so GB10/workstation +Blackwell evidence is not confused with a future datacenter-Blackwell rerun. 
+`gate_summary.tsv` records pre/post MoE md5, dense md5, and backend-op checks +so an artifact proves the inference gates passed without reading the full logs. +Do not use the stale DGX +`~/bench/combined_definitive.sh` without first porting it to the current mirror +and lock discipline. + +Phase 28 challenged the remaining low-conflict NVFP4 grouped-MMQ occupancy +knobs on the same DGX mirror +(`/home/mudler/bench/phase28_mmq_occupancy/20260701_040450`). The only buildable +variant, `GGML_CUDA_FP4_MINBLOCKS=2`, was inference-safe before and after +serving (MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID 806/806`) but regressed +n128 decode serving (`705.1 -> 689.9` decode_agg_tps, `0.9784x`). The row-tile +knob `GGML_CUDA_FP4_MMQ_Y=64` failed the NVFP4 writeback compile-time +invariant. Do not promote these knobs; grouped-MMQ parity work now requires a +structural kernel change, not launch-bounds or row-tile tweaks. + +Phase 29 added the default-off grouped-MMQ shape trace as patch `0056` +(`/home/mudler/bench/phase29_mmq_shape_trace/20260701_042428`). The helper was +added test-first (`test-cuda-mmq-shape-trace`), compiled under CUDA on DGX, and +kept inference stable with the trace disabled and enabled: +MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT_ID 806/806`. Example trace line: +`[LLAMA_MOE_MMQ_SHAPE] type=40 moe=1 ncols_dst=104 nchannels_x=256 ncols_max=13 n_active_est=104 density=1 mmq_x_max=128 mmq_x_lim=64 mmq_x_best=16 mmq_y=128 stream_k=1`. + +Phase 31 extended that trace as patch `0057` +(`/home/mudler/bench/phase31_mmq_launch_trace/20260701_064424`) with +`[LLAMA_MOE_MMQ_LAUNCH]` lines from `launch_mul_mat_q`. Default-off, +trace-enabled, and post-serving gates stayed stable: MoE `8cb0ce23`, dense +`5951a5b4`, `MUL_MAT_ID 806/806`. The n128 serving trace showed decode-like +`4800/4800` and prefill-like `4920/4920` launch lines with `fixup=0` and +`stream_k_blocks == ntiles_dst`, rejecting a no-fixup/no-stream-k shortcut for +this workload.
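
Trace lines in the `[LLAMA_MOE_MMQ_SHAPE]` format quoted in the Phase 29 note can be tallied offline with a few lines of Python. This is a minimal sketch, not part of the patch series: it assumes every trace line is a bracketed tag followed by whitespace-separated `key=value` tokens, as in the single example line, and the helper names `parse_trace_line` and `tally` are hypothetical.

```python
from collections import Counter


def parse_trace_line(line, tag):
    """Return a key -> value dict for a line that starts with [tag], else None."""
    prefix = f"[{tag}]"
    if not line.startswith(prefix):
        return None
    fields = {}
    for token in line[len(prefix):].split():
        key, sep, value = token.partition("=")
        if sep:  # keep only well-formed key=value tokens
            fields[key] = value
    return fields


def tally(lines, tag, key):
    """Count how often each value of `key` appears among lines carrying `tag`."""
    counts = Counter()
    for line in lines:
        fields = parse_trace_line(line.strip(), tag)
        if fields is not None and key in fields:
            counts[fields[key]] += 1
    return counts


# The example trace line from the Phase 29 note, plus an unrelated log line
# (the unrelated line is an assumption about what a mixed server log contains).
sample = (
    "[LLAMA_MOE_MMQ_SHAPE] type=40 moe=1 ncols_dst=104 nchannels_x=256 "
    "ncols_max=13 n_active_est=104 density=1 mmq_x_max=128 mmq_x_lim=64 "
    "mmq_x_best=16 mmq_y=128 stream_k=1"
)
log = [sample, "unrelated server log line", sample]

histogram = tally(log, "LLAMA_MOE_MMQ_SHAPE", "mmq_x_best")
print(histogram.most_common())
```

Values stay as strings so the sketch makes no assumption about which fields are numeric; convert with `int()` only for keys known to be integers.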
+ +Phase 32 added the small-M classifier trace as patch `0058` +(`/home/mudler/bench/phase32_small_m_classifier/20260701_070127`). Default-off, +trace-enabled, and post-serving gates stayed stable: MoE `8cb0ce23`, dense +`5951a5b4`, `MUL_MAT_ID 806/806`. The n128 serving trace found 4096 small-M +candidate calls: `mmq_x_best=64` 1800, `48` 1096, `40` 360, `32` 360, `16` +360, `24` 120. This justifies Phase 33 as a default-off tile-policy A/B +(`mmq_x=16`, possibly `8`) rather than a broad kernel rewrite. + +Phase 33 added default-off `LLAMA_MOE_SMALL_M_TILE=` as patch `0059` +(`/home/mudler/bench/phase33_small_m_tile_policy/20260701_071136`). The knob is +md5/op safe, but both tested values were slower in same-session n128 serving: +baseline `672.1` decode_agg_tps, tile16 `640.3` (`0.953x`), tile8 `583.2` +(`0.868x`). Do not promote simple smaller `mmq_x` caps for this workload. + +Phase 34 added default-off `LLAMA_MOE_MMID_ROUTE_TRACE=` as patch `0060` +(`/home/mudler/bench/phase34_mmid_route_trace/20260701_072737`). Default-off, +trace-enabled, and post-serving gates stayed stable: MoE `8cb0ce23`, dense +`5951a5b4`, `MUL_MAT_ID 806/806`. Live n128 serving with trace cap 4096 produced +`mmq=2776`, `mmvq=1320`, and `host_sync=0/4096`; the top shapes were +`mmq ne2=12` (1096), `mmq ne2=18` (480), and `mmvq ne2=8` (360). This refutes +host-sync fallback as the current n128 `MUL_MAT_ID` problem; follow-up work should +target grouped-MMQ small-M kernel partitioning or another measured bucket. + +Phase 35 added default-off `LLAMA_MUL_MAT_ROUTE_TRACE=` as patch `0061` +(`/home/mudler/bench/phase35_mul_mat_route_trace/20260701_074359`). Default-off, +trace-enabled, and post-serving gates stayed stable: MoE `8cb0ce23`, dense +`5951a5b4`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. Live n128 serving with +trace cap 8192 produced route counts: `mat_f=2888`, `op_cublas=2292`, +`mmq=1328`, `vec_q=1214`, `vec_f=470`. 
BF16 (`type=30`) dominated the trace +with `mat_f=2485` and `op_cublas=1330`; top BF16 shapes were `mat_f ne1=12` +(775), `op_cublas ne1=18` (760), and `mat_f ne1=8` (570). Next projection work +should trace or optimize the BF16 `op_cublas`/`mat_f` split, not batched cuBLAS. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/ACCELERATOR_PORTING_SCOPE.md b/backend/cpp/llama-cpp-localai-paged/docs/ACCELERATOR_PORTING_SCOPE.md new file mode 100644 index 000000000000..e30a2ed55721 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/docs/ACCELERATOR_PORTING_SCOPE.md @@ -0,0 +1,374 @@ +# Accelerator-porting scope: bringing the paged backend's portable benefits to Metal / SYCL / Vulkan (+ a ROCm note) + +Source-only analysis (no GPU, no build) of which `llama-cpp-localai-paged` benefits +are portable off the CUDA family, and what each port costs per accelerator. This is +the umbrella doc; it BUILDS ON, and does not repeat, +[`UPSTREAM_LAYER2_SCOPE.md`](UPSTREAM_LAYER2_SCOPE.md) (the GDN/SSM fusion kernel +scope) - that doc remains the authoritative reference for benefit #1 below. + +The backend ships **CUDA-only** today (README sections 4c, 8): off-CUDA the fusions +gate off (patch 0030) and NVFP4 falls back to dequant, so it is +neutral-to-slightly-negative there and non-CUDA users run the stock `llama-cpp`. +"Porting the benefits" is the upstream-contribution track that would make these +wins real on the other accelerators. Methodology for the work itself is in +[`.agents/vllm-parity-methodology.md`](../../../../.agents/vllm-parity-methodology.md). + +We have **no Metal / SYCL / Vulkan / ROCm hardware here**, so every port is gated +by `test-backend-ops` (backendX-vs-CPU) **on the target hardware** - the same gate +discipline the existing layer-2 doc sets out. + +-------------------------------------------------------------------------------- +## 0. The four benefits and their portability class + +| # | Benefit (patches) | Portable off CUDA? 
| Where scoped | +|---|---|---|---| +| 1 | **GDN/SSM decode fusions** (0018-0022, 0028) - in-place state write-back, fused recurrent-state gather, conv-state in-place fusion, o_proj MMQ reshape, occupancy retune | YES - per-backend KERNEL work | [`UPSTREAM_LAYER2_SCOPE.md`](UPSTREAM_LAYER2_SCOPE.md) (consolidated in section 1 here) | +| 2 | **Paged KV in-kernel block-table flash-attn read** (0009-0011) | YES - per-backend KERNEL work | **Section 2 here (the new analysis)** | +| 3 | **Decode-first prefill scheduler** (0013/0016) | YES - FREE, host-side, zero kernel work | Section 3 here | +| 4 | **NVFP4 FP4-MMA + its decode levers** (0017/0023/0025) | NO (Blackwell FP4-MMA) - out of scope; two analogues flagged | Section 4 here | + +The two kernel-bearing tracks (#1 and #2) share an identical port SHAPE - they touch +the same decode kernel(s), the same `supports_op`, the same dispatch guard, and +sequence the same way (ops-first PR, then one PR per backend). They should be +**bundled into one per-backend PR**, not pursued as two separate efforts; section 5 +sequences them together. Tracks #3 (free) and #4 (out of scope) are independent. + +-------------------------------------------------------------------------------- +## 1. Benefit #1 - GDN/SSM decode fusions (consolidated; full scope is the layer-2 doc) + +Do not re-derive this here. [`UPSTREAM_LAYER2_SCOPE.md`](UPSTREAM_LAYER2_SCOPE.md) +already establishes, and this doc adopts wholesale: + +- The base `GGML_OP_GATED_DELTA_NET` + `GGML_OP_SSM_CONV` + `GGML_OP_SSM_SCAN` + kernels **already exist on Metal, Vulkan AND SYCL**, so the Qwen3.6 hybrids RUN + on all three today via the upstream non-fused path. Layer-2 is the decode + SPEEDUP, not "make it run." 
(NB: the README section 4c no longer carries the + stale "no Vulkan kernel" line that the layer-2 doc section 0 was written to + correct - that correction has since been folded into the README, so treat + layer-2 section 0 as historical context, not a live correction.) +- The four fusion ops (A in-place state 0018, B fused state gather 0019, C + conv-update in-place 0021, D conv-tap gather 0028) reuse the existing op enums + with extra `src[]` discriminators; only OP C is a genuinely new kernel, the rest + redirect the read source / write target of the EXISTING kernel. The builders, + CPU reference kernels, model graph and `test-backend-ops` cases are SHARED and + already done. +- Per-backend net-new work, effort and gotchas: **SYCL easiest** (near-verbatim + CUDA mirror, ~250-350 LOC, no shader-gen), **Metal medium** (~350-500 LOC, + fixed 32 simdgroup = simplest bit-exactness), **Vulkan hardest** (~450-650 LOC + + shaders-gen + descriptor growth + per-vendor subgroup validation). +- Bit-exactness is per-backend BY CONSTRUCTION (the fusions redirect addresses, not + the f32 reduce order); gated by `test-backend-ops` (backendX-vs-CPU). +- Upstream path: ops-first PR (incl. the capability-driven replacement for patch + 0030's backend-name allow-list), then one PR per backend. + +The value/effort ranking from that doc (**Metal 1st, SYCL 2nd, Vulkan 3rd**) is +adopted unchanged here and, as section 5 shows, coincides with benefit #2's ranking +- which is why the two bundle cleanly per backend. + +-------------------------------------------------------------------------------- +## 2. Benefit #2 - paged KV in-kernel block-table flash-attn read (NEW scope) + +### 2.0 What it is, and why it is the lever that makes paged KV non-negative off-CUDA + +On CUDA, patches 0009-0011 replaced the per-step host-side K/V gather (patch 0003) +with an **in-kernel paged read**. 
`ggml_flash_attn_ext` gained an optional +`src[5]` = an I32 block table `[n_view, n_stream]` in token-POSITION order; the +fattn vec/tile kernel maps logical KV index `j` to physical cell +`block_table[seq*ne11 + j]` and reads `K0 + cell*nb11` / `V0 + cell*nb21` in place, +so the `get_rows` of K and V (the bulk of the gather) is gone. A null block table is +the stock contiguous read, byte-identical. Position ordering keeps the online-softmax +reduction order identical to stock, so it is bit-exact (CPU/batch1) by construction. + +The crucial point for portability: **the entire host side is already +backend-agnostic.** The block-table fill (`llama_kv_cache::get_block_table`), the +K/V views, the mask compaction, the `input_block_table` graph input, and the +`ggml.c` / `ggml.h` builder (`ggml_flash_attn_ext_set_block_table`) all live in +`src/` and `ggml/...` shared code. The ONLY per-backend work is, in each backend's +flash-attn kernel: (a) thread one extra source through to the kernel, and (b) do the +indexed read at the K/V load sites. The CPU reference already does it (patch 0009, +`ops.cpp`). + +Off-CUDA today the paged path falls back to the **host-side gather** (patch 0003), +which the README section 4c measured as neutral-to-slightly-negative on the M4 +(~0-3% slower decode / ~2-8% slower prefill vs stock's contiguous read - pure +overhead, because the in-kernel read that *recovers* the gather cost is CUDA-only). +**Porting the block-table read is exactly what flips paged KV from +"neutral-to-negative" to "neutral-to-positive" off CUDA** - it removes the gather +overhead so paged KV's memory-management and prefix-sharing wins come for free +instead of at a decode tax. (The big decode multipliers on the hybrids are still the +benefit-#1 GDN fusions; this benefit is what makes the paged *allocator* pay its own +way off CUDA.) 
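
The indexing rule above (logical KV index `j` maps to physical cell `block_table[seq*ne11 + j]`, visited in position order) can be illustrated on the host in plain Python. This is a hedged sketch, not the kernel: `attend`, the pool size, and the random data are illustrative assumptions. It only demonstrates why redirecting the read address, while keeping the visiting order, leaves an online-softmax result bit-identical to the contiguous read.

```python
import math
import random


def attend(q, k_cells, v_cells, n_kv, cell_of):
    """Single-query online-softmax attention over logical positions 0..n_kv-1.

    cell_of(j) returns the physical cell holding logical position j.
    """
    running_max = -math.inf
    denom = 0.0
    acc = [0.0] * len(q)
    for j in range(n_kv):  # position order is the same for both read modes
        cell = cell_of(j)
        score = sum(qi * ki for qi, ki in zip(q, k_cells[cell]))
        new_max = max(running_max, score)
        rescale = math.exp(running_max - new_max)  # exp(-inf) == 0.0 on the first step
        weight = math.exp(score - new_max)
        denom = denom * rescale + weight
        acc = [a * rescale + weight * vi for a, vi in zip(acc, v_cells[cell])]
        running_max = new_max
    return [a / denom for a in acc]


random.seed(0)
HEAD_DIM, NE11, N_STREAM, POOL = 4, 6, 2, 32  # illustrative sizes only


def rand_vec():
    return [random.uniform(-1.0, 1.0) for _ in range(HEAD_DIM)]


# Contiguous per-stream K/V (the stock layout) and one query per stream.
k_contig = [[rand_vec() for _ in range(NE11)] for _ in range(N_STREAM)]
v_contig = [[rand_vec() for _ in range(NE11)] for _ in range(N_STREAM)]
queries = [rand_vec() for _ in range(N_STREAM)]

# Scatter the same rows into a shared physical pool; the flat block table
# records, per stream and logical position, which cell holds the row.
block_table = random.sample(range(POOL), N_STREAM * NE11)
k_pool = [[0.0] * HEAD_DIM for _ in range(POOL)]
v_pool = [[0.0] * HEAD_DIM for _ in range(POOL)]
for seq in range(N_STREAM):
    for j in range(NE11):
        cell = block_table[seq * NE11 + j]
        k_pool[cell] = k_contig[seq][j]
        v_pool[cell] = v_contig[seq][j]

results = []
for seq in range(N_STREAM):
    stock = attend(queries[seq], k_contig[seq], v_contig[seq], NE11, lambda j: j)
    paged = attend(queries[seq], k_pool, v_pool, NE11,
                   lambda j, s=seq: block_table[s * NE11 + j])
    results.append((stock, paged))
    print(seq, stock == paged)
```

The comparison uses exact float equality on purpose: the same values are combined in the same order, so only the read address differs between the two calls. A real kernel has additional per-backend reduction structure, which is why the ports are still gated by `test-backend-ops` on the target hardware.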
+ +### 2.1 The cross-cutting finding (applies to all three backends) + +The indexed per-cell read only fits the **vec / scalar decode kernel**. Every +backend's FAST attention path - CUDA mma, Metal `simdgroup_load` MM, Vulkan +coopmat2, SYCL tile - loads K/V as **contiguous tiles** (8-cell `simdgroup_load`, +`coopMatLoadTensorNV` over a linear stride, shared-memory tile loads) that cannot +express an arbitrary per-cell gather without a staging pre-pass. This is exactly why +the CUDA port (0009-0010) wired ONLY the vec kernel and added a dispatch guard +(`if (dst->src[5]) force vec`). + +So each port mirrors that: **route any FA op carrying a block table onto the vec / +scalar kernel; leave the fast MM path contiguous-only**, and keep the null-table +contiguous read on the fast path untouched. The decode shape (1 query token/stream) +naturally lands on or near the vec/scalar kernel on all three, so this is a small +routing change, not a rewrite of the fast path. + +### 2.2 SYCL - EASIEST (near line-for-line CUDA mirror) + +- **Exists today:** `ggml-sycl/fattn-vec.hpp` is a DPCT-style near-verbatim mirror + of CUDA `fattn-vec.cuh`; the kernel signature ends in the same `nb11..nb33` + cluster the CUDA patch appends `const int* block_table` to (fattn-vec.hpp:65-76). + Args are passed by SYCL lambda value-capture - **no descriptor/binding/push- + constant bookkeeping at all** (strictly easier than CUDA). `supports_op` + (`fattn.cpp` -> `ggml_sycl_get_best_fattn_kernel`) needs no change to ACCEPT + `src[5]`. +- **Port shape (value: medium / effort: LOW):** append `const int* block_table` + to the kernel + `fattn_kernel_t` typedef + `lauch_kernel`/`launch_fattn` + (sourcing `dst->src[5]->data`); 3 read-site substitutions (K at line 318, V at + 389 and 410): `K0 + block_table[seq*ne11 + k_VKQ_0 + i_KQ]*nb11`. +- **Two SYCL-specific gotchas:** + 1. 
**Pointer pre-advance.** The vec kernel advances `K`/`V` by `k_VKQ_0` OUTSIDE + the inner read (fattn-vec.hpp:293-300), so `i_KQ`/`k` are tile-local. The port + must keep an UN-advanced base `K0`/`V0`, drop the per-iteration `K +=`/`V +=` + on the paged path, and reconstruct the absolute cell. Get this wrong and you + read the wrong cells with NO compile error. + 2. **Dispatch guard is bigger than CUDA's.** f16-GQA decode routes to the TILE + kernel, not vec (`fattn.cpp:198-208` fall-through). Add + `if (dst->src[5]) return BEST_FATTN_KERNEL_VEC;` near the top of + `ggml_sycl_get_best_fattn_kernel`. The shared `fattn_kernel_t` typedef means + the tile kernel must gain a matching ignored `block_table` param (or split the + typedef) - a trivial chore. +- **Bit-exact:** sub-group width (16) is fixed and the indexed read does not touch + lane assignment, loop bounds, or the XOR-reduction stride - reduction order is + invariant, so the paged vec path is byte-identical to SYCL's own contiguous vec + path. Gate: `test-backend-ops` FLASH_ATTN_EXT (with a block-table case) on Intel + GPU. + +### 2.3 Metal - EASY-MEDIUM (decode already routes to the vec kernel) + +- **Exists today:** decode (1 query token/stream, GQA) dispatches to + `kernel_flash_attn_ext_vec` (`ggml-metal-ops.cpp` `..._use_vec`: `ne01 < 20`). + Metal IS a true vec-equivalent (not a single unified FA kernel), and the vec + kernel's quantized K/V branches ALREADY compute a per-cell base address + (`k + ((ic + NE*cc + ty)*nb11)`, ggml-metal.metal:6934 / V at :7045) - so a + per-cell indexed read is unambiguously admissible. `supports_op` + (`ggml-metal-device.m` FLASH_ATTN_EXT) inspects no src count, so `src[5]` is + accepted as-is. 
+- **Port shape (value: HIGH / effort: EASY-MEDIUM):** append a + `device const char * block_table` param after `dst` (**buffer index 8** for vec) + + a kargs field + a `has_block_table` function-constant; reuse the existing + "bind dummy when null" idiom for a missing table; substitute the cell index with + `block_table[seq*ne11 + cell]` at the K reads (lines 6919/6934) and V reads + (7032/7045) - a localized rewrite of ~2 loops (the fast path must adopt the + per-cell base form the quantized branch already uses). +- **Gotcha:** the **non-vec MM kernel is a HARD blocker** - + `simdgroup_load(..., NS10, ...)` reads 8 physically-CONTIGUOUS KV cells as one + matrix tile (lines 6160 / 6339-6363); an arbitrary gather can't be a single + strided matrix load. Mitigate exactly as CUDA did: force any block-table op onto + the vec kernel in `..._use_vec` (ggml-metal-ops.cpp:2517); leave the MM path + contiguous-only. Also watch a NAME COLLISION: `kernel_flash_attn_ext_blk` is an + existing mask-skip optimization, NOT a paged block table. +- **Bit-exact:** fixed 32-wide simdgroup + address-only redirect = byte-identical to + Metal's own vec contiguous path. Gate: `test-backend-ops` on Apple Silicon. + +### 2.4 Vulkan - MEDIUM (the fast NVIDIA decode path cannot do it) + +- **Exists today:** three FA shaders - `flash_attn.comp` (scalar/vec), + `flash_attn_cm1.comp` (coopmat1, stages K/V through shared memory), + `flash_attn_cm2.comp` (coopmat2, the fast NVIDIA path). FA uses **7 descriptor + bindings (0-6)**; `supports_op` (`ggml-vulkan.cpp` FLASH_ATTN_EXT) checks + specific srcs only, no count check; but `src[5]` is **not even threaded today** - + `ggml_vk_flash_attn` stops at `src[4]` (ggml-vulkan.cpp:14537), so wiring it + through is part of the work. 
+- **Port shape (value: HIGHEST breadth / effort: MEDIUM):** add binding 7 in the + shader(s), bump `7`->`8` in the three `ggml_vk_create_pipeline` calls (:3997, + :4033, :4070) and the two dispatch subbuffer lists (passing a dummy when null), + and wrap the indexed read in one `phys_kv()` helper applied at the ~4 K + 2 V + load sites (flash_attn.comp; the logical index is the same `(j*Bc + ...)` + expression at every site). +- **Two gotchas, one structural:** + 1. **Push constants are FULL.** `vk_flash_attn_push_constants` is exactly + 128 bytes with a `static_assert(... <= 128)` (the Vulkan guaranteed minimum) - + **no room for a new field.** Signal "block-table enabled" via the existing + `Flags` spec constant (flash_attn_base.glsl, `constant_id=10`, already + bit-packed) - add a `BLOCK_TABLE_ENABLE` bit. The per-seq stride is already + `p.KV`; the seq index is derivable in `init_indices()`. + 2. **coopmat2 (the fast NVIDIA GQA-decode path) is INCOMPATIBLE.** Its K/V load + is a hardware `coopMatLoadTensorNV` over a LINEAR stride + (flash_attn_cm2.comp:307-313/377-383); the decode callback only dequantizes, + it cannot remap the physical address. The indexed read drops cleanly into + **scalar** (which non-GQA decode already uses) and **cm1** (which stages + through shmem - remap the staging loop), but **not cm2**. With a block table + present, NVIDIA GQA decode falls back to scalar/cm1 (slower than cm2, still + correct); the **null-table path keeps using cm2 unchanged**. AMD/Intel (no + cm2) are fully covered by scalar/cm1. +- **Net positive?** Yes. Non-GQA decode already runs scalar (paged read ~free); + AMD/Intel covered; only NVIDIA GQA decode trades cm2 for scalar/cm1 *when a table + is supplied*, and paged KV's payoff is allocator/memory + prefix-sharing, not raw + FA throughput, so the trade is contained and the fast contiguous path is + untouched. 
+- **Bit-exact:** the read is a per-thread scalar load, subgroup-size agnostic + (already abstracted via the `SubGroupSize` spec constant); position ordering keeps + the reduction order identical, so byte-identical to the backend's own + scalar/cm1 contiguous path. **Build burden is low** - these are EXISTING shader + variants recompiling (no new `string_to_spv` shape), so no shaders-gen matrix + growth. Gate: `test-backend-ops` per vendor (AMD + Intel + NVIDIA). + +### 2.5 Benefit-#2 ranking and the shared dispatch/supports_op pattern + +| backend | value | author effort | structural risk | rank | +|---|---|---|---|---| +| SYCL | medium (Intel GPU) | **LOW** (line-for-line; no bindings) | low (pointer pre-advance; force-vec guard) | easiest | +| Metal | **HIGH** (largest non-CUDA base) | EASY-MEDIUM (decode = vec already) | medium (MM blocker -> force vec) | mid | +| Vulkan| **HIGHEST breadth** (AMD+Intel+NVIDIA) | MEDIUM (7->8 bindings; Flags bit) | medium (cm2 can't; full push-const) | hardest | + +Common to all three (mirrors CUDA 0009-0010): (1) `supports_op` needs no change to +ACCEPT `src[5]`; (2) a **dispatch guard forces any block-table op onto the +vec/scalar kernel**; (3) the fast MM/coopmat2 path stays contiguous-only and the +null-table read on it is byte-identical to stock. + +-------------------------------------------------------------------------------- +## 3. Benefit #3 - decode-first prefill scheduler (FREE portable win, confirmed) + +Patches 0013 (static `LLAMA_PREFILL_BUDGET`) and 0016 (dynamic decode-first +`max(n_ubatch, T-D)`) are **pure host-side scheduler policy inside `update_slots()` +with zero libllama / zero ggml-backend changes** (README sections 2, 3). They change +only the *count* of prefill tokens admitted per step; they touch no kernel, no +`supports_op`, no device code. They are therefore **already backend-portable with no +per-accelerator work** - they run identically on Metal, SYCL, Vulkan, ROCm, CPU. 
+Byte-identical when off (default-off / short prefill == upstream `-b` chunking). + +This is the cheapest portable benefit: it needs no port at all, only the decision to +leave it enabled in the (currently CUDA-only) build, or to upstream the policy. The +only reason it is not "live everywhere" today is that the backend ships CUDA-only; +the code itself is accelerator-neutral. If the scheduler levers are upstreamed +independently of the kernels, they help any llama.cpp build on any accelerator at +once - the lowest-effort, broadest-reach contribution of the whole series. + +-------------------------------------------------------------------------------- +## 4. Benefit #4 - NVFP4 FP4-MMA (NOT portable) + two backend-agnostic analogues + +The NVFP4 decode track is **Blackwell-specific and out of scope** for accelerator +porting: Metal, SYCL, Vulkan and ROCm/AMD lack native FP4-MMA (Metal `supports_op` +already excludes NVFP4 from `MUL_MAT`/`MUL_MAT_ID`/`GET_ROWS`; on non-Blackwell the +FP4 path dequants). Patch 0017 (dense FP4-GEMM occupancy tune) ships only as the +parity gate + default-off instrumentation even on CUDA, so there is nothing to port. + +Two of the NVFP4 *decode levers*, however, have backend-agnostic analogues worth a +note (do not over-claim - these are observations, not scoped ports): + +- **0023 (NVFP4 activation-quantize de-dup)** - the IDEA generalizes, the patch does + not. The MoE broadcast up/gate projections re-quantize the same token activation + once per expert; 0023 quantizes the unique activations once and byte-copies them + into the expert-gathered layout. Any backend whose MoE path requantizes a shared + activation per-expert (e.g. a Q8 activation-quant before an integer-dot MoE GEMM) + could dedup the same way. It is NOT NVFP4-specific in PRINCIPLE - but it IS the + one quant-specific patch in the series (README section 6), so a port is a + per-backend MoE-quant investigation, not a lift-and-shift. Low priority. 
+- **0025 (MoE decode re-graph / `LLAMA_MOE_FORCE_GRAPHS`)** - keeping the graph/ + capture path on across the grouped-MMQ MoE decode step is a CUDA-graphs concept. + Metal/Vulkan/SYCL have their own command-buffer/graph reuse machinery; the + generalizable finding is "the grouped MoE decode step has no host sync, so it is + safe to keep in a captured/replayed command buffer." Whether each backend's graph + layer already covers this is a per-backend question. The methodology note (README + dev notes: graph/stream coverage was a FLAT lever beyond 0025 on CUDA) is the + more durable takeaway - do not expect a large graph-coverage win on any backend. + +Neither analogue is on the critical path; both are recorded so the next person does +not mistake them for free ports. + +-------------------------------------------------------------------------------- +## 5. Combined sequencing and top recommendations + +Benefits #1 (GDN fusions) and #2 (block-table FA read) share the port shape +(vec/scalar decode kernel + `supports_op`/dispatch guard + ops-first-then-per-backend +PR) and rank in the SAME order per backend. So sequence them TOGETHER, per backend, +behind one shared ops-first PR: + +1. **PR #1 - OPS (largely done, upstreamable as-is):** the `ggml.h`/`ggml.c` + builders, the CPU reference kernels, the CUDA kernels, the `test-backend-ops` + cases (GDN fusions AND a FLASH_ATTN_EXT block-table case), and the + **capability-driven gate** replacing patch 0030's backend-name allow-list (make + `supports_op` + the dispatch guard authoritative, so routing falls out of the + normal scheduler fallback and no backend name is hard-coded). Independently + mergeable. +2. **PR #2 - Metal:** GDN fusion kernels (layer-2 doc) + block-table read into + `kernel_flash_attn_ext_vec` + the force-vec routing guard. Gate on Apple Silicon. +3. **PR #3 - SYCL:** the near-verbatim CUDA mirror of both tracks + the force-vec + guard. Gate on Intel GPU. +4. 
**PR #4 - Vulkan:** GDN fusion shaders + the scalar/cm1 block-table read (cm2 + stays contiguous, falls back when a table is present) + the `Flags` spec-constant + bit + the 7->8 binding bump. Gate per vendor. + +Do NOT bundle the backends into one PR (each needs its own hardware for +`test-backend-ops`; reviewers are backend-specialized; a regression in one must not +block the others). + +### Top recommendations + +1. **Metal first, both benefits together.** Largest non-CUDA LocalAI base; the + decode shape already routes to the Metal vec kernel (block-table read is + EASY-MEDIUM there) and the base GDN/conv kernels already exist (fusions are + MEDIUM); fixed 32-wide simdgroup makes bit-exactness the simplest of the three. + Highest value at moderate effort. +2. **SYCL second as the cheap mechanical follow-on.** Both tracks are near + line-for-line CUDA mirrors with no binding/shader-gen bookkeeping, so it is + low-cost insurance even though the Intel-GPU audience is smaller. Budget the + effort on the two SYCL gotchas (pointer pre-advance; the force-vec guard since + f16-GQA decode routes to tile), not on plumbing. +3. **Vulkan last as the high-breadth capstone.** Reaches AMD + Intel + NVIDIA, but + carries the most host glue and the coopmat2 limitation (NVIDIA GQA decode trades + the fast path for scalar/cm1 only when a table is present). Do it once the + pattern is proven on Metal + SYCL. + +A cheaper variant (from the layer-2 doc, reaffirmed): ship **Metal + SYCL together** +right after the ops PR and treat Vulkan as a separate later effort. + +-------------------------------------------------------------------------------- +## 6. ROCm note + +ROCm is in the **CUDA family**, not a separate port: patch 0030's allow-list already +admits `"CUDA"/"ROCm"/"MUSA"`, and the CUDA kernels compile for HIP, so benefits #1 +and #2 are largely already-built or near-free on ROCm rather than a from-scratch +accelerator port. 
Two caveats: + +- **FP4-MMA (benefit #4) stays NVIDIA-Blackwell-only** - AMD has no native FP4-MMA, + so the NVFP4 path dequants on ROCm exactly as elsewhere. +- **The block-table read's force-vec routing matters on AMD too.** The AMD fast FA + path is the wmma/mma kernel (`fattn-wmma-f16`), which - like CUDA mma, Metal MM + and Vulkan cm2 - ignores the block table; the CUDA dispatch guard already forces a + block-table op onto the vec kernel, so ROCm inherits correct routing, but the + perf trade (vec vs wmma for AMD GQA decode with a table present) should be + measured on AMD hardware before claiming a win. The GDN fusions, being plain + CUDA-C, port to HIP with the rest of the CUDA path. + +Net: ROCm is a "validate, don't re-port" follow-up - confirm the HIP build picks up +the fusions + the force-vec block-table routing and gate it with `test-backend-ops` +on an AMD GPU. It is genuinely separate from, and lighter than, the Metal / SYCL / +Vulkan ports. + +-------------------------------------------------------------------------------- +## 7. Summary + +- **Benefit #3 (decode-first scheduler) is free and already portable** - host-side + policy, zero kernel work; it only needs to be left enabled / upstreamed. +- **Benefits #1 (GDN fusions) and #2 (block-table FA read) are the real ports** - + both are vec/scalar-decode-kernel + `supports_op`/dispatch-guard changes, both + rank Metal-then-SYCL-then-Vulkan, and they bundle into one per-backend PR behind a + shared ops-first PR. +- **Benefit #2 is the lever that makes paged KV non-negative off CUDA** - it removes + the host-gather overhead the README measured as neutral-to-slightly-negative on + the Mac. Feasibility: SYCL EASY, Metal EASY-MEDIUM, Vulkan MEDIUM. The universal + constraint is that only the vec/scalar kernel admits the indexed read; the fast + MM/coopmat2 path is contiguous-only, so route block-table ops onto vec (as CUDA + already does) and leave the fast path's null-table read byte-identical. 
+- **Benefit #4 (NVFP4 FP4-MMA) is out of scope** (Blackwell only); 0023's de-dup and + 0025's graph-coverage have backend-agnostic *ideas* but no lift-and-shift port. +- **ROCm rides the CUDA path** (validate, don't re-port); FP4-MMA stays Blackwell-only. +- Everything is bit-exact per-backend BY CONSTRUCTION (position-ordered table + + address-only redirect = identical reduction order), gated by `test-backend-ops` + (backendX-vs-CPU) **on the target hardware**, which we do not have here. + + diff --git a/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md b/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md new file mode 100644 index 000000000000..2cd5b9125154 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md @@ -0,0 +1,4708 @@ +# llama.cpp vLLM Parity Benchmark Ledger + +This file tracks each parity attempt from Phase70 onward, plus the immediate +context needed to interpret the current record. Append every new attempt here +with artifact path, gates, benchmark rows, and decision. + +## Current Status + +- Goal: reach vLLM speed parity in llama.cpp on GB10. +- Current decision model: MoE `q36-35b-a3b-nvfp4`. +- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`. +- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`. +- Current tested source: DGX mirror + `/home/mudler/llama-phase93-qwen3next-gqa-bcast`, local guardrail stack plus + Qwen3Next grouped Q/K broadcast for fused GDN. +- Latest attempt: Phase141 GDN decode-only noise-floor repeat. +- Latest decision: recurrence-level GDN source A/B must normalize by launch + count or control the decode capture window tightly. Phase141 ran five + identical current-binary decode-only captures with pre/post gates green. Raw + `gdn_core_ms` had median `1415.500`, stdev `30.641`, CV `2.146%`, and range + `1410.300..1482.140 ms`, mostly because capture windows recorded `597`, + `598`, `600`, or `630` `gdn_core` launches. 
Normalized + `gdn_core_ms_per_launch` was much steadier: median `2.359167`, stdev + `0.005399`, CV `0.229%`, range `2.352603..2.366917 ms`. A future + recurrence-level source patch must beat `max(2.0%, 3 * same-binary stdev)` + on repeated A/B medians, using per-launch GDN core when launch counts drift; + for Phase141 that means at least `6.49%` raw `gdn_core` reduction or `2.0%` + launch-normalized reduction. Phase140 still rejects prep-only L2 fusion. The + most defensible small source follow-up is a default-off scalar gate/beta + hoist inside `gated_delta_net_cuda`; the vLLM-style packed decode recurrence + remains a larger redesign, not a shortcut. + Phase137 was rejected with no source changes: `GDN_NW=4 GDN_CPW=1` improved + isolated 1-token GDN rows but regressed real serving versus Phase135 + (`208.0/332.7 -> 206.2/324.9` aggregate/decode t/s, `gdn_core` + `5926.55 -> 6466.27 ms`). Phase135 remains the current best default-off + routed-FFN base without Phase138 finalize, but not parity. Phase135 adds + `LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1` on top of + `LLAMA_MOE_ROUTED_FFN_POC=1`: it computes `silu(gate) * up` directly into + the NVFP4 MMQ activation layout and launches raw down MMQ, skipping both the + sorted F32 buffer and the separate activation-quant kernel. Focused gates and + canonical opt-in gates passed; trace proved six `mmq_moe_quantized_raw` + launches and zero `mmq_moe_sorted_raw` launches. Focused perf was mixed but + better at the larger sentinel: default `805.92/1031.06 us`, Phase135 + `807.92/1024.97 us` for `n=128/257`. The same opt-in serving profile at the + Phase130 shape passed pre/post gates and improved decode aggregate t/s + `326.9 -> 332.7`, while `mmq_nvfp4` dropped `6009.52 -> 5915.24 ms`; total + kernel time still rose slightly (`20.1559 -> 20.2498 s`) because GDN and + projection buckets moved up. 
Next work should either make this path + default-off-clean enough for broader serving comparisons, or attack the + remaining MoE launch/writeback overhead (`mmq_fixup`, route metadata, and + direct weighted combine) rather than another F32 intermediate. Phase134 is + kept as a default-off fused-SWIGLU structural base, + not as a promoted speedup. Phase134 adds + `LLAMA_MOE_ROUTED_FFN_FUSED_SWIGLU=1` on top of + `LLAMA_MOE_ROUTED_FFN_POC=1`: it executes `gate_up`, computes + `silu(gate) * up` directly into expert-sorted F32 rows, then calls the raw + MMQ down helper. Selected opt-in gates passed `13/13`; trace proved six raw + sorted launches; canonical opt-in gates passed MoE/dense md5, + `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. + Focused perf was mixed: default `804.92/1026.02 us`, Phase134 + `810.61/1025.68 us` for `n=128/257`. It removes the Phase133 standalone + `glu -> get_rows` boundary and recovers n=257, but the extra fused-SWIGLU + kernel is still slower at n=128. Next work should fuse SWIGLU directly into + the down-MMQ quant buffer, or otherwise remove one more launch/buffer. + Phase133 remains only as a default-off structural base for the + next fused routed-FFN slice, not as a speedup. Phase133 adds + `LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1` on top of + `LLAMA_MOE_ROUTED_FFN_POC=1`: it keeps baseline `gate_up` and `SWIGLU`, + gathers the computed SWIGLU output into expert-sorted compact F32 rows, and + calls a raw MMQ down helper without constructing fake tensors. Default and + opt-in canonical gates passed with canonical MoE/dense md5s, + `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`; + selected default/Phase132/Phase133 gates passed `13/13`, and trace proved + six `mmq_moe_sorted_raw` launches. Focused perf was not a win: + default `807.37/1020.76 us`, Phase132 `808.21/1018.87 us`, Phase133 + `808.85/1026.87 us` for `n=128/257`. 
The next phase must fuse + SWIGLU-to-sorted or SWIGLU-to-quant to remove the added gather/quant boundary; + do not promote sorted-down as-is. Phase132 remains the cleaner default-off + scaffold if Phase133 needs to be bypassed. Phase131 challenged the Phase130 fork with two read-only + source explorers. Both rejected another cheap source patch: MoE/FFN-GEMM work + should not continue unless it funds a real fused routed-FFN kernel/executor, + and GDN work should not continue unless it materially changes the f32 + recurrent-state traffic without BF16/quality drift. The next active line is + therefore a default-off fused routed-FFN PoC scoped from vLLM's real fused MoE + design and llama.cpp's current `gate_up -> SWIGLU -> down` executor hook. + Phase131 is a no-source decision/architecture attempt, not a speedup claim. + Keep carrying the Phase93 Qwen3Next GQA-repeat removal + candidate as a decode-profile positive, but it does not close serving parity. + Phase130 refreshed the current-stack graph-node serving profile after the + Phase129 rejection. Pre/post gates stayed green and the profile confirms the + live serving bottleneck remains split between `mmq_nvfp4` (`6009.52 ms`, + `29.82%`) and `gdn_core` (`5891.40 ms`, `29.23%`), with FA only `1.28%` and + get-rows only `1.39%`. This rejects the paged-mask/F16 get-rows idea as the + next source patch and keeps the next credible work on either a larger + MoE/FFN-GEMM executor/kernel or a larger GDN recurrence redesign. Phase129 + tested a default-off Qwen35/Qwen35MoE grouped Q/K broadcast probe for + fused GDN, reusing the existing Qwen3Next op-param path. The default path was + md5/op clean, but the valid opt-in gate changed the MoE greedy md5 to + `b773e2f032aa0e992626d486b321808e`, so the source was rejected and reverted. + Do not port Qwen3Next grouped-broadcast semantics to Qwen35/Qwen35MoE under + the current bit-exact rule. 
Phase128 scoped the Qwen3Next BF16 GDN S-cache + idea and rejected/reverted the + source probe for the current target: the active `q36-35b-a3b-nvfp4.gguf` + model loads as `qwen35moe`, no true Qwen3Next GGUF was found on DGX, and the + existing Qwen35/Qwen35MoE BF16 S-cache lever was already rejected by the + Phase82 f16-reference KL gate. Phase127 tested the first whole-MoE + expert-major executor using the Phase126 helper; it passed selected + correctness and emitted expert-major markers, but was rejected and reverted + because focused perf regressed `MOE_SWIGLU_DOWN` at both n=128 and n=257. + Phase126 remains the kept scaffold. + Phase104 measured the combined cleanup stack in the normal same-session + serving harness against vLLM at `N=128`. It is md5/op clean and modestly + improves paged serving versus Phase97 (`agg_tps 329.6 -> 338.6`, + `prefill_tps 1734.5 -> 1813.0`, `TTFT 7415.4 -> 7121.6 ms`), but it is not + parity-closing: paged/vLLM is `0.6574` on decode and `0.5122` on aggregate. + Phase105 refreshed the current-stack grouped-MMQ evidence: ragged MoE and + full `MUL_MAT_ID` gates still pass, serving launch traces still have + `fixup=0` and `stream_k_blocks == ntiles_dst`, and the simple live request + landed in density-10 prefill-like shapes (`mmq_x_best=112`) rather than a new + small-M decode opportunity. Phase106 then tested the C1 high-concurrency + operating-point hypothesis at `N=128/192/256`; vLLM completed all legs and + stayed ahead, so C1 is rejected for the current GB10 stack. Do not add another + MMQ micro-policy patch or scheduler shortcut. Phase107 established the + existing fused-MoE correctness guardrails and found that `test-backend-ops + perf` did not emit timing rows for these custom whole-graph cases. Phase108 + added the missing measurement-only harness by exposing the existing MoE + whole-graph cases to perf mode and expanding CSV output to include timing + fields. 
Use these timings to rank fused routed-MoE work; do not start a fused + kernel without improving one of these rows and preserving md5/op gates. + Phase109 tested the existing default-off W4A16 and FP4 large-M MoE routes, + plus the cheapest grouped-MMQ density/tile-policy knobs, on the Phase108 rows. + All selected op gates passed, but none of the env-only routes is a useful + parity lever: W4A16 and FP4 large-M are much slower at `n_tokens=257`, while + `LLAMA_MOE_DENSITY_MAX=9` / `LLAMA_MOE_MMQ_X=64` are noise-level on + `MUL_MAT_ID_RAGGED_MOE` and do not help `MOE_SWIGLU_DOWN`. The next credible + implementation target is GPU-side routed-MoE metadata construction for the + host-sync fallback/grouped path, taking the vLLM `moe_align_block_size` / + permute-unpermute design as the reference, not importing vLLM wholesale. + Phase110 implemented that first default-off CUDA metadata branch behind + `LLAMA_MOE_GPU_SORT=1`, reusing `mm_ids_helper` and adding a tiny inverse + permutation kernel for the fallback `get_rows` contract. The initial branch + failed `3/13` selected opt-in rows because `mm_ids_helper`'s `ids_dst` is + sorted-to-original while fallback `get_rows` needs original-to-sorted; the + inversion fix made default, W4A16, and W4A16+GPU-sort selected gates `13/13`, + and canonical md5/op gates stayed green. Keep Phase110 as a default-off + structural base only: it improves W4A16 fallback 257-token rows by `7-8%`, + but remains `~1.5x` slower than default grouped-MMQ, so it is not a parity + win by itself. + Phase111 then tried to remove the remaining W4A16 fallback host descriptor + construction by building `w4a16_tile_desc` on GPU from `expert_bounds_dev`. + The first compile needed a pointer mutability fix, then the first runtime + attempt hit a CUDA pool LIFO assertion because the outer expert-bounds + allocation was freed after an inner later allocation. 
After fixing that, + selected gates passed for the new `LLAMA_W4A16_GPU_TILES=1` path, but clean + perf was flat-to-negative versus Phase110 (`MUL_MAT_ID_RAGGED_MOE n=257` + regressed about `2.0%`). The Phase111 source was reverted; post-revert + W4A16+GPU-sort selected gates passed `13/13`. Do not carry a GPU tile + descriptor path unless it is part of a larger direct-A or graph-safe W4A16 + redesign that removes more than one host-sync/launch bottleneck. + Phase112 implemented the existing default-off `LLAMA_W4A16_DIRECT_A=1` hook + for W4A16 grouped MoE, staging bf16 activations directly from original `src1` + through `ids_to_sorted` instead of materializing a sorted f32 buffer and then + casting it. Selected gates passed for W4A16+GPU-sort, direct-A alone, and + direct-A+GPU-sort (`13/13` each). The useful arm is direct-A+GPU-sort: + `MUL_MAT_ID_RAGGED_MOE n=257` improved `2278.50 -> 2166.22 us` (`+4.93%`) + and `MOE_SWIGLU_DOWN n=257` improved `1551.08 -> 1477.74 us` (`+4.73%`) + versus Phase112's W4A16+GPU-sort control, while the 128-token rows were + neutral/slightly negative. Canonical README md5 gates are green + (`8cb0ce23`, `5951a5b4`) and compact op gates are green on the supported + rows. Keep Phase112 default-off as the next structural base; do not make it + default-on because W4A16 fallback remains slower than the default grouped-MMQ + path. + Phase113 tried the combined follow-up: + `LLAMA_W4A16_DIRECT_A=1 LLAMA_MOE_GPU_SORT=1 LLAMA_W4A16_GPU_TILES=1`. + It built W4A16 tile descriptors from GPU expert bounds and launched over a + zero-initialized `max_tiles` grid to avoid even the one-int tile-count + readback. Selected correctness stayed green (`13/13`), but perf did not meet + the keep threshold: `MOE_SWIGLU_DOWN n=257` was effectively flat + (`1478.16 -> 1476.36 us`) and `MUL_MAT_ID_RAGGED_MOE n=257` regressed + (`2148.44 -> 2214.23 us`). The Phase113 source was reverted; post-revert + Phase112 direct-A+GPU-sort selected gates passed `13/13`. 
+ Phase114 then implemented the vLLM-style padded routing contract behind + `LLAMA_W4A16_PADDED_META=1`: separate padded source ids, padded destination + ids, expert ids per M block, a padded W4A16 expert-id consumer mode, and a + direct scatter that skipped the old compact `get_rows_cuda` restore. It was + correctness-clean (`13/13`) but failed the performance gate. Initial artifact: + `/home/mudler/bench/phase114_w4a16_padded_routing/20260701_234634_padded_meta`; + fix1 artifact: + `/home/mudler/bench/phase114_w4a16_padded_routing/20260701_235003_padded_meta_fix1`. + Fix1 added `num_tokens_post_pad` early returns for padded gather/scatter, but + 257-token rows still regressed (`MOE_SWIGLU_DOWN 1477.88 -> 1726.27 us`, + `MUL_MAT_ID_RAGGED_MOE 2163.35 -> 2650.93 us`). The source was reverted and + post-revert Phase112 direct-A+GPU-sort selected gates passed `13/13`. + Phase115 then re-tested the existing default-off MoE small-M MMQ tile knob on + the current Phase108 whole-graph sentinels rather than adding another patch. + Artifact: + `/home/mudler/bench/phase115_moe_small_m_sentinel/20260702_020258`. + Control and `LLAMA_MOE_SMALL_M_TILE=16/32/64` all passed the selected + `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` correctness gate (`13/13` each), but + none met the promotion rule. The best 128-token rows were tiny/noise-level + wins, while every capped env regressed the 257-token ragged row + (`1452.30 us` control vs `1455.02`, `1458.71`, `1456.88 us`). Reject + small-M row shaping as a parity lever; the next phase should scope a true + fused routed-MoE kernel or a graph-level fusion target that removes materialized + activation/output traffic. + Phase116 implemented that graph-level probe as a default-off CUDA-only + detector for the plain `GLU -> down MUL_MAT_ID` pattern: + `LLAMA_MOE_SWIGLU_DOWN_FUSED_QUANT=1`. 
The candidate computed + `silu(gate) * up` directly into the existing grouped-MMQ NVFP4 activation + buffer, leaving the MMQ kernel and graph API unchanged. Artifact: + `/home/mudler/bench/phase116_moe_swiglu_down_fused_quant/20260702_022611`. + Correctness passed (`13/13`) and the fix1 route emitted the fused trace marker + (`6` hits), but perf failed the promotion gate: `MOE_SWIGLU_DOWN n=257` was + flat (`1024.90 -> 1024.69 us`), `n=128` regressed (`806.33 -> 808.79 us`), + and the non-fused ragged sentinel drifted slower. Source was reverted and the + post-revert selected gate passed `13/13`. Do not retry a standalone fused + SwiGLU-to-MMQ-activation-quant path; the next fused-MoE attempt must remove a + larger boundary than one activation materialization. + Phase117 added default-off boundary tracing/timing around the route-sort, + activation quantization, grouped-MMQ launch, GLU, and whole-graph pattern + detector. Artifact: + `/home/mudler/bench/phase117_moe_route_once_boundary/20260702_024140`. + The first timing run proved inline CUDA events are incompatible with CUDA + graph capture (`cudaEventSynchronize` on a capturing stream), so the trace was + guarded to emit `us=-1` during capture and real timings only with + `GGML_CUDA_DISABLE_GRAPHS=1`. Post-guard selected gates passed (`13/13`), + trace mode passed (`7/7`), and canonical gates passed: MoE md5 `8cb0ce23`, + dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. + No new runtime optimization is promoted from Phase117. The timing attribution + rejects another small route-sort or standalone GLU/quant shortcut; the next + funded MoE source phase needs a larger pipeline boundary: shared route + metadata across gate_up/down and/or an executor that owns + GEMM1->activation->GEMM2 rather than another local micro-fusion. + Phase118 tested a default-off route metadata cache/reuse prototype. Artifact: + `/home/mudler/bench/phase118_moe_route_cache/20260702_030549`. 
+ The first preflight command falsely detected `local-ai-worker` because the + check matched its own shell text; the corrected `pgrep -x local-ai-worker` + preflight was clean. The cache candidate (`LLAMA_MOE_ROUTE_CACHE=1`) was + correctness-clean and did hit (`23` hits, `3` misses on the trace row), but + did not meet the keep rule: `MOE_SWIGLU_DOWN n=257` improved only + `1017.711 -> 1011.915 us` (`+0.57%`) and `n=128` regressed + `799.360 -> 803.738 us` (`-0.55%`). Runtime cache source was reverted; the + post-reject selected gate passed `13/13`. Keep only the local ids metadata + helper refactor if final checks remain clean. This closes route-cache as a + standalone parity lever; next MoE work needs a larger executor boundary than + skipping one metadata build. + Phase119 added a default-off whole-pattern contract trace for + `gate_up MUL_MAT_ID -> views -> SWIGLU -> down MUL_MAT_ID`. Initial artifact: + `/home/mudler/bench/phase119_moe_whole_pattern_contract/20260702_034729`; + fix1 artifact: + `/home/mudler/bench/phase119_moe_whole_pattern_contract/20260702_035126_fix1`. + The initial trace proved coverage but exceeded the trace-overhead rule on + `MOE_SWIGLU_DOWN n=257` (`1015.070 -> 1028.937 us`, `-1.35%`). Fix1 moved + detector work fully off the default path unless a trace env is enabled. It is + correctness-clean (`13/13` selected, `7/7` trace), canonical md5/op clean + (MoE `8cb0ce23`, dense `5951a5b4`, `MUL_MAT 1146/1146`, + `MUL_MAT_ID 806/806`), and trace overhead is within rule: + `MOE_SWIGLU_DOWN n=128` `805.400 -> 805.584 us` (`-0.02%`) and `n=257` + `1019.715 -> 1021.836 us` (`-0.21%`). Keep Phase119 as default-off + diagnostic/contract scaffolding only. The next source phase is allowed to + implement a guarded executor, but the executor must match at the earlier + `gate_up MUL_MAT_ID` node so it can own `GEMM1->activation->GEMM2` and skip + the remaining nodes; the current GLU hook is validation-only because GEMM1 + has already executed. 
+ Phase120 added that earlier default-off matcher/trace at the + `gate_up MUL_MAT_ID` node. Initial artifact: + `/home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040153`; + fix2 artifact: + `/home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040725_fix2`. + The initial/fix1 traces proved `skip_ready=4` but emitted noisy unsupported + candidates from unrelated `MUL_MAT_ID` rows; fix2 gates output on the actual + `gate/up` view pair only. Fix2 is correctness-clean (`13/13` selected, + `7/7` early trace), canonical md5/op clean (MoE `8cb0ce23`, dense + `5951a5b4`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`), and early trace + overhead stays within rule: `MOE_SWIGLU_DOWN n=128` `803.937 -> 808.978 us` + (`-0.62%`) and `n=257` `1020.412 -> 1026.073 us` (`-0.55%`). Keep Phase120 + as the executor entry-point scaffold. The next source phase should add a + default-off executor that starts from this early matcher, first proving safe + ownership/skip accounting, then moving route-plan reuse and fused activation + into that helper. + Phase121 added that default-off executor proof behind + `LLAMA_MOE_WHOLE_PATTERN_EXEC=1`. Initial artifact: + `/home/mudler/bench/phase121_moe_whole_pattern_exec_proof/20260702_041543`; + fix1 artifact: + `/home/mudler/bench/phase121_moe_whole_pattern_exec_proof/20260702_041739_fix1`. + The initial run passed gates but emitted zero exec markers because the exec + path was incorrectly nested under the early-trace env. Fix1 made exec + detection depend on either exec or trace env. It is correctness-clean + (`13/13` selected, `7/7` exec), canonical md5/op clean (MoE `8cb0ce23`, + dense `5951a5b4`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`), and emits + `skip=4` markers for the six supported MoE rows. Perf is neutral for the + target sentinel: `MOE_SWIGLU_DOWN n=128` `807.772 -> 806.051 us` (`+0.21%`) + and `n=257` `1021.115 -> 1020.839 us` (`+0.03%`). Keep Phase121 as the + executor ownership/skip-accounting proof only. 
The next real optimization + phase should replace one internal boundary inside this helper, starting with + route-plan reuse or activation-in-route-order, while preserving this md5/op + contract. + Phase122 tested route-plan reuse inside the Phase121 executor by exposing + `ggml_cuda_mmq_ids_meta` and passing one built route to both `gate_up` and + `down` MMQ calls behind `LLAMA_MOE_WHOLE_PATTERN_SHARED_ROUTE=1`. Artifact: + `/home/mudler/bench/phase122_moe_shared_route_meta/20260702_043212`. + Correctness was clean (`13/13` selected, `7/7` shared-route), but the target + `MOE_SWIGLU_DOWN n=257` row regressed versus the Phase121 executor + (`1020.850 -> 1051.666 us`, `-3.02%`) and `n=128` also missed the keep + threshold (`808.190 -> 811.836 us`, `-0.45%`). The source was reverted, + including the public MMQ metadata API. Post-reject gates on the reverted tree + passed (`13/13` selected, `7/7` executor) with six retained Phase121 exec + markers. Do not retry route-only metadata reuse; the next MoE executor phase + should attack activation/down data layout, direct activation-to-down input, + or a larger fused GEMM1->activation->GEMM2 boundary. + Phase123 tested that direct activation-to-down input boundary inside the + Phase121 executor. Artifact: + `/home/mudler/bench/phase123_moe_executor_fused_down_input/20260702_025811`. + The candidate added an NVFP4-only fused `silu(gate) * up -> down MMQ + activation buffer` path behind + `LLAMA_MOE_WHOLE_PATTERN_FUSED_DOWN=1`. Correctness passed (`13/13` + selected, `7/7` fused-down, six fused markers), but perf was flat and missed + the keep rule: versus Phase121 exec, `MOE_SWIGLU_DOWN n=128` was + `811.153 -> 810.618 us` (`+0.07%`) and `n=257` was + `1023.090 -> 1023.657 us` (`-0.06%`). Source was reverted; post-reject + selected and Phase121 exec gates passed (`13/13`, `7/7`, six exec markers). + Do not retry standalone fused-down quantization. 
The next MoE source attempt + must either own the full expert-major packed pipeline + `GEMM1->activation->GEMM2` or pivot to another measured bottleneck. + Phase124 refreshed the current-stack graph-node serving profile after the + Phase122/123 rejections. Artifact: + `/home/mudler/bench/phase124_current_moe_profile/20260702_031205`. + Pre/post gates were green (MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, + `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`). Serving under graph-node + profiling at `N=128`, prompt `128`, generation `64` was + `agg_tps 206.2`, `decode_agg_tps 320.3`, `prefill_tps 1536.4`, wall + `39.738s`. The fine buckets explain the Phase122/123 failures: + `mmq_nvfp4` is now the largest fine bucket (`6074.78 ms`, `30.17%`) and + `gdn_core` remains essentially tied (`5888.31 ms`, `29.25%`), while + `act_quant` is only `674.88 ms` (`3.35%`). Next work should target either a + full expert-major MoE pipeline that materially reduces `mmq_nvfp4` or a GDN + source experiment that materially reduces `gdn_core`; one-boundary + activation/route shortcuts are no longer funded. Phase125 scoping used two + independent code explorers plus a local GDN audit. The challenged conclusion + is that another GDN micro-patch is not funded: prior geometry/store/broadcast + and conv-state attempts already exhausted the small safe space, while a + useful GDN change would be a larger recurrence redesign. The next source + attempt should therefore test the first maintainable slice of a vLLM-style + expert-major MoE pipeline: a default-off MMQ sorted-output primitive that + still uses expert bounds but writes sorted rows, then immediately unsorts as + a proof. Only if that primitive is correctness clean and materially improves + `MOE_SWIGLU_DOWN` should the following phase proceed to a full + `gate_up -> SWIGLU -> down` expert-major executor. + +### Phase141: GDN Decode-Only Noise Floor + +- Date: 2026-07-02. 
+- Spec: + `docs/superpowers/specs/2026-07-02-gdn-decode-noise-floor-phase141-design.md`. +- Plan: + `docs/superpowers/plans/2026-07-02-gdn-decode-noise-floor-phase141.md`. +- Result type: measurement-only; no llama.cpp source changes. +- Artifact: + `/home/mudler/bench/phase141_gdn_decode_noise_floor/20260702_090428`. +- Summary files: + - `/home/mudler/bench/phase141_gdn_decode_noise_floor/20260702_090428/summary.tsv` + - `/home/mudler/bench/phase141_gdn_decode_noise_floor/20260702_090428/runs.tsv` + +Setup: + +- Current patched Phase93 binary: + `/home/mudler/llama-phase93-qwen3next-gqa-bcast/build/bin`. +- Env: + `LLAMA_MOE_ROUTED_FFN_POC=1`, + `LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1`, + `LLAMA_MOE_ROUTED_FFN_FINALIZE_POC=1`. +- Harness: + `/home/mudler/bench/phase77_moe_decode_only_profile.sh`. +- Shape: + `N=128 N_PREDICT=2048 DEPTH_TARGET=64 CAPTURE_SECONDS=4 CTX=131072 PARALLEL=128 BATCH=2048 UBATCH=512`. + +Gates: + +- All five runs passed pre/post canonical gates: + MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 + `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT 1146/1146`, and + `MUL_MAT_ID 806/806`. 
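
A minimal sketch of how the launch-normalized `gdn_core` figures and the keep floors for this phase can be reproduced from the per-run `gdn_core` time and launch counts in the run summary. The helper names are hypothetical and the script is not part of the bench harness. It assumes the sample standard deviation (n-1) and expresses 3 sigma relative to the median, as the decision rule for this phase does.

```python
import statistics

# Per-run (gdn_core ms, gdn_core launches) copied from the Phase141 run summary.
RUNS = [
    (1420.15, 600),
    (1410.30, 598),
    (1482.14, 630),
    (1415.50, 600),
    (1410.87, 597),
]

EXPLICIT_FLOOR_PCT = 2.0  # explicit minimum keep threshold from the decision rule


def three_sigma_pct(values):
    """3 * sample stdev as a percentage of the median (hypothetical helper)."""
    return 100.0 * 3.0 * statistics.stdev(values) / statistics.median(values)


def keep_floor_pct(values):
    """Required reduction: the larger of the explicit floor and 3 sigma."""
    return max(EXPLICIT_FLOOR_PCT, three_sigma_pct(values))


# Raw bucket time versus time normalized by the number of captured launches.
raw_ms = [ms for ms, _ in RUNS]
per_launch_ms = [ms / launches for ms, launches in RUNS]

if __name__ == "__main__":
    print(f"raw gdn_core floor:        {keep_floor_pct(raw_ms):.2f}%")
    print(f"per-launch 3-sigma:        {three_sigma_pct(per_launch_ms):.2f}%")
    print(f"per-launch gdn_core floor: {keep_floor_pct(per_launch_ms):.2f}%")
```

Normalizing by launch count removes the capture-window effect of run 3 (`630` launches), which is why the per-launch spread is so much tighter than the raw one.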
+ +Run summary: + +| run | total kernel s | GDN ms | GDN launches | `gdn_core` ms | `gdn_core` launches | `gdn_core` ms/launch | `mmq_nvfp4` ms | `mmq_nvfp4` launches | +|-----|---------------:|-------:|-------------:|--------------:|--------------------:|---------------------:|---------------:|---------------------:| +| 1 | `3.553400` | `1500.210000` | `3000` | `1420.150000` | `600` | `2.366917` | `1315.460000` | `4816` | +| 2 | `3.708300` | `1492.230000` | `2994` | `1410.300000` | `598` | `2.358361` | `1470.550000` | `4801` | +| 3 | `3.678100` | `1566.780000` | `3150` | `1482.140000` | `630` | `2.352603` | `1336.250000` | `5061` | +| 4 | `3.698400` | `1495.970000` | `3000` | `1415.500000` | `600` | `2.359167` | `1458.510000` | `4820` | +| 5 | `3.620900` | `1490.630000` | `2985` | `1410.870000` | `597` | `2.363266` | `1389.990000` | `4784` | + +Variance summary: + +| metric | median | mean | stdev | CV | min | max | +|--------|-------:|-----:|------:|---:|----:|----:| +| `total_kernel_s` | `3.678100` | `3.651820` | `0.064600` | `1.769%` | `3.553400` | `3.708300` | +| `gdn_ms` | `1495.970000` | `1509.164000` | `32.419626` | `2.148%` | `1490.630000` | `1566.780000` | +| `gdn_core_ms` | `1415.500000` | `1427.792000` | `30.641160` | `2.146%` | `1410.300000` | `1482.140000` | +| `mmq_nvfp4_ms` | `1389.990000` | `1394.152000` | `69.894566` | `5.013%` | `1315.460000` | `1470.550000` | +| `gdn_core_ms_per_launch` | `2.359167` | `2.360063` | `0.005399` | `0.229%` | `2.352603` | `2.366917` | + +Decision: + +- Raw decode-only `gdn_core` is not a reliable keep/reject metric by itself + unless capture launch counts are fixed; run 3 recorded `630` core launches + while the other runs recorded `597..600`. 
+- For future GDN source A/B, require repeated medians and either: + - raw `gdn_core` reduction above `max(2.0%, 3 * 30.641160 / 1415.500000) = + 6.49%`, or + - launch-normalized `gdn_core_ms_per_launch` reduction above `2.0%` + (`3 * 0.005399 / 2.359167 = 0.69%`, so the explicit floor dominates). +- This supports a very small default-off scalar gate/beta hoist probe if it can + be kept bit-exact and measured per launch. It does not support large packed + decode recurrence source work yet; that should wait for a broader spec. + +### Phase140: GDN Decode Prep Trace + +- Date: 2026-07-02. +- Spec: + `docs/superpowers/specs/2026-07-02-gdn-decode-prep-trace-phase140-design.md`. +- Plan: + `docs/superpowers/plans/2026-07-02-gdn-decode-prep-trace-phase140.md`. +- Result type: measurement-only; no llama.cpp source changes. +- Artifact: + `/home/mudler/bench/phase140_gdn_decode_prep_trace/20260702_085348`. +- Summary file: + `/home/mudler/bench/phase140_gdn_decode_prep_trace/20260702_085348/gdn_prep_kernel_summary.tsv`. + +Setup: + +- Current patched Phase93 binary: + `/home/mudler/llama-phase93-qwen3next-gqa-bcast/build/bin`. +- Env: + `LLAMA_MOE_ROUTED_FFN_POC=1`, + `LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1`, + `LLAMA_MOE_ROUTED_FFN_FINALIZE_POC=1`, + plus route/layout trace envs. +- Shape: + `N=128 PTOK=128 GEN=64 CTX=131072 PARALLEL=128 BATCH=2048 UBATCH=512`. 
+ +Gates: + +| gate | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Serving/profile result: + +| metric | value | +|--------|------:| +| `agg_tps` | `207.3` | +| `decode_agg_tps` | `328.9` | +| `decode_perseq_tps` | `2.11` | +| `prefill_tps` | `1490.6` | +| `ttft_mean_ms` | `8325.9` | +| `ttft_max_ms` | `14593.3` | +| `wall_s` | `39.501` | +| total kernel time | `20.2002 s` | + +Key buckets: + +| bucket | ms | +|--------|---:| +| `GDN` | `6673.66` | +| `gdn_core` | `5890.44` | +| `MoE/FFN-GEMM` | `6144.19` | +| `mmq_nvfp4` | `5918.31` | +| `gdn_conv` | `454.99` | +| `gdn_gather` | `227.92` | +| `gdn_l2norm` | `100.30` | +| `gdn_sigmoid` | `22.68` | + +Focused kernel summary: + +| kernel | count | ms | avg us | +|--------|------:|---:|-------:| +| `gated_delta_net_cuda` | `4650` | `5804.7074` | `1248.3242` | +| `k_bin_bcast` | `89426` | `1155.3901` | `12.9201` | +| `convert_unary` | `52060` | `659.7529` | `12.6729` | +| `concat_non_cont` | `2130` | `441.9353` | `207.4814` | +| `ssm_conv_update_ids_f32` | `2610` | `227.8964` | `87.3166` | +| `mul_mat_f` | `3670` | `227.7857` | `62.0669` | +| `ssm_conv_long_token_f32` | `1110` | `190.6664` | `171.7715` | +| `unary_gated_op_kernel` | `14340` | `184.3254` | `12.8539` | +| `rms_norm_gate_mul_f32` | `4740` | `170.0508` | `35.8757` | +| `rms_norm_f32` | `9798` | `114.3863` | `11.6745` | +| `rms_norm_pre_add_mul_f32` | `6160` | `108.2927` | `17.5800` | +| `cpy_scalar` | `5130` | `106.8951` | `20.8373` | +| `l2_norm_f32` | `9480` | `100.3024` | `10.5804` | +| `gated_delta_net_chunked_cuda` | `90` | `85.7367` | `952.6300` | + +Decision: + +- Reject an immediate in-GDN Q/K L2-normalization source patch for this shape. 
+- `l2_norm_f32` is above the absolute Phase139 noise floor + (`3 * 17.8110 ms = 53.433 ms`) but only about `1.7%` of `gdn_core`, below + the phase's `3%` materiality rule. +- Do not spend another phase on prep-only GDN micro-fusion unless a future + profile shows prep kernels above the materiality gate. +- Next GDN work should be recurrence-level, packed-state, or datacenter + Blackwell-specific, and still default-off with md5/op gates. + +### Phase139: Serving Noise-Floor Repeat + +- Date: 2026-07-02. +- Spec: + `docs/superpowers/specs/2026-07-02-serving-noise-floor-phase139-design.md`. +- Plan: + `docs/superpowers/plans/2026-07-02-serving-noise-floor-phase139.md`. +- Result type: measurement-only; no llama.cpp source changes. +- Artifact: + `/home/mudler/bench/phase139_serving_noise_floor/20260702_081901`. +- Summary files: + - `/home/mudler/bench/phase139_serving_noise_floor/20260702_081901/summary.tsv` + - `/home/mudler/bench/phase139_serving_noise_floor/20260702_081901/runs.tsv` + +Setup: + +- Current patched Phase93 binary: + `/home/mudler/llama-phase93-qwen3next-gqa-bcast/build/bin`. +- Env: + `LLAMA_MOE_ROUTED_FFN_POC=1`, + `LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1`, + `LLAMA_MOE_ROUTED_FFN_FINALIZE_POC=1`. +- Shape: + `N=128 PTOK=128 GEN=64 CTX=131072 PARALLEL=128 BATCH=2048 UBATCH=512`. +- Harness: + `/home/mudler/bench/phase76_current_moe_profile.sh`. + +Gates: + +- All seven runs passed pre/post canonical gates: + MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 + `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT 1146/1146`, and + `MUL_MAT_ID 806/806`. 
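
A minimal sketch, under stated assumptions, of how the variance summary for one serving metric and the keep rule of this phase can be computed. It uses the seven `agg t/s` values from the run summary. `summarize` and `exceeds_noise` are hypothetical names, and expressing `3 * stdev` relative to the median is an assumption of this sketch rather than something the harness is known to do.

```python
import statistics

# Aggregate throughput (t/s) for the seven same-binary runs in the run summary.
AGG_TPS = [212.3, 208.6, 206.8, 208.5, 208.8, 203.4, 205.7]


def summarize(values):
    """Return median, mean, sample stdev and CV (%) for one metric."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return {
        "median": statistics.median(values),
        "mean": mean,
        "stdev": stdev,
        "cv_pct": 100.0 * stdev / mean,
    }


def exceeds_noise(baseline, candidate, values, floor_pct=2.0):
    """True if a gain clears max(floor_pct, 3 * stdev relative to the median).

    Relating 3 * stdev to the median is an assumption of this sketch.
    """
    stats = summarize(values)
    required_pct = max(floor_pct, 100.0 * 3.0 * stats["stdev"] / stats["median"])
    delta_pct = 100.0 * (candidate - baseline) / baseline
    return delta_pct > required_pct


if __name__ == "__main__":
    print(summarize(AGG_TPS))
    # Phase138 one-off aggregate serving numbers: 208.0 -> 209.3 t/s.
    print(exceeds_noise(208.0, 209.3, AGG_TPS))
```

The same helper can be pointed at any other column of the run summary (for example `mmq_nvfp4 ms` or `gdn_core ms`) to see how different the noise floors of the two dominant buckets are.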
+ +Run summary: + +| run | agg t/s | decode agg t/s | wall s | kernel s | MoE ms | mmq_nvfp4 ms | gdn_core ms | mmq_fixup ms | ew_add ms | +|-----|--------:|---------------:|-------:|---------:|-------:|-------------:|------------:|-------------:|----------:| +| 1 | `212.3` | `333.6` | `38.586` | `19.5196` | `5642.07` | `5464.17` | `5877.57` | `104.64` | `371.81` | +| 2 | `208.6` | `330.1` | `39.272` | `19.8779` | `5927.18` | `5719.41` | `5886.67` | `104.49` | `353.07` | +| 3 | `206.8` | `327.2` | `39.606` | `20.0228` | `5983.97` | `5756.85` | `5906.11` | `105.76` | `369.31` | +| 4 | `208.5` | `331.4` | `39.284` | `19.8543` | `5921.30` | `5702.74` | `5911.82` | `104.31` | `371.32` | +| 5 | `208.8` | `335.6` | `39.240` | `20.0571` | `5950.46` | `5720.96` | `5913.65` | `104.53` | `371.59` | +| 6 | `203.4` | `319.7` | `40.277` | `20.3933` | `6285.32` | `6049.05` | `5914.11` | `104.98` | `379.23` | +| 7 | `205.7` | `320.4` | `39.818` | `20.1422` | `6173.88` | `5978.03` | `5929.75` | `106.28` | `355.59` | + +Variance summary: + +| metric | median | mean | stdev | CV | min | max | +|--------|-------:|-----:|------:|---:|----:|----:| +| `agg_tps` | `208.5000` | `207.7286` | `2.8022` | `1.349%` | `203.4000` | `212.3000` | +| `decode_agg_tps` | `330.1000` | `328.2857` | `6.2157` | `1.893%` | `319.7000` | `335.6000` | +| `wall_s` | `39.2840` | `39.4404` | `0.5312` | `1.347%` | `38.5860` | `40.2770` | +| `kernel_s` | `20.0228` | `19.9810` | `0.2717` | `1.360%` | `19.5196` | `20.3933` | +| `moe_ms` | `5950.4600` | `5983.4543` | `204.9581` | `3.425%` | `5642.0700` | `6285.3200` | +| `mmq_nvfp4_ms` | `5720.9600` | `5770.1729` | `193.3642` | `3.351%` | `5464.1700` | `6049.0500` | +| `gdn_ms` | `6695.0800` | `6690.3629` | `17.4585` | `0.261%` | `6656.7100` | `6705.9100` | +| `gdn_core_ms` | `5911.8200` | `5905.6686` | `17.8110` | `0.302%` | `5877.5700` | `5929.7500` | +| `mmq_fixup_ms` | `104.6400` | `104.9986` | `0.7420` | `0.707%` | `104.3100` | `106.2800` | +| `ew_add_ms` | 
`371.3200` | `367.4171` | `9.4938` | `2.584%` | `353.0700` | `379.2300` | + +Decision: + +- Phase138 remains md5/op clean and focused-positive, but its one-off serving + gain (`+0.63%` aggregate, `+0.24%` decode) is inside same-binary noise. +- Do not use Phase138's single serving run as evidence to stack another + finalize/MMQ micro-patch. +- Future serving claims need repeated A/B medians and must exceed + `max(2.0%, 3 * same-binary stdev)` on aggregate throughput. With this + Phase139 stdev, that is materially higher than the Phase138 one-off delta. +- Bucket attribution also needs repeated evidence: the same binary had + `mmq_nvfp4` CV `3.351%`, so a small MMQ movement is not enough. GDN was much + steadier (`gdn_core` CV `0.302%`), making a measured GDN-side source attempt + the more defensible next phase. + +### Phase138 Attempt 2: Down-MMQ Finalize Writeback + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-moe-down-mmq-finalize-phase138.md`. +- Result type: kept source candidate, default-off; narrow serving-positive + result, not parity and not default-on. +- Focused artifact: + `/home/mudler/bench/phase138_moe_down_mmq_finalize/20260702_095927_focused`. +- Canonical gate artifact: + `/home/mudler/bench/phase138_moe_down_mmq_finalize/20260702_100202_canonical`. +- Serving/profile artifact: + `/home/mudler/bench/phase138_moe_down_mmq_finalize_serving/20260702_100330`. +- Source files changed: + - `ggml/src/ggml-cuda/ggml-cuda.cu` + - `ggml/src/ggml-cuda/mmq.cu` + - `ggml/src/ggml-cuda/mmq.cuh` + - `ggml/src/ggml-cuda/moe-ffn.cu` + - `ggml/src/ggml-cuda/moe-ffn.cuh` + - `tests/test-backend-ops.cpp` + +Implementation: + +- Added default-off `LLAMA_MOE_ROUTED_FFN_FINALIZE_POC=1`, requiring both + `LLAMA_MOE_ROUTED_FFN_POC=1` and + `LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1`. 
+- Added a finalize helper that zeroes the final output, sends router weights + and the final output pointer into the grouped down-MMQ path, and skips the + strict weighted tail only after the helper is selected. +- Added optional finalize metadata to MMQ and stream-k/fixup writeback. The + finalize branch uses the routed destination id to derive `(token, slot)` and + atomically accumulates `sum * weight` into the final token row. +- Left all existing non-finalize MMQ call sites disabled-by-default. + +Focused gates and trace: + +| route | result | +|-------|--------| +| `MOE_SWIGLU_FINALIZE` default | `7/7` | +| `MOE_SWIGLU_FINALIZE` Phase135 opt-in | `7/7` | +| `MOE_SWIGLU_FINALIZE` Phase138 finalize opt-in | `7/7` | +| Phase138 exec trace | `6` records, `FINALIZE_EXEC skip=20 tail_nodes=16` | + +Canonical gates on patched Phase93 binary: + +| route | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| Phase138 via `EXTRA_ENV` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Focused perf: + +| row | default | Phase135 | Phase138 finalize | +|-----|--------:|---------:|------------------:| +| `MOE_SWIGLU_FINALIZE nvfp4 n_tokens=128` | `198.021937 us` | `197.301518 us` | `187.134493 us` | +| `MOE_SWIGLU_FINALIZE nvfp4 n_tokens=257` | `429.235219 us` | `428.697087 us` | `384.673195 us` | + +Serving comparison: + +| metric | Phase135 opt-in | Phase138 finalize opt-in | +|--------|----------------:|--------------------------:| +| aggregate t/s | `208.0` | `209.3` | +| decode aggregate t/s | `332.7` | `333.5` | +| decode per-seq t/s | `2.12` | `2.13` | +| prefill t/s | `1475.1` | `1492.8` | +| TTFT mean | `8468.1 ms` | `8382.5 ms` | +| wall | `39.375 s` | `39.144 s` | +| total kernel time | `20.2498 s` | `20.0489 s` | + +Serving buckets: + +| bucket | Phase135 opt-in | Phase138 finalize opt-in | +|--------|----------------:|--------------------------:| +| 
`gdn_core` | `5926.55 ms` | `5914.04 ms` | +| `mmq_nvfp4` | `5915.24 ms` | `5802.87 ms` | +| `ew_mul` | `727.04 ms` | `723.65 ms` | +| `act_quant` | `677.59 ms` | `678.17 ms` | +| `get_rows` | `283.62 ms` | `283.80 ms` | +| `mmq_fixup` | `104.81 ms` | `106.06 ms` | +| `ew_add` | not listed in Phase135 top rows | `374.09 ms` | + +Serving pre/post gates: + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Decision: + +- Keep Phase138 default-off. It passes md5/op gates and beats Phase135 on the + configured keep thresholds: aggregate/decode throughput, total kernel time, + and `mmq_nvfp4`. +- Do not promote/default-on. The serving delta is small and the weighted + fan-in still appears as `ew_add 374.09 ms`, so this is not a complete tail + removal and not parity. +- Next work should either reduce the remaining fan-in/writeback path more + deeply, or pivot back to the two dominant buckets: `gdn_core` and + `mmq_nvfp4`. + +### Phase138 Attempt 1: MoE Finalize Trace And Full-Tail Sentinel + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-moe-down-mmq-finalize-phase138.md`. +- Result type: kept trace/test scaffold, default-off; no runtime speedup claim. +- Trace-only `MOE_SWIGLU_DOWN` artifact: + `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_092943`. +- Traced canonical gate artifact using the old default gate binary, superseded: + `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093003_gate`. +- Traced canonical gate artifact using patched Phase93 binary: + `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093141_gate_phase93`. 
+- Traced early-pattern gate artifact using patched Phase93 binary: + `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093243_gate_phase93_early`. +- Full-tail sentinel artifact: + `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093617_full_tail`. +- Canonical gate artifact: + `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093731_canonical`. +- Source files changed: + - `ggml/src/ggml-cuda/ggml-cuda.cu` + - `tests/test-backend-ops.cpp` + +Implementation: + +- Added default-off `LLAMA_MOE_ROUTED_FFN_FINALIZE_TRACE`. +- Added a trace-only strict tail scanner for + `down -> MUL(weights) -> VIEW/ADD rank reduction`. +- Added `MOE_SWIGLU_FINALIZE`, a whole-graph backend-op sentinel that composes + the existing `gate_up -> SWIGLU -> down` graph with the existing + router-weighted rank-add tail. +- No production finalize/writeback kernel was added in this attempt. + +Focused gates: + +| route | result | +|-------|--------| +| `MOE_SWIGLU_DOWN` + Phase135 opt-in + finalize trace | `6` early records, `0` supported tail records | +| `MOE_SWIGLU_FINALIZE` default | `7/7` | +| `MOE_SWIGLU_FINALIZE` + Phase135 opt-in + finalize trace | `7/7`, `6` supported tail records | + +Representative finalize trace row: + +| field | value | +|-------|-------| +| `supported` | `1` | +| `tail_nodes` | `16` | +| `views` | `8` | +| `adds` | `7` | +| `down_ne` | `2048x8x128` on the 128-token row | +| `weights_ne` | `1x8x128` | +| `weights_nb` | `4,4,32` | +| `final_ne` | `2048x128x1` | +| `final_nb` | `4,8192,1048576` | + +Canonical gates on patched Phase93 binary: + +| MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|---------|-----------|-----------|--------------| +| `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Decision: + +- Keep the trace/test scaffold as Phase138 groundwork. 
+- Proceed next to the default-off down-MMQ finalize/writeback implementation, + but only against `MOE_SWIGLU_FINALIZE` first. +- Do not claim a speedup from this attempt; it only proves graph availability + and preserves md5/op gates. + +### Phase136: Routed-FFN Post-Down Weighted Combine + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-routed-ffn-combine-phase136.md`. +- Result type: rejected source probe; source and sentinel test reverted. +- Focused artifact: + `/home/mudler/bench/phase136_routed_ffn_combine/20260702_083727`. +- Serving/profile artifact: + `/home/mudler/bench/phase136_routed_ffn_combine_serving/20260702_085749`. +- Source files tested and reverted: + - `ggml/src/ggml-cuda/moe-ffn.cuh` + - `ggml/src/ggml-cuda/moe-ffn.cu` + - `ggml/src/ggml-cuda/ggml-cuda.cu` + - `tests/test-backend-ops.cpp` + +Implementation tested: + +- Added `LLAMA_MOE_ROUTED_FFN_COMBINE=1` on top of Phase135. +- Extended the early routed-FFN graph hook to skip the post-down + `MUL(weights) -> VIEW* -> ADD*` tail. +- Added a separate F32 weighted-combine kernel that preserved expert-rank + accumulation order. +- Added a temporary full-tail `MOE_SWIGLU_COMBINE` sentinel for focused + correctness/perf. 
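
For orientation, the post-down `MUL(weights) -> VIEW* -> ADD*` tail that this probe replaced can be written as a small host-side reference. This is a Python sketch only, not the CUDA kernel that was tested, and both function names are hypothetical. The node breakdown (one `MUL`, one `VIEW` per expert rank, one `ADD` per rank after the first) is an assumption that happens to agree with the `tail_nodes=16`, `views=8`, `adds=7` trace row recorded for Phase138.

```python
def weighted_combine(down, weights):
    """Reference for the MUL(weights) -> VIEW* -> ADD* tail of one token.

    down:    list of per-expert-rank rows, each a list of floats
    weights: one router weight per expert rank
    Accumulates strictly in expert-rank order (rank 0 first).
    """
    if len(down) != len(weights) or not down:
        raise ValueError("need one weight per expert-rank row")
    # MUL(weights): scale every expert row by its router weight.
    scaled = [[w * x for x in row] for row, w in zip(down, weights)]
    # VIEW rank 0, then VIEW + ADD for each remaining rank, in order.
    out = list(scaled[0])
    for row in scaled[1:]:
        out = [acc + x for acc, x in zip(out, row)]
    return out


def tail_node_count(n_ranks):
    """Graph nodes in the tail: 1 MUL + n_ranks VIEWs + (n_ranks - 1) ADDs."""
    return 1 + n_ranks + (n_ranks - 1)


if __name__ == "__main__":
    down = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # 3 expert ranks, 2 columns
    weights = [0.5, 0.25, 0.25]
    print(weighted_combine(down, weights))
    print(tail_node_count(8))
```

Floating-point addition is not associative, so any fused replacement has to add the ranks in this same order if the canonical md5 gates are to stay byte-identical.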
+ +Focused gates: + +| route | result | +|-------|--------| +| default selected + full-tail sentinel | `MOE_SWIGLU_DOWN,MOE_SWIGLU_COMBINE,MUL_MAT_ID_RAGGED_MOE 20/20` | +| Phase135 selected + full-tail sentinel | `20/20` | +| Phase136 selected + full-tail sentinel | `20/20` | +| Phase136 trace | `6` combine markers, `6` `mmq_moe_quantized_raw`, `0` `mmq_moe_sorted_raw` | +| post-reject Phase135 selected | `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE 13/13` | + +Canonical focused gates: + +| route | MoE md5 | dense md5 | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-------------------|-----------|--------------| +| Phase136 via `EXTRA_ENV` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | `1146/1146` | `806/806` | + +Focused perf: + +| row | default | Phase135 | Phase136 | +|-----|--------:|---------:|---------:| +| `MOE_SWIGLU_DOWN n_tokens=128` | `803.97 us` | `805.77 us` | `806.75 us` | +| `MOE_SWIGLU_DOWN n_tokens=257` | `1020.15 us` | `1016.53 us` | `1017.11 us` | +| `MOE_SWIGLU_COMBINE n_tokens=128` | `197.98 us` | `197.74 us` | `191.04 us` | +| `MOE_SWIGLU_COMBINE n_tokens=257` | `429.22 us` | `428.53 us` | `401.81 us` | + +Serving/profile gate: + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Serving metrics at Phase130 shape: + +| metric | Phase135 opt-in | Phase136 opt-in | +|--------|----------------:|----------------:| +| aggregate t/s | `208.0` | `206.5` | +| decode aggregate t/s | `332.7` | `323.2` | +| decode per-seq t/s | `2.12` | `2.07` | +| prefill t/s | `1475.1` | `1519.5` | +| TTFT mean ms | `8468.1` | `8080.6` | +| wall s | `39.375` | `39.668` | +| total kernel time | `20.2498 s` | `19.9778 
s` | + +Serving fine buckets: + +| bucket | Phase135 opt-in | Phase136 opt-in | +|--------|----------------:|----------------:| +| `mmq_nvfp4` | `5915.24 ms` | `5885.05 ms` | +| `gdn_core` | `5926.55 ms` | `5912.65 ms` | +| `cublas_bf16_gemm` | `1782.58 ms` | `1728.15 ms` | +| `cutlass_bf16_gemm` | `756.98 ms` | `767.94 ms` | +| `ew_mul` | `727.04 ms` | `712.97 ms` | +| `ew_add` | not listed in Phase135 top rows | `374.70 ms` | +| `act_quant` | `677.59 ms` | `677.60 ms` | +| `get_rows` | `283.62 ms` | `278.31 ms` | +| `mmq_fixup` | `104.81 ms` | `103.73 ms` | + +Decision: + +- Reject and revert Phase136. The focused synthetic full-tail row improved, but + serving aggregate and decode throughput regressed versus Phase135. +- Keep Phase135 as the current default-off routed-FFN source base. +- Do not retry a separate post-MMQ weighted-combine launch next. A future + combine/finalize attempt needs to remove a larger serving-visible boundary, + likely by integrating finalize/writeback with the down projection or by + changing graph scheduling enough to reduce launches without hurting decode. + +### Phase137: GDN Geometry Sweep + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-gdn-geometry-sweep-phase137.md`. +- Result type: rejected env-only serving probe; no source changes. +- Focused artifact: + `/home/mudler/bench/phase137_gdn_geometry_sweep/20260702_091441`. +- Serving/profile artifact: + `/home/mudler/bench/phase137_gdn_geometry_serving/20260702_091740`. + +Implementation tested: + +- No source edits. +- Swept existing `GDN_NW`/`GDN_CPW` runtime knobs: + default `(16,8)`, `(8,8)`, `(16,4)`, `(8,4)`, and `(4,1)`. +- Ran serving only for the best focused candidate: + `LLAMA_MOE_ROUTED_FFN_POC=1 LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1 + GDN_NW=4 GDN_CPW=1`. 
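
A minimal sketch of the selection step described above: pick the fastest geometry on the decode-shaped (`nt=1`) focused rows, then check whether the win survives in serving. The timings are copied from the focused perf table and the serving buckets of this phase. The helper names are hypothetical, and restricting the choice to the `nt=1` rows is an assumption about how "best focused candidate" was decided.

```python
# Focused GDN timings (us) for the three nt=1 rows of this phase.
# Inner keys are GDN_NW x GDN_CPW labels as used in the table header.
NT1_ROWS = {
    "hc=32,hs=128,nt=1,kda=0": {
        "default": 6.793748, "8x8": 6.992506, "16x4": 6.161572,
        "8x4": 5.501046, "4x1": 4.713682,
    },
    "hc=32,hs=128,nt=1,kda=1": {
        "default": 7.790557, "8x8": 7.639035, "16x4": 6.553847,
        "8x4": 5.772280, "4x1": 5.194275,
    },
    "hc=4,hs=128,nt=1,nseq=2,vrep=2,bcast=1": {
        "default": 5.967364, "8x8": 4.721621, "16x4": 3.759859,
        "8x4": 3.747508, "4x1": 3.407998,
    },
}


def best_config(timings):
    """Return the label with the lowest time for one focused row."""
    return min(timings, key=timings.get)


def pct_change(before, after):
    """Signed percentage change; positive means the value grew."""
    return 100.0 * (after - before) / before


if __name__ == "__main__":
    for row, timings in NT1_ROWS.items():
        winner = best_config(timings)
        gain = -pct_change(timings["default"], timings[winner])
        print(f"{row}: {winner} ({gain:.1f}% faster than default)")
    # Serving gdn_core bucket, Phase135 opt-in -> GDN_NW=4 GDN_CPW=1 (ms).
    print(f"serving gdn_core change: {pct_change(5926.55, 6466.27):+.2f}%")
```

The focused rows and the serving bucket can point in opposite directions, which is the reason the serving run is treated as the deciding measurement.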
+ +Focused GDN perf: + +| row | default | `8x8` | `16x4` | `8x4` | `4x1` | +|-----|--------:|------:|-------:|------:|------:| +| `hc=32,hs=128,nt=1,kda=0` | `6.793748 us` | `6.992506 us` | `6.161572 us` | `5.501046 us` | `4.713682 us` | +| `hc=32,hs=128,nt=1,kda=1` | `7.790557 us` | `7.639035 us` | `6.553847 us` | `5.772280 us` | `5.194275 us` | +| `hc=4,hs=128,nt=1,nseq=2,vrep=2,bcast=1` | `5.967364 us` | `4.721621 us` | `3.759859 us` | `3.747508 us` | `3.407998 us` | +| `hc=32,hs=128,nt=64,kda=0` | `153.718880 us` | `152.660797 us` | `119.964294 us` | `94.862477 us` | `125.016141 us` | +| `hc=32,hs=128,nt=256,kda=0` | `491.066095 us` | `678.143207 us` | `495.650551 us` | `454.202876 us` | `489.942166 us` | +| `hc=32,hs=128,nt=512,kda=0` | `1033.510463 us` | `2081.115639 us` | `1197.792952 us` | `1143.683921 us` | `1025.449339 us` | +| `hc=32,hs=128,nt=1024,kda=0` | `2060.529106 us` | `4382.363825 us` | `2403.995842 us` | `2310.580042 us` | `2060.707900 us` | +| `hc=4,hs=128,nt=64,kda=0` | `151.409035 us` | `142.777045 us` | `82.000488 us` | `78.839499 us` | `26.777607 us` | +| `hc=4,hs=128,nt=256,kda=0` | `102.606410 us` | `564.485714 us` | `311.945543 us` | `301.296947 us` | `102.232357 us` | +| `hc=4,hs=128,nt=512,kda=0` | `198.996831 us` | `1127.205870 us` | `620.111479 us` | `600.911809 us` | `198.595701 us` | +| `hc=4,hs=128,nt=1024,kda=0` | `396.210102 us` | `2249.487113 us` | `1240.201770 us` | `1200.476178 us` | `395.850039 us` | + +Serving/profile gate: + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Serving metrics at Phase130 shape: + +| metric | Phase135 opt-in | Phase137 `GDN_NW=4 GDN_CPW=1` | +|--------|----------------:|-------------------------------:| +| 
aggregate t/s | `208.0` | `206.2` | +| decode aggregate t/s | `332.7` | `324.9` | +| decode per-seq t/s | `2.12` | `2.08` | +| prefill t/s | `1475.1` | `1499.4` | +| TTFT mean ms | `8468.1` | `8209.4` | +| TTFT max ms | not recorded | `14511.2` | +| wall s | `39.375` | `39.719` | +| total kernel time | `20.2498 s` | `20.7530 s` | + +Serving fine buckets: + +| bucket | Phase135 opt-in | Phase137 `GDN_NW=4 GDN_CPW=1` | +|--------|----------------:|-------------------------------:| +| `gdn_core` | `5926.55 ms` | `6466.27 ms` | +| `mmq_nvfp4` | `5915.24 ms` | `5978.87 ms` | +| `cublas_bf16_gemm` | `1782.58 ms` | `1726.10 ms` | +| `cutlass_bf16_gemm` | `756.98 ms` | `745.00 ms` | +| `ew_mul` | `727.04 ms` | `711.72 ms` | +| `ew_add` | not listed in Phase135 top rows | `367.85 ms` | +| `act_quant` | `677.59 ms` | `681.32 ms` | +| `get_rows` | `283.62 ms` | `284.31 ms` | +| `mmq_fixup` | `104.81 ms` | `103.26 ms` | + +Decision: + +- Reject Phase137. The isolated 1-token GDN rows improved, but real serving + decode, aggregate throughput, total kernel time, `gdn_core`, and `mmq_nvfp4` + all regressed versus Phase135. +- Do not edit source for a GDN launch-geometry retune. +- Next scoped source line: a default-off MoE finalize/writeback integration in + down-MMQ that removes the serving-visible `MUL(weights) -> VIEW* -> ADD*` + tail without adding a standalone combine launch. + +### Phase135: Routed-FFN Fused SWIGLU-to-NVFP4 Quant + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-routed-ffn-fused-quant-phase135.md`. +- Result type: source structural base, default-off, serving-profile positive on + decode but not parity-closing. +- Focused artifact: + `/home/mudler/bench/phase135_routed_ffn_fused_quant/20260702_081723`. +- Serving/profile artifact: + `/home/mudler/bench/phase135_routed_ffn_fused_quant_serving/20260702_082102`. 
+- Source files: + - `ggml/src/ggml-cuda/mmq.cuh` + - `ggml/src/ggml-cuda/mmq.cu` + - `ggml/src/ggml-cuda/moe-ffn.cu` + +Implementation: + +- Added `LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1` on top of + `LLAMA_MOE_ROUTED_FFN_POC=1`. +- Added `ggml_cuda_mul_mat_q_moe_quantized(...)`, a raw MMQ launcher that + accepts a caller-owned quantized activation buffer. +- Added a Blackwell/NVFP4-only fused kernel that reads `gate/up` views, uses + the existing ids metadata ordering, computes `silu(gate) * up`, and writes + `block_fp4_mmq` activation layout directly. +- MXFP4 and unsupported shapes fall back to earlier paths. + +Focused gates: + +| route | result | +|-------|--------| +| Phase135 selected | `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE 13/13` | +| Phase135 trace | `6` `mmq_moe_quantized_raw` launches, `0` `mmq_moe_sorted_raw` launches | + +Canonical focused gates: + +| route | MoE md5 | dense md5 | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-------------------|-----------|--------------| +| Phase135 via `EXTRA_ENV` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | + +Focused perf: + +| row | default | Phase134 | Phase135 | +|-----|--------:|---------:|---------:| +| `MOE_SWIGLU_DOWN n_tokens=128` | `805.920354 us` | `807.650845 us` | `807.921963 us` | +| `MOE_SWIGLU_DOWN n_tokens=257` | `1031.064815 us` | `1027.513292 us` | `1024.971370 us` | + +Serving/profile gate: + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Serving metrics at Phase130 shape: + +| metric | Phase130 default | Phase135 opt-in | +|--------|-----------------:|----------------:| +| aggregate t/s | `208.0` | `208.0` | 
+| decode aggregate t/s | `326.9` | `332.7` | +| decode per-seq t/s | `2.1` | `2.12` | +| prefill t/s | `1519.6` | `1475.1` | +| TTFT mean ms | `8170.6` | `8468.1` | +| wall s | `39.38` | `39.375` | +| total kernel time | `20.1559 s` | `20.2498 s` | + +Serving fine buckets: + +| bucket | Phase130 default | Phase135 opt-in | +|--------|-----------------:|----------------:| +| `mmq_nvfp4` | `6009.52 ms` | `5915.24 ms` | +| `gdn_core` | `5891.40 ms` | `5926.55 ms` | +| `cublas_bf16_gemm` | `1735.98 ms` | `1782.58 ms` | +| `cutlass_bf16_gemm` | `749.64 ms` | `756.98 ms` | +| `act_quant` | `675.67 ms` | `677.59 ms` | +| `get_rows` | `280.62 ms` | `283.62 ms` | +| `mmq_fixup` | not listed in Phase130 top rows | `104.81 ms` | + +Decision: + +- Keep Phase135 as the best current default-off routed-FFN base. It is + canonical-clean and reduces the dominant `mmq_nvfp4` serving bucket. +- Do not promote it as parity: aggregate serving is unchanged, prefill/TTFT are + worse, and total kernel time is slightly higher due to other buckets. +- Next work should target remaining MoE overhead after fused quant, especially + `mmq_fixup`, route/writeback, and weighted-combine/scatter boundaries, or run + a broader serving comparison to determine whether the decode improvement + persists outside this graph-node profile. + +### Phase134: Routed-FFN Fused SWIGLU-to-Sorted + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-routed-ffn-fused-swiglu-phase134.md`. +- Result type: source structural base, default-off, mixed perf. +- Artifact: + `/home/mudler/bench/phase134_routed_ffn_fused_swiglu/20260702_075828`. +- Source files: + - `ggml/src/ggml-cuda/moe-ffn.cuh` + - `ggml/src/ggml-cuda/moe-ffn.cu` + - `ggml/src/ggml-cuda/ggml-cuda.cu` + +Implementation: + +- Added `LLAMA_MOE_ROUTED_FFN_FUSED_SWIGLU=1` on top of + `LLAMA_MOE_ROUTED_FFN_POC=1`. +- Passes `gate` and `up` views into the Phase132 routed-FFN helper. 
+- Executes `gate_up`, builds ids metadata, launches a CUDA kernel to write + `silu(gate) * up` directly into expert-sorted F32 rows, then calls Phase133's + raw sorted-F32 down MMQ helper. +- The fused flag now implies the sorted-down machinery; it does not require + `LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1`. + +Selected and trace gates: + +| route | result | +|-------|--------| +| Phase134 selected | `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE 13/13` | +| Phase134 trace | `MOE_SWIGLU_DOWN 7/7`, `6` `mmq_moe_sorted_raw` launches | + +Canonical gates: + +| route | MoE md5 | dense md5 | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-------------------|-----------|--------------| +| Phase134 via `EXTRA_ENV` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | + +Focused perf sanity: + +| row | default | Phase132 | Phase133 | Phase134 | +|-----|--------:|---------:|---------:|---------:| +| `MOE_SWIGLU_DOWN n_tokens=128` | `804.920354 us` | `807.999195 us` | `808.068383 us` | `810.614642 us` | +| `MOE_SWIGLU_DOWN n_tokens=257` | `1026.024540 us` | `1028.434560 us` | `1029.015432 us` | `1025.682004 us` | + +Decision: + +- Keep Phase134 only as default-off structural plumbing. It removes the + standalone `glu -> get_rows` boundary and recovers the n=257 regression, but + the extra fused-SWIGLU kernel is still slower at n=128. +- Do not promote `LLAMA_MOE_ROUTED_FFN_FUSED_SWIGLU=1` as a speedup. +- Next work must remove one more boundary, likely by fusing SWIGLU directly + into the down-MMQ quant buffer rather than writing an intermediate sorted F32 + buffer. + +### Phase133: Routed-FFN Sorted-Down Raw MMQ + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-routed-ffn-sorted-down-phase133.md`. +- Result type: source structural base, default-off, not a speedup. +- Artifact: + `/home/mudler/bench/phase133_routed_ffn_sorted_down/20260702_074651`. 
+- Source files: + - `ggml/src/ggml-cuda/mmq.cuh` + - `ggml/src/ggml-cuda/mmq.cu` + - `ggml/src/ggml-cuda/moe-ffn.cu` + +Implementation: + +- Exposed `ggml_cuda_mmq_ids_meta` from `mmq.cuh` so the routed-FFN helper can + reuse the existing GPU ids metadata (`ids_src1`, `ids_dst`, `expert_bounds`). +- Added `ggml_cuda_mul_mat_q_moe_sorted_f32(...)`, a raw sorted-F32 MMQ entry + that accepts a compact F32 activation pointer plus `ids_dst` and + `expert_bounds` directly. +- Added `LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1` on top of + `LLAMA_MOE_ROUTED_FFN_POC=1`. The opt-in path executes baseline `gate_up` and + `SWIGLU`, gathers `SWIGLU` output into compact expert-sorted F32 rows, then + runs the raw MMQ down helper. It falls back to Phase132 if strict shape/type + checks fail. + +Selected op gates: + +| route | result | marker | +|-------|--------|--------| +| default | `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE 13/13` | none | +| Phase132 `LLAMA_MOE_ROUTED_FFN_POC=1` | `13/13` | `6` whole-pattern exec markers | +| Phase133 `LLAMA_MOE_ROUTED_FFN_POC=1 LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1` | `13/13` | `6` whole-pattern exec markers | + +Trace proof: + +- `LLAMA_QUANT_TRACE=32` with Phase133 opt-in passed `MOE_SWIGLU_DOWN 7/7`. +- `grep -c mmq_moe_sorted_raw phase133_quant_trace.log` returned `6`, proving + the raw sorted-down helper engaged for the NVFP4 rows. 
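The `grep -c` check in the trace proof counts log lines that contain the marker, not total occurrences. A minimal Python sketch of the same count, assuming one marker per launch line; the sample log lines are hypothetical, since the real `LLAMA_QUANT_TRACE` line format is not shown in this log:

```python
def count_marker_lines(log_text: str, marker: str) -> int:
    """Count lines containing `marker`, matching `grep -c` semantics."""
    return sum(1 for line in log_text.splitlines() if marker in line)


# Hypothetical trace excerpt; the real trace line format is an assumption.
sample_log = "\n".join(
    [f"trace: mmq_moe_sorted_raw row={i}" for i in range(6)]
    + ["trace: unrelated_kernel row=0"]
)

launches = count_marker_lines(sample_log, "mmq_moe_sorted_raw")
print(launches)
```

Because the count is per line, a trace that printed the marker twice on one line would still count once, so the launch count is only as reliable as the one-marker-per-line assumption.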
+ +Canonical gates: + +| route | MoE md5 | dense md5 | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-------------------|-----------|--------------| +| default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | +| Phase133 via `EXTRA_ENV` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | + +Focused perf sanity: + +| row | default | Phase132 | Phase133 | +|-----|--------:|---------:|---------:| +| `MOE_SWIGLU_DOWN n_tokens=128` | `807.369268 us` | `808.213194 us` | `808.848753 us` | +| `MOE_SWIGLU_DOWN n_tokens=257` | `1020.762195 us` | `1018.870935 us` | `1026.874233 us` | + +Decision: + +- Keep Phase133 only as default-off structural plumbing. It is correctness-clean + and proves the fake-tensor boundary can be replaced with a raw helper, but it + adds a separate gather into sorted F32 rows and is not faster. +- Do not promote `LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1` as a runtime speedup. +- Next work must remove the new overhead by fusing SWIGLU directly into sorted + rows or directly into the down-MMQ quant buffer. A standalone sorted-down + gather is not a parity lever. + +### Phase132: Default-Off Routed-FFN PoC Scaffold + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-routed-ffn-poc-phase132.md`. +- Result type: source scaffold, default-off, no math change intended. +- Artifact: + `/home/mudler/bench/phase132_routed_ffn_poc/20260702_072725`. +- Source files: + - `ggml/src/ggml-cuda/moe-ffn.cuh` + - `ggml/src/ggml-cuda/moe-ffn.cu` + - `ggml/src/ggml-cuda/ggml-cuda.cu` + +Build: + +- First incremental build failed at link because the existing CMake build + directory had not reconfigured its globbed CUDA source list, so the new + `moe-ffn.cu` object was not compiled. +- Re-running `cmake -S . 
-B build` in the DGX mirror picked up `moe-ffn.cu`; + `cmake --build build --target test-backend-ops -j"$(nproc)"` then passed. +- Symbol/string evidence: + `strings build/bin/libggml-cuda.so | grep -c LLAMA_MOE_ROUTED_FFN_POC` + returned `1`. + +Selected op gates: + +| route | result | trace | +|-------|--------|-------| +| default | `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE 13/13` | no opt-in markers | +| `LLAMA_MOE_ROUTED_FFN_POC=1` | `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE 13/13` | `6` `LLAMA_MOE_WHOLE_PATTERN_EXEC` markers | + +Canonical gates: + +| route | MoE md5 | dense md5 | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-------------------|-----------|--------------| +| default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | +| `LLAMA_MOE_ROUTED_FFN_POC=1` via `EXTRA_ENV` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | + +Focused perf sanity: + +| row | default | opt-in | delta | +|-----|--------:|-------:|------:| +| `MOE_SWIGLU_DOWN n_tokens=128` | `808.318584 us` | `804.868061 us` | `+0.43%` | +| `MOE_SWIGLU_DOWN n_tokens=257` | `1023.355828 us` | `1022.713701 us` | `+0.06%` | + +Decision: + +- Keep the Phase132 scaffold. It is correctness-clean and neutral, and it gives + the next patch a low-conflict helper boundary for a real fused routed-FFN + slice. +- Do not present Phase132 as a speedup. The helper currently executes the same + baseline `gate_up`, `SWIGLU`, and `down` nodes; it only proves default-off + ownership, capability gating, and reachability. +- Next source phase should replace one internal helper boundary with real work, + preferably a routed-FFN packed workspace or direct sorted activation/down + path that removes more traffic than Phase116/123. + +### Phase131: Fused Routed-FFN Scoping Challenge + +- Date: 2026-07-02. 
+- Plan: + `docs/superpowers/plans/2026-07-02-fused-routed-ffn-phase131.md`. +- Result type: source-selection and design-gate phase; no source changes and no + DGX benchmark artifact. +- Inputs: + - Phase130 current-stack serving profile: + `/home/mudler/bench/phase130_current_stack_profile/20260702_070949`. + - MoE explorer: `019f2140-de84-7eb2-8ab5-0c7d7de336bd`. + - GDN explorer: `019f2141-0af2-7480-bf66-4fd7e67716c5`. + +Decision: + +- Reject another incremental MoE/FFN-GEMM shortcut for Phase131. The current + stack already includes default grouped FP4-MMQ, default-off W4A16 fallback + routes, route metadata scaffolding, and whole-pattern executor ownership + proof. Prior route-only, activation-only, tile-policy, W4A16, sorted-output, + and fake-executor attempts either regressed or were noise-level. +- Reject another incremental GDN shortcut for Phase131. The remaining GDN bucket + is dominated by the f32 recurrent-state scan; the safe space around launch + geometry, gather/identity, producer fusion, store fusion, BF16 S-cache, and + grouped Q/K broadcast has already been tested and rejected under canonical + md5/KL gates. +- Continue only with a larger default-off fused routed-FFN PoC if the vLLM and + llama.cpp audits identify a concrete low-conflict hook. Otherwise, require a + standalone CUDA PoC before touching llama.cpp source. + +Gates: + +- No correctness or performance gates were run for this no-source decision + phase. +- Any follow-up source phase must use the canonical MoE md5 + `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 + `5951a5b4d624ce891e22ab5fca9bc439`, `GATED_DELTA_NET`, `MUL_MAT 1146/1146`, + `MUL_MAT_ID 806/806`, and selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` + op gates before claiming a speedup. + +### Phase130: Current-Stack Serving Profile Refresh + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-current-stack-serving-profile-phase130.md`. +- Result type: measurement-only profile; no source changes. 
+- Artifact: + `/home/mudler/bench/phase130_current_stack_profile/20260702_070949`. +- Shape: MoE `q36-35b-a3b-nvfp4`, `N=128`, prompt `128`, generation `64`, + `PARALLEL=128`, `CTX=131072`, graph-node CUDA tracing. + +Gates: + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Serving metrics: + +| metric | value | +|--------|------:| +| aggregate t/s | `208.0` | +| decode aggregate t/s | `326.9` | +| decode per-seq t/s | `2.1` | +| prefill t/s | `1519.6` | +| TTFT mean ms | `8170.6` | +| TTFT max ms | `14315.6` | +| wall s | `39.38` | +| total kernel time | `20.1559 s` | + +Macro buckets: + +| bucket | time | share | +|--------|-----:|------:| +| GDN | `6646.64 ms` | `32.98%` | +| MoE/FFN-GEMM | `6213.70 ms` | `30.83%` | +| bf16/fp8-proj | `2734.06 ms` | `13.56%` | +| layout-copy | `1260.74 ms` | `6.25%` | +| act-quant | `675.67 ms` | `3.35%` | +| gather | `280.62 ms` | `1.39%` | +| FA | `267.02 ms` | `1.32%` | + +Fine buckets: + +| bucket | time | share | +|--------|-----:|------:| +| `mmq_nvfp4` | `6009.52 ms` | `29.82%` | +| `gdn_core` | `5891.40 ms` | `29.23%` | +| `cublas_bf16_gemm` | `1735.98 ms` | `8.61%` | +| `cutlass_bf16_gemm` | `749.64 ms` | `3.72%` | +| `act_quant` | `675.67 ms` | `3.35%` | +| `convert_dtype` | `656.25 ms` | `3.26%` | +| `concat_layout` | `443.94 ms` | `2.20%` | +| `gdn_conv` | `443.80 ms` | `2.20%` | +| `get_rows` | `280.62 ms` | `1.39%` | +| `fa` | `257.38 ms` | `1.28%` | + +Decision: + +- The current serving profile remains a tied two-bucket problem: + `mmq_nvfp4` and `gdn_core` are effectively equal and far larger than every + candidate cleanup bucket. 
+- Do not spend the next source attempt on paged mask/F16 get-rows or FA cleanup: + `get_rows` and FA are below `1.5%` each in this profile, matching the older + Phase63 no-go. +- The next credible source attempt must either reduce the MoE/FFN-GEMM bucket + with a larger executor/kernel than the rejected route/activation shortcuts, or + reduce GDN with a materially different recurrent-state/packed-decode design + rather than the rejected grouped-broadcast/BF16-cache/geometry/store shapes. + +### Phase129: Qwen35 GDN Q/K Grouped Broadcast Probe + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-qwen35-gdn-qk-grouped-bcast-phase129.md`. +- Result type: source attempted, rejected, and reverted. +- Default gate artifact: + `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/default_20260702_065445`. +- Focused GDN perf artifact: + `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/perf_20260702_065728`. +- Default decode-profile artifact: + `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/decode_default_20260702_065847`. +- Valid opt-in reject artifact: + `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/decode_optin_20260702_070149/gate_pre`. +- Post-reject artifact: + `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/post_reject_20260702_070258`. +- Candidate env: + `LLAMA_QWEN35_GDN_QK_BCAST=1`. + +Candidate implementation: + +- Added a default-off `qk_bcast_grouped` branch to `src/models/qwen35.cpp` and + `src/models/qwen35moe.cpp`. +- When enabled, the branch skipped explicit Q/K repeat and called the + state-taking `build_recurrent_attn(..., state, il, true)` overload so the + existing `ggml_gated_delta_net_set_bcast()` op parameter could use grouped + Q/K indexing. +- Default source behavior remained unchanged when the env was unset. 
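A default-off candidate like this one is kept only if model output stays bit-exact against the canonical md5 values quoted by the gates throughout this log. A minimal sketch of that comparison, assuming the gate hashes the raw completion bytes; `is_bit_exact` is a hypothetical helper and does not reproduce `paged-inference-gates.sh`:

```python
import hashlib

# Canonical MoE md5 quoted by the gates in this log.
CANONICAL_MOE_MD5 = "8cb0ce23777bf55f92f63d0292c756b0"


def is_bit_exact(output: bytes, canonical_md5: str) -> bool:
    """Return True only when the md5 of `output` equals the canonical digest."""
    return hashlib.md5(output).hexdigest() == canonical_md5.lower()


# Illustrative bytes only; the real gate input is the model completion.
reference = b"example completion"
reference_md5 = hashlib.md5(reference).hexdigest()

print(is_bit_exact(reference, reference_md5))         # identical bytes pass
print(is_bit_exact(reference + b" ", reference_md5))  # any byte change fails
```

An md5 comparison is all-or-nothing: it cannot say how far a candidate drifted, only that it did, which is why a changed digest rejects a candidate before any perf profile is considered.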
+ +Evidence: + +- Default canonical gates passed: + - MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`; + - dense md5 `5951a5b4d624ce891e22ab5fca9bc439`; + - `GATED_DELTA_NET 46/46`; + - `MUL_MAT 1146/1146`; + - `MUL_MAT_ID 806/806`. +- The first standalone opt-in gate artifact + `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/optin_20260702_065604` + was not valid evidence because `paged-inference-gates.sh` only injects model + env through `EXTRA_ENV`. +- The valid opt-in gate from the decode harness used + `PROFILE_ENV="LLAMA_QWEN35_GDN_QK_BCAST=1"` and failed before profiling: + MoE md5 became `b773e2f032aa0e992626d486b321808e` instead of the canonical + `8cb0ce23777bf55f92f63d0292c756b0`. +- Focused `test-backend-ops perf -o GATED_DELTA_NET` was effectively neutral + because it exercises op fixtures, not the Qwen35 model-builder branch. The + representative rows were: + +| row | default us/run | opt-in us/run | +|-----|---------------:|--------------:| +| `head_count=32,head_size=128,n_seq_tokens=1024,qk_bcast_grouped=0` | `2064.48` | `2060.23` | +| `head_count=4,head_size=128,n_seq_tokens=256,qk_bcast_grouped=0` | `101.69` | `101.61` | +| `head_count=4,head_size=128,n_seq_tokens=64,v_repeat=2,qk_bcast_grouped=1` | `151.32` | `151.39` | + +- Default decode-profile baseline, before the valid opt-in reject: + +| metric | default | +|--------|--------:| +| total kernel time | `3.6916 s` | +| GDN macro | `1491.99 ms` (`40.42%`) | +| `gdn_core` | `1411.34 ms` (`38.23%`) | +| MoE/FFN-GEMM macro | `1475.96 ms` (`39.98%`) | +| `mmq_nvfp4` | `1458.54 ms` (`39.51%`) | + +- Post-reject rebuild removed the env string from `libllama.so` + (`strings ... | grep -c LLAMA_QWEN35_GDN_QK_BCAST == 0`) and post-reject + gates passed: MoE md5 canonical, dense md5 canonical, `GATED_DELTA_NET 46/46`, + `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. + +Decision: + +- Reject and revert Phase129 source. The candidate is not bit-exact for the + current `qwen35moe` decision model. 
+- Do not retry the same Qwen3Next grouped Q/K broadcast port for Qwen35 or + Qwen35MoE unless the quality rule is explicitly changed. The current + bit-exact md5 gate rejects it before any perf profile is meaningful. + +### Phase128: Qwen3Next GDN BF16 S-Cache Scope + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-qwen3next-gdn-bf16-s-cache-phase128.md`. +- Result type: source probe rejected and reverted. +- Default gate artifact: + `/home/mudler/bench/phase128_qwen3next_gdn_bf16_s_cache/default_20260702_043939`. +- Verbose smoke artifact: + `/home/mudler/bench/phase128_qwen3next_gdn_bf16_s_cache/smoke3_20260702_044434`. + +Candidate implementation: + +- Temporarily generalized the Qwen35/Qwen35MoE GDN S-cache selector in + `src/llama-model.cpp` to accept + `LLAMA_QWEN3NEXT_GDN_S_CACHE_TYPE=bf16` for `LLM_ARCH_QWEN3NEXT`. +- Preserved the existing `LLAMA_QWEN35_GDN_S_CACHE_TYPE=bf16` behavior. +- Reverted the source probe after validation showed it does not apply to the + current decision model and no true Qwen3Next artifact is available. + +Evidence: + +- Default `GATED_DELTA_NET` op gate passed `48/48`. +- Default canonical gates passed: + - MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`; + - dense md5 `5951a5b4d624ce891e22ab5fca9bc439`; + - `MUL_MAT` passed; + - `MUL_MAT_ID` passed. +- Verbose smoke showed the active model metadata: + `general.architecture = qwen35moe`, `print_info: arch = qwen35moe`. +- With `LLAMA_QWEN3NEXT_GDN_S_CACHE_TYPE=bf16`, recurrent cache logs still + showed `S (f32): 60.00 MiB`, as expected for a `qwen35moe` model. +- DGX search found no true Qwen3Next GGUF under `/home/mudler/bench` or + `/home/mudler`. + +Decision: + +- Reject and revert the Qwen3Next selector change for the current parity run. 
+- Do not retry the existing Qwen35/Qwen35MoE BF16 S-cache lever under the + current rules: Phase81 showed it reduced `gdn_core`, but Phase82 rejected it + because MoE md5 changed and the full f16-reference KL gate missed the hard + acceptance band. +- A future BF16-S-cache attempt needs either a deliberately re-scoped quality + gate or an actual Qwen3Next model artifact to validate. + +### Phase127: Whole-MoE Expert-Major Executor + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-moe-whole-expert-major-phase127.md`. +- Result type: source attempted, rejected, and reverted. Phase126 helper + remains. +- Red artifact: + `/home/mudler/bench/phase127_moe_whole_expert_major/red_20260702_042125`. +- Green artifact: + `/home/mudler/bench/phase127_moe_whole_expert_major/green2_20260702_042916`. +- Perf artifact: + `/home/mudler/bench/phase127_moe_whole_expert_major/perf_20260702_043104`. +- Post-reject artifact: + `/home/mudler/bench/phase127_moe_whole_expert_major/post_reject_20260702_043318`. +- Candidate env: + `LLAMA_MOE_WHOLE_EXPERT_MAJOR=1 LLAMA_MOE_WHOLE_EXPERT_MAJOR_TRACE=128`. + +Candidate implementation: + +- Added an opt-in executor at the existing early whole-pattern match. +- Built route metadata once with `ggml_cuda_launch_mm_ids_helper()`. +- Wrote `gate_up` to a sorted F32 temporary using identity `ids_dst`. +- Ran SWIGLU on a fake contiguous split-half `[2*n_ff, ne_get_rows]` tensor. +- Ran down MMQ from sorted activations through the Phase126 + `ggml_cuda_mul_mat_q_moe_with_ids(..., src1_sorted=true)` helper. +- Unpermuted once after down into the real graph destination. + +Attempt notes: + +- The red gate passed by fallback and emitted zero + `LLAMA_MOE_WHOLE_EXPERT_MAJOR` markers. +- First green attempt aborted because the executor interpreted `down_w` as + `[n_embd, n_ff, experts]`. Debug trace proved the correct shape is + `[n_ff, n_embd, experts]`; the dimension fix made the selected green gate + pass. 
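The `down_w` shape fix is consistent with ggml's dimension order: `ne[0]` is the contiguous row length, and `ggml_mul_mat` contracts over `ne[0]` of both operands. A minimal sketch of that 2D shape rule in Python; the helper name and the sample sizes are hypothetical, and whether the abort came from exactly this check is not stated in the notes:

```python
def mul_mat_out_shape(w_ne, x_ne):
    """Mimic the ggml_mul_mat 2D shape rule: contract over ne[0]."""
    if w_ne[0] != x_ne[0]:
        raise ValueError(f"ne[0] mismatch: {w_ne[0]} vs {x_ne[0]}")
    return (w_ne[1], x_ne[1])


# Hypothetical sizes for illustration only.
n_ff, n_embd, n_rows = 512, 2048, 8

# SWIGLU output rows hold n_ff values each, so down_w must lead with n_ff.
act_ne = (n_ff, n_rows)
good_down_w = (n_ff, n_embd)  # per-expert slice of [n_ff, n_embd, experts]
bad_down_w = (n_embd, n_ff)   # the misread layout from the first green attempt

print(mul_mat_out_shape(good_down_w, act_ne))  # (n_embd, n_rows)
```

With the misread layout the contraction dimensions disagree, so a strict shape check fails instead of silently producing wrong rows.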
+ +Gates: + +| gate | result | +|------|--------| +| red `MOE_SWIGLU_DOWN` | `7/7`, zero expert-major markers | +| default selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` | `13/13` | +| opt-in `MOE_SWIGLU_DOWN` | `7/7`, six expert-major markers | +| candidate canonical md5/op | skipped because perf rejected source | +| post-reject selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` | `13/13` | +| post-reject MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| post-reject dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| post-reject `MUL_MAT` | `1146/1146` | +| post-reject `MUL_MAT_ID` | `806/806` | + +Focused perf: + +| arm | `MOE_SWIGLU_DOWN n=128` | `MUL_MAT_ID_RAGGED_MOE n=128` | `MOE_SWIGLU_DOWN n=257` | `MUL_MAT_ID_RAGGED_MOE n=257` | +|-----|-------------------------:|--------------------------------:|-------------------------:|--------------------------------:| +| default | `802.57 us` | `1236.67 us` | `1023.25 us` | `1455.65 us` | +| expert-major opt-in | `812.14 us` | `1238.50 us` | `1039.36 us` | `1455.06 us` | + +Decision: + +- Reject and revert Phase127 source. The path passed correctness but missed the + keep rule: `MOE_SWIGLU_DOWN n=128` regressed about `1.2%` and `n=257` + regressed about `1.6%`; no row reached the required `>=3%` improvement. +- Do not retry the same fake-tensor whole-executor shape. It removes the early + unsort boundary but adds enough temporary traffic and quant/layout work to + lose on the focused rows. The next MoE attempt must reduce temporary traffic + or move closer to a real fused grouped MMQ/SWIGLU/down path; otherwise pivot + to the scoped GDN BF16 S-cache experiment with non-md5 numerical gates. + +### Phase126: MMQ Presorted Helper Scaffold + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-mmq-presorted-helper-phase126.md`. +- Result type: source scaffold kept; no default behavior change intended. +- Artifact: + `/home/mudler/bench/phase126_mmq_presorted_helper/fix1_20260702_040858`. 
+- Source scope: + - `ggml/src/ggml-cuda/mmq.cu` + - `ggml/src/ggml-cuda/mmq.cuh` +- Candidate implementation: + - refactored the current MoE `ggml_cuda_mul_mat_q()` id path into an + internal helper that accepts prebuilt `ids_src1`, `ids_dst`, and + `expert_bounds`; + - added the public CUDA-internal wrapper + `ggml_cuda_mul_mat_q_moe_with_ids(..., bool src1_sorted)`; + - preserved current behavior by having the existing path build metadata and + call the helper with `src1_sorted=false`; + - added `src1_sorted=true` support for the future whole-MoE executor without + wiring that executor in this phase. + +Attempt notes: + +- Initial Phase126 build/gate attempt compiled and selected gates passed, but + local review found the helper had widened the default MMQ q-buffer stride from + `n_expert_used` to `ne_get_rows`. The fix1 attempt restored the old stride + for `src1_sorted=false`; that is the accepted artifact below. +- One canonical gate invocation failed because it was nested under an outer + DGX lock while `paged-inference-gates.sh` owns the lock itself. The gate was + rerun cleanly outside the outer lock. + +Gates: + +| gate | result | +|------|--------| +| build `test-backend-ops llama-completion` | passed | +| selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` | `13/13` | +| MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| `MUL_MAT` | `1146/1146` | +| `MUL_MAT_ID` | `806/806` | + +Focused perf: + +| row | runs | us/run | TFLOPS | +|-----|-----:|-------:|-------:| +| `MOE_SWIGLU_DOWN n=128` | `1243` | `805.99` | `11.99` | +| `MUL_MAT_ID_RAGGED_MOE n=128` | `832` | `1243.85` | `2.59` | +| `MOE_SWIGLU_DOWN n=257` | `984` | `1018.74` | `19.05` | +| `MUL_MAT_ID_RAGGED_MOE n=257` | `704` | `1452.84` | `4.45` | + +Decision: + +- Keep the scaffold as Phase127 dependency. This phase is perf-neutral versus + the Phase125 baseline/control band and preserves canonical md5/op gates. 
+- Do not claim parity progress from Phase126 alone. The useful next step is to + use this helper inside the whole-pattern executor so `gate_up` output, + SWIGLU, and `down` input stay in expert-major order, with one unpermute after + the full FFN. + +### Phase125: Expert-Major Sorted Output Scope + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-moe-expert-major-sorted-output-phase125.md`. +- Result type: source implementation spec and scoped next attempt; no source + change yet. +- Subagent findings: + - llama.cpp audit: the full expert-major executor is credible but too large + for a first patch. The first slice should add a sorted-output grouped MMQ + mode so `expert_bounds` can be used without scattering through `ids_dst`. + - vLLM audit: portable ideas are expert-major layout across both GEMMs, + one permute/unpermute boundary, expert offsets for activation quant/scales, + and whole-layer measurement. CUTLASS/FlashInfer pointer-array, TMA, and + FP4 scale-swizzle contracts should not be copied into GGML/MMQ. + - local GDN challenge: Phase124's `gdn_core` bucket is material, but prior + small GDN attempts already rejected the obvious decode/core knobs. A new + GDN win would need a larger recurrence redesign, not a Phase125 shortcut. +- Decision: + - Phase125 source was tested and rejected. Do not carry + `LLAMA_MOE_EXPERT_MAJOR_SORTED_OUT`, the `mmq_args` identity-destination + flag, the MMQ sorted-output temporary, or the immediate unsort proof path. + - The full expert-major `gate_up -> SWIGLU -> down` executor remains the + right conceptual MoE target, but the first slice proved that sorted-output + plus immediate unsort is too expensive to be a stepping stone by itself. + Any follow-up must avoid adding an extra unsort boundary and must consume + sorted activations directly in the down GEMM. +- Red/baseline attempt: + - Red artifact: + `/home/mudler/bench/phase125_moe_expert_major_sorted_output/red_valid_20260702_032918`. 
+ - Baseline artifact: + `/home/mudler/bench/phase125_moe_expert_major_sorted_output/baseline_valid_20260702_032923`. + - Red env: + `LLAMA_MOE_EXPERT_MAJOR_SORTED_OUT=1 LLAMA_MOE_EXPERT_MAJOR_SORTED_TRACE=32`. + - Red result: `test-backend-ops perf -o MOE_SWIGLU_DOWN` exited `0` and + emitted `0` `LLAMA_MOE_EXPERT_MAJOR_SORTED` markers, as expected before + implementation. + - Baseline selected gate: + `test-backend-ops test -o MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` passed + `13/13`. + +Baseline perf rows: + +| row | runs | us/run | GFLOP/run | TFLOPS | +|-----|-----:|-------:|----------:|-------:| +| `MOE_SWIGLU_DOWN n=128` | `1243` | `809.70` | `9.66` | `11.93` | +| `MUL_MAT_ID_RAGGED_MOE n=128` | `832` | `1244.18` | `3.22` | `2.59` | +| `MOE_SWIGLU_DOWN n=257` | `984` | `1016.44` | `19.40` | `19.09` | +| `MUL_MAT_ID_RAGGED_MOE n=257` | `688` | `1453.65` | `6.47` | `4.45` | + +Source attempt: + +- Artifact: + `/home/mudler/bench/phase125_moe_expert_major_sorted_output/20260702_033931`. +- Candidate env: + `LLAMA_MOE_EXPERT_MAJOR_SORTED_OUT=1 LLAMA_MOE_EXPERT_MAJOR_SORTED_TRACE=32`. +- Candidate implementation: + - added an internal `mmq_args` identity-destination flag; + - wrote NVFP4 grouped MMQ output to a sorted temporary when the env was set; + - inverted `ids_dst` on GPU and immediately used `get_rows_cuda` to restore + the normal destination layout; + - emitted bounded `LLAMA_MOE_EXPERT_MAJOR_SORTED` trace markers. +- Correctness: + - default selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE`: `13/13`; + - opt-in sorted `MOE_SWIGLU_DOWN`: `7/7`; + - opt-in correctness markers: `12` (`gate_up` and `down` for six NVFP4 + rows). 
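The sorted-output proof in the candidate implementation above writes rows in expert-sorted order, then inverts `ids_dst` and gathers to restore the normal layout. A minimal CPU sketch of that round trip in plain Python, assuming `ids_dst[i]` is the destination row of sorted row `i`; the GPU kernels and the `get_rows_cuda` signature are not reproduced here:

```python
def invert_permutation(ids_dst):
    """Return inv such that inv[ids_dst[i]] == i."""
    inv = [0] * len(ids_dst)
    for sorted_row, dst_row in enumerate(ids_dst):
        inv[dst_row] = sorted_row
    return inv


def get_rows(rows, index):
    """Gather: out[j] = rows[index[j]]."""
    return [rows[i] for i in index]


# Hypothetical mapping: sorted row i belongs at destination row ids_dst[i].
ids_dst = [2, 0, 3, 1]
sorted_rows = ["s0", "s1", "s2", "s3"]  # stand-ins for expert-sorted output rows

# Direct scatter (the normal path): dst[ids_dst[i]] = sorted_rows[i].
scattered = [None] * len(ids_dst)
for i, d in enumerate(ids_dst):
    scattered[d] = sorted_rows[i]

# Sorted temporary plus inverse gather (the proof path).
restored = get_rows(sorted_rows, invert_permutation(ids_dst))

print(restored == scattered)
```

Both paths produce the same destination layout; the proof path simply adds a second pass over the output rows plus the permutation inversion.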
+ +Perf: + +| arm | `MOE_SWIGLU_DOWN n=128` | `MUL_MAT_ID_RAGGED_MOE n=128` | `MOE_SWIGLU_DOWN n=257` | `MUL_MAT_ID_RAGGED_MOE n=257` | +|-----|-------------------------:|--------------------------------:|-------------------------:|--------------------------------:| +| control | `806.13 us` | `1250.99 us` | `1027.15 us` | `1457.69 us` | +| Phase121 exec | `805.16 us` | `1247.92 us` | `1023.83 us` | `1457.67 us` | +| sorted-output proof | `888.76 us` | `1283.17 us` | `1192.05 us` | `1528.27 us` | + +Rejection: + +- Reject and revert. The proof passed correctness, but it badly missed the keep + rule: versus Phase121 exec, `MOE_SWIGLU_DOWN n=128` regressed by about + `10.4%` and `n=257` regressed by about `16.4%`. The ragged standalone row + also regressed. +- Post-reject artifact: + `/home/mudler/bench/phase125_moe_expert_major_sorted_output/post_reject_20260702_034232`. +- Post-reject gates: + - build: `0`; + - selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE`: `13/13`; + - retained Phase121 exec `MOE_SWIGLU_DOWN`: `7/7`, six exec markers; + - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`; + - dense md5: `5951a5b4d624ce891e22ab5fca9bc439`; + - `MUL_MAT`: `1146/1146`; + - `MUL_MAT_ID`: `806/806`. + +### Phase124: Current MoE Serving Graph-Node Refresh + +- Date: 2026-07-02. +- Artifact: + `/home/mudler/bench/phase124_current_moe_profile/20260702_031205`. +- Result type: current-stack llama.cpp graph-node serving profile; no source + change. +- Shape: MoE `q36-35b-a3b-nvfp4`, `N=128`, `PTOK=128`, `GEN=64`, + `PARALLEL=128`, `CTX=131072`, `BATCH=2048`, `UBATCH=512`. +- Profiler: `nsys launch --cuda-graph-trace=node`, bucketed with + `/home/mudler/bench/bucket2.py`. 
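The bucket shares reported in these profiles are consistent with each bucket's kernel time divided by total kernel time. A minimal sketch of that arithmetic, using the Phase130 GDN macro row and total kernel time from this log as the worked numbers; the function name is hypothetical and this is not the logic of `bucket2.py`:

```python
def bucket_share_pct(bucket_ms: float, total_kernel_s: float) -> float:
    """Share of total kernel time, in percent."""
    return 100.0 * bucket_ms / (total_kernel_s * 1000.0)


# Phase130: GDN macro bucket 6646.64 ms, total kernel time 20.1559 s.
gdn_share = bucket_share_pct(6646.64, 20.1559)
print(f"{gdn_share:.2f}%")  # 32.98%
```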
+ +Gates: + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Serving result under graph-node profiling: + +| n | agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | ttft_mean_ms | wall_s | +|--:|--------:|---------------:|------------------:|------------:|-------------:|-------:| +| `128` | `206.2` | `320.3` | `2.11` | `1536.4` | `8826.7` | `39.738` | + +Macro buckets: + +| bucket | time ms | share | instances | +|--------|--------:|------:|----------:| +| GDN | `6665.04` | `33.10%` | `20790` | +| MoE/FFN-GEMM | `6246.97` | `31.03%` | `52484` | +| bf16/fp8-proj | `2687.28` | `13.35%` | `51960` | +| layout-copy | `1259.59` | `6.26%` | `79100` | +| ew-mul(weight/norm/GDN) | `728.03` | `3.62%` | `50422` | +| act-quant | `674.88` | `3.35%` | `36084` | +| FA | `264.14` | `1.31%` | `3530` | + +Fine buckets: + +| bucket | macro | time ms | share | instances | +|--------|-------|--------:|------:|----------:| +| `mmq_nvfp4` | MoE/FFN-GEMM | `6074.78` | `30.17%` | `33204` | +| `gdn_core` | GDN | `5888.31` | `29.25%` | `4500` | +| `cublas_bf16_gemm` | bf16/fp8-proj | `1722.37` | `8.55%` | `21970` | +| `cutlass_bf16_gemm` | bf16/fp8-proj | `766.57` | `3.81%` | `26380` | +| `ew_mul` | ew-mul(weight/norm/GDN) | `723.07` | `3.59%` | `46494` | +| `act_quant` | act-quant | `674.88` | `3.35%` | `36084` | +| `convert_dtype` | layout-copy | `660.48` | `3.28%` | `51300` | +| `gdn_conv` | GDN | `457.10` | `2.27%` | `6960` | +| `concat_layout` | layout-copy | `440.02` | `2.19%` | `2040` | + +Decision: + +- Phase124 confirms the current serving gap is still a two-bucket problem: + `mmq_nvfp4` and `gdn_core` together account for about `59.4%` of kernel + time. 
+- The `act_quant` bucket is only `3.35%`, explaining why Phase116/123 + fused-activation shortcuts did not move end-to-end rows. +- Do not fund more route-only, activation-only, or tile-policy MoE shortcuts. + Next source work must either own the full expert-major MoE pipeline to reduce + `mmq_nvfp4`, or attack `gdn_core` with a default-off GDN decode experiment + measured against this Phase124/Phase77 bucket. + +### Phase123: MoE Executor Fused Down Input + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-moe-executor-fused-down-input-phase123.md`. +- Artifact: + `/home/mudler/bench/phase123_moe_executor_fused_down_input/20260702_025811`. +- Red check artifact: + `/home/mudler/bench/phase123_moe_executor_fused_down_input/red_20260702_025031`. +- Candidate env: + `LLAMA_MOE_WHOLE_PATTERN_EXEC=1 LLAMA_MOE_WHOLE_PATTERN_FUSED_DOWN=1`. +- Source decision: reject and revert. Do not carry the + `LLAMA_MOE_WHOLE_PATTERN_FUSED_DOWN` env, NVFP4 fused SwiGLU quant kernel, + or `ggml_cuda_mul_mat_q_moe_swiglu_down()` helper. 
+ +Gates: + +| gate | result | trace markers | +|------|--------|---------------| +| red check fused-down trace before implementation | `7/7` test rows | `0` fused-down markers | +| default selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` | `13/13` | n/a | +| fused-down `MOE_SWIGLU_DOWN` | `7/7` | `6` fused-down markers | +| post-reject selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` | `13/13` | n/a | +| post-reject Phase121 exec `MOE_SWIGLU_DOWN` | `7/7` | `6` exec markers | + +Perf: + +| arm | `MOE_SWIGLU_DOWN n=128` | `MUL_MAT_ID_RAGGED_MOE n=128` | `MOE_SWIGLU_DOWN n=257` | `MUL_MAT_ID_RAGGED_MOE n=257` | +|-----|-------------------------:|--------------------------------:|-------------------------:|--------------------------------:| +| control | `812.340097 us` | `1242.909856 us` | `1021.592480 us` | `1461.043605 us` | +| Phase121 exec | `811.152856 us` | `1248.876202 us` | `1023.089980 us` | `1455.405523 us` | +| fused-down | `810.617860 us` | `1250.528750 us` | `1023.657464 us` | `1459.239826 us` | + +Decision: + +- Reject the standalone fused-down activation quantization path. It passed + correctness, but the target row was flat-to-negative and far below the `2%` + keep rule. +- Keep Phase121 executor proof only. The next MoE attempt should not be another + one-boundary activation materialization shortcut; it needs a full + expert-major packed pipeline or a different measured bottleneck. + +### Phase122: MoE Shared Route Metadata + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-moe-shared-route-meta-phase122.md`. +- Artifact: + `/home/mudler/bench/phase122_moe_shared_route_meta/20260702_043212`. +- Candidate env: + `LLAMA_MOE_WHOLE_PATTERN_EXEC=1 LLAMA_MOE_WHOLE_PATTERN_SHARED_ROUTE=1`. +- Source decision: reject and revert. Do not carry the public + `ggml_cuda_mmq_ids_meta` API, shared-route executor helper, or + `LLAMA_MOE_WHOLE_PATTERN_SHARED_ROUTE` env. 
+ +Gates: + +| gate | result | trace markers | +|------|--------|---------------| +| default selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` | `13/13` | n/a | +| shared-route `MOE_SWIGLU_DOWN` | `7/7` | `6` shared-route markers | +| post-reject selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` | `13/13` | n/a | +| post-reject Phase121 exec `MOE_SWIGLU_DOWN` | `7/7` | `6` exec markers | + +Perf: + +| arm | `MOE_SWIGLU_DOWN n=128` | `MUL_MAT_ID_RAGGED_MOE n=128` | `MOE_SWIGLU_DOWN n=257` | `MUL_MAT_ID_RAGGED_MOE n=257` | +|-----|-------------------------:|--------------------------------:|-------------------------:|--------------------------------:| +| control | `808.519710 us` | `1245.913462 us` | `1022.664622 us` | `1457.690407 us` | +| Phase121 exec | `808.189863 us` | `1250.302500 us` | `1020.849593 us` | `1461.318314 us` | +| shared-route | `811.836039 us` | `1246.143029 us` | `1051.665618 us` | `1449.548295 us` | + +Decision: + +- Reject the shared-route metadata API/path: it did not meet the keep rule and + regressed the target `MOE_SWIGLU_DOWN n=257` row by about `3%` versus the + Phase121 executor. +- Keep Phase121 executor proof only. Route-only reuse is closed as a parity + lever; the next executor scope must remove a larger activation/down boundary. + +### Phase121: MoE Whole-Pattern Exec Proof + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-moe-whole-pattern-exec-proof-phase121.md`. +- Initial artifact: + `/home/mudler/bench/phase121_moe_whole_pattern_exec_proof/20260702_041543`. +- Fix1 artifact: + `/home/mudler/bench/phase121_moe_whole_pattern_exec_proof/20260702_041739_fix1`. +- Source decision: keep fix1 default-off executor proof; it proves ownership + and skip accounting but does not yet fuse work. 
+ +Gates: + +| run | result | +|-----|--------| +| fix1 selected default, `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` | `13/13` | +| fix1 exec proof, `LLAMA_MOE_WHOLE_PATTERN_EXEC=1 MOE_SWIGLU_DOWN` | `7/7` | +| fix1 MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| fix1 dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| fix1 `MUL_MAT` gate | `1146/1146` | +| fix1 `MUL_MAT_ID` gate | `806/806` | + +Perf: + +| row | control us | exec us | change | +|-----|-----------:|--------:|-------:| +| `MOE_SWIGLU_DOWN n_tokens=128` | `807.772325` | `806.051488` | `+0.21%` | +| `MOE_SWIGLU_DOWN n_tokens=257` | `1021.114837` | `1020.839431` | `+0.03%` | +| `MUL_MAT_ID_RAGGED_MOE n=128` | `1243.250000` | `1243.313702` | `-0.01%` | +| `MUL_MAT_ID_RAGGED_MOE n=257` | `1450.889205` | `1456.279070` | `-0.37%` | + +Trace: + +- Initial run passed correctness but emitted `0` exec markers because the exec + branch was accidentally nested under the early trace env condition. +- Fix1 exec gate emitted `6` `skip=4` markers for the supported correctness + rows. +- Fix1 exec perf emitted `6` `skip=4` markers covering `n_tokens=128` and + `n_tokens=257`. + +Decision: + +- Keep the default-off executor proof. +- It changes no default behavior and proves that the early matcher can own + `gate_up`, skip both views, execute `GLU` and `down`, and return `4`. +- Next phase should turn the proof helper into a useful executor by replacing + one internal boundary at a time. The most defensible next slice is route-plan + reuse inside the helper or activation in route-slot order, not another graph + detector. + +### Phase120: MoE Early Whole-Pattern Matcher + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-moe-early-whole-pattern-phase120.md`. +- Initial artifact: + `/home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040153`. +- Fix1 artifact: + `/home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040515_fix1`. 
+- Fix2 artifact: + `/home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040725_fix2`. +- Source decision: keep fix2 default-off early matcher/trace; no execution is + skipped yet. + +Gates: + +| run | result | +|-----|--------| +| fix2 selected default, `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` | `13/13` | +| fix2 early trace, `LLAMA_MOE_WHOLE_PATTERN_EARLY_TRACE=16 MOE_SWIGLU_DOWN` | `7/7` | +| fix2 MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| fix2 dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| fix2 `MUL_MAT` gate | `1146/1146` | +| fix2 `MUL_MAT_ID` gate | `806/806` | + +Perf: + +| row | control us | early trace us | change | +|-----|-----------:|---------------:|-------:| +| `MOE_SWIGLU_DOWN n_tokens=128` | `803.937002` | `808.978278` | `-0.62%` | +| `MOE_SWIGLU_DOWN n_tokens=257` | `1020.411585` | `1026.072597` | `-0.55%` | +| `MUL_MAT_ID_RAGGED_MOE n=128` | `1246.259615` | `1243.800481` | `+0.20%` | +| `MUL_MAT_ID_RAGGED_MOE n=257` | `1456.428779` | `1456.109012` | `+0.02%` | + +Trace: + +- Initial artifact emitted `96` early markers with only `6` supported rows; + fix1 emitted `104` markers with only `6` supported rows. +- Fix2 emits exactly `6` early markers, all supported, covering + `n_tokens=128` and `n_tokens=257`. +- The fix2 marker proves the executor entry contract before GEMM1 dispatch: + `skip_ready=4`, `ids_match=1`, `swiglu=1`, `n_used=8`, `experts=128`, + `n_embd=2048`, `n_ff=768`. + +Decision: + +- Keep the default-off early matcher/trace. +- This does not improve runtime by itself; it establishes the correct hook for + the next executor attempt. +- Next phase should add a guarded executor at this matcher. First prove that it + can own the five-node sequence and return `4` only after reproducing the + existing outputs, then move useful work into the helper: route-plan reuse + across both expert GEMMs, activation in route-slot order, and later direct + weighted combine. 
+ +### Phase119: MoE Whole-Pattern Contract + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-moe-whole-pattern-contract-phase119.md`. +- Initial artifact: + `/home/mudler/bench/phase119_moe_whole_pattern_contract/20260702_034729`. +- Fix1 artifact: + `/home/mudler/bench/phase119_moe_whole_pattern_contract/20260702_035126_fix1`. +- Source decision: keep default-off contract trace after fix1; no runtime + executor yet. + +Gates: + +| run | result | +|-----|--------| +| fix1 selected default, `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` | `13/13` | +| fix1 trace gate, `LLAMA_MOE_WHOLE_PATTERN_TRACE=16 MOE_SWIGLU_DOWN` | `7/7` | +| fix1 MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| fix1 dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| fix1 `MUL_MAT` gate | `1146/1146` | +| fix1 `MUL_MAT_ID` gate | `806/806` | + +Initial perf: + +| row | control us | trace us | change | +|-----|-----------:|---------:|-------:| +| `MOE_SWIGLU_DOWN n_tokens=128` | `809.251810` | `811.777597` | `-0.31%` | +| `MOE_SWIGLU_DOWN n_tokens=257` | `1015.069697` | `1028.937243` | `-1.35%` | +| `MUL_MAT_ID_RAGGED_MOE n=128` | `1247.114183` | `1247.876202` | `-0.06%` | +| `MUL_MAT_ID_RAGGED_MOE n=257` | `1450.355114` | `1456.109012` | `-0.40%` | + +Fix1 perf: + +| row | control us | trace us | change | +|-----|-----------:|---------:|-------:| +| `MOE_SWIGLU_DOWN n_tokens=128` | `805.399839` | `805.584071` | `-0.02%` | +| `MOE_SWIGLU_DOWN n_tokens=257` | `1019.715447` | `1021.836382` | `-0.21%` | +| `MUL_MAT_ID_RAGGED_MOE n=128` | `1247.504808` | `1247.542067` | `-0.00%` | +| `MUL_MAT_ID_RAGGED_MOE n=257` | `1458.351744` | `1454.090116` | `+0.29%` | + +Trace: + +- Initial and fix1 trace perf emitted `6` whole-pattern markers. +- Fix1 covered supported NVFP4 contract rows at `n_tokens=128` and + `n_tokens=257`: `view_pair=1`, `ids_match=1`, `swiglu=1`, + `n_used=8`, `experts=128`, `n_embd=2048`, `n_ff=768`. 
+- The trace gate also covered smaller correctness shapes; the F32 row reports + `supported=0` by design because the executor target is native FP4. + +Decision: + +- Keep the default-off trace/contract scaffold. +- This phase does not promote a runtime optimization. +- The next executor attempt should be matched from the earlier + `gate_up MUL_MAT_ID` node, not from the current `GLU -> down` validation + hook, so it can own route-plan reuse, GEMM1, activation, GEMM2, and later + weighted combine. + +### Phase118: MoE Route Cache + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-moe-route-cache-phase118.md`. +- Artifact: + `/home/mudler/bench/phase118_moe_route_cache/20260702_030549`. +- Source decision: reject and revert runtime cache; keep helper refactor only. + +Preflight note: + +- The initial `pgrep -af "[l]ocal-ai-worker"` preflight was a false positive + because the remote shell contained the literal text `local-ai-worker busy`. + Corrected follow-up used `pgrep -x local-ai-worker`; Docker, worker, and GPU + compute-app checks were clean. + +Gates: + +| run | result | +|-----|--------| +| helper refactor selected gate | `13/13` | +| cache default selected gate | `13/13` | +| cache opt-in selected gate, `LLAMA_MOE_ROUTE_CACHE=1` | `13/13` | +| post-reject selected gate | `13/13` | + +Perf: + +| row | baseline us | cache us | change | +|-----|------------:|---------:|-------:| +| `MOE_SWIGLU_DOWN n_tokens=128` | `799.360447` | `803.738437` | `-0.55%` | +| `MOE_SWIGLU_DOWN n_tokens=257` | `1017.711382` | `1011.915152` | `+0.57%` | +| `MUL_MAT_ID_RAGGED_MOE n=128` | `1239.332933` | `1239.560096` | `-0.02%` | +| `MUL_MAT_ID_RAGGED_MOE n=257` | `1447.588068` | `1441.795455` | `+0.40%` | + +Trace: + +- `LLAMA_MOE_ROUTE_CACHE=1 LLAMA_MOE_ROUTE_CACHE_TRACE=128` on + `MOE_SWIGLU_DOWN n_tokens=128`: `23` hits, `3` misses. + +Decision: + +- Reject and revert the runtime route cache. 
It proves reuse is possible, but + the win is too small for the additional context-owned state and graph-capture + lifetime surface. +- Keep only the local `ggml_cuda_mmq_ids_meta` helper refactor as low-conflict + groundwork for a future whole-pattern executor. + +### Phase117: MoE Route-Once Boundary Timing + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-moe-route-once-boundary-phase117.md`. +- Artifact: + `/home/mudler/bench/phase117_moe_route_once_boundary/20260702_024140`. +- Trace env: + `LLAMA_MOE_BOUNDARY_TRACE=1`; optional timings with + `LLAMA_MOE_BOUNDARY_TIMING=1`. +- Source decision: keep default-off diagnostic trace only; no runtime + optimization promoted. + +Gates: + +| run | result | +|-----|--------| +| post-guard selected default, `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` | `13/13` | +| post-guard trace/timing, `MOE_SWIGLU_DOWN` | `7/7`, `50` trace lines | +| canonical MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| canonical dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| canonical `MUL_MAT` | `1146/1146` | +| canonical `MUL_MAT_ID` | `806/806` | + +Perf / timing: + +| row | perf us | boundary medians | +|-----|--------:|------------------| +| graph-enabled `MOE_SWIGLU_DOWN n=128`, trace+timing guarded | `806.271923` | capture emits `us=-1` after graph warmup | +| no-graph `MOE_SWIGLU_DOWN n=128` | `821.530713` | gate_up: sort `8.992`, quant `103.840`, mmq `1218.656`; down: sort `8.800`, quant `50.720`, mmq `632.768`; GLU `26.240` | +| no-graph `MOE_SWIGLU_DOWN n=257` | `1079.544086` | gate_up: sort `13.376`, quant `185.632`, mmq `1297.728`; down: sort `13.952`, quant `83.808`, mmq `672.096`; GLU `51.232` | +| no-graph `MUL_MAT_ID_RAGGED_MOE n=128` | `1255.156250` | sort `8.896`, quant `99.232`, mmq `1133.472` | +| no-graph `MUL_MAT_ID_RAGGED_MOE n=257` | `1531.667683` | sort `14.624`, quant `174.464`, mmq `1263.360` | + +Notes: + +- Inline CUDA events cannot be synchronized inside CUDA graph capture. 
The + guard is required: graph-enabled timing no longer aborts, but captured + sections report `us=-1`; use `GGML_CUDA_DISABLE_GRAPHS=1` only for boundary + attribution. +- The route-sort bucket is small, and standalone GLU/down-quant is not enough + after the Phase116 flat result. Do not fund another small sort/tile/quant + shortcut from this evidence. +- Next source work should be a larger MoE pipeline: route-once metadata shared + by both expert GEMMs and/or whole-pattern GEMM1->activation->GEMM2 ownership. + +### Phase116: MoE SwiGLU Down Fused Quant + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-02-moe-swiglu-down-fused-quant-phase116.md`. +- Artifact: + `/home/mudler/bench/phase116_moe_swiglu_down_fused_quant/20260702_022611`. +- Env under test: + `LLAMA_MOE_SWIGLU_DOWN_FUSED_QUANT=1`. +- Source decision: rejected and reverted. + +Selected gates: + +| run | selected gate | route marker | +|-----|---------------|--------------| +| control | `13/13` | n/a | +| initial candidate | `13/13` | absent | +| fix1 candidate | `13/13` | present, `6` hits | +| post-revert | `13/13` | n/a | + +Perf: + +| op | shape | control us | fused us | candidate change | +|----|-------|-----------:|---------:|-----------------:| +| `MOE_SWIGLU_DOWN` | `n_tokens=128` | `806.332261` | `808.791633` | `-0.30%` | +| `MUL_MAT_ID_RAGGED_MOE` | `n=128` | `1241.147837` | `1245.063702` | `-0.32%` | +| `MOE_SWIGLU_DOWN` | `n_tokens=257` | `1024.895706` | `1024.685072` | `+0.02%` | +| `MUL_MAT_ID_RAGGED_MOE` | `n=257` | `1454.116279` | `1455.965116` | `-0.13%` | + +Decision: + +- Reject and revert Phase116. +- The route is technically feasible without a new ggml op or MMQ kernel change, + but fusing only `SWIGLU` into MMQ activation quantization is too small to move + GB10 parity. +- Do not retry this exact standalone fused-quant path. 
The next credible fused + routed-MoE phase needs route-once metadata shared by both expert GEMMs plus a + larger fused GEMM1/activation/GEMM2 or weighted-combine/scatter boundary. + +### Phase115: MoE Small-M Sentinel A/B + +- Date: 2026-07-02. +- Plan: + `docs/superpowers/plans/2026-07-01-moe-small-m-sentinel-phase115.md`. +- Artifact: + `/home/mudler/bench/phase115_moe_small_m_sentinel/20260702_020258`. +- Env under test: + `LLAMA_MOE_SMALL_M_TILE=16`, `LLAMA_MOE_SMALL_M_TILE=32`, + `LLAMA_MOE_SMALL_M_TILE=64`. +- Source decision: no source change; reject as a parity lever. + +Selected gates: + +| env | selected gate | +|-----|---------------| +| control | `13/13` | +| `LLAMA_MOE_SMALL_M_TILE=16` | `13/13` | +| `LLAMA_MOE_SMALL_M_TILE=32` | `13/13` | +| `LLAMA_MOE_SMALL_M_TILE=64` | `13/13` | + +Perf: + +| env | `MOE_SWIGLU_DOWN` 128 us | `MUL_MAT_ID_RAGGED_MOE` 128 us | `MOE_SWIGLU_DOWN` 257 us | `MUL_MAT_ID_RAGGED_MOE` 257 us | +|-----|-------------------------:|-------------------------------:|-------------------------:|-------------------------------:| +| control | `809.814159` | `1247.719952` | `1021.508130` | `1452.301136` | +| `LLAMA_MOE_SMALL_M_TILE=16` | `804.780370` | `1241.008413` | `1020.710366` | `1455.017442` | +| `LLAMA_MOE_SMALL_M_TILE=32` | `809.751408` | `1242.140625` | `1021.155488` | `1458.712209` | +| `LLAMA_MOE_SMALL_M_TILE=64` | `807.938858` | `1247.765625` | `1021.431911` | `1456.875000` | + +Decision: + +- Reject small-M row shaping for the current stack. +- This confirms the older Phase33 serving-level rejection on the newer + whole-graph sentinels: smaller MoE token tiles are correctness-safe, but the + 257-token ragged down path does not improve. +- Do not add a down-name special case or another tile-policy shortcut. Phase116 + should scope a fused routed-MoE kernel or graph-level fusion that avoids + materializing intermediate activation/output traffic. + +### Phase114: W4A16 Padded Routing + +- Date: 2026-07-01. 
+- Plan: + `docs/superpowers/plans/2026-07-01-w4a16-padded-routing-phase114.md`. +- Initial artifact: + `/home/mudler/bench/phase114_w4a16_padded_routing/20260701_234634_padded_meta`. +- Fix1 artifact: + `/home/mudler/bench/phase114_w4a16_padded_routing/20260701_235003_padded_meta_fix1`. +- Env under test: + `LLAMA_W4A16_PREFILL_M=128 LLAMA_W4A16_DIRECT_A=1 LLAMA_MOE_GPU_SORT=1 LLAMA_W4A16_PADDED_META=1`. +- Source decision: rejected and reverted. + +Selected gates: + +| run | control | candidate | +|-----|---------|-----------| +| initial padded metadata | `13/13` | `13/13` | +| fix1 with `num_tokens_post_pad` early returns | `13/13` | `13/13` | +| post-revert Phase112 control | `13/13` | n/a | + +Fix1 perf: + +| op | shape | Phase112 control us | Phase114 fix1 us | candidate change | +|----|-------|--------------------:|-----------------:|-----------------:| +| `MOE_SWIGLU_DOWN` | `n_tokens=128` | `805.094932` | `804.176236` | `+0.11%` | +| `MUL_MAT_ID_RAGGED_MOE` | `n=128` | `1243.722356` | `1245.055288` | `-0.11%` | +| `MOE_SWIGLU_DOWN` | `n_tokens=257` | `1477.876106` | `1726.273196` | `-16.81%` | +| `MUL_MAT_ID_RAGGED_MOE` | `n=257` | `2163.346983` | `2650.932292` | `-22.54%` | + +Decision: + +- Reject and revert Phase114. +- The vLLM-style padded metadata contract is correctness-feasible in llama.cpp, + but a naive padded consumer does too much padded gather/GEMM/scatter work for + sparse expert occupancy on these GB10 test rows. +- Do not retry this exact padded-W4A16 route unless the kernel is changed to + avoid padded activation/output traffic, or the work shifts to a true fused + routed-MoE kernel where padding is part of the native tile scheduler. + +### Phase113: W4A16 Direct-A GPU Tiles + +- Date: 2026-07-01. +- Plan: + `docs/superpowers/plans/2026-07-01-w4a16-direct-a-gpu-tiles-phase113.md`. +- Artifact: + `/home/mudler/bench/phase113_w4a16_direct_a_gpu_tiles/20260701_233345_no_readback`. 
+- Env under test: + `LLAMA_W4A16_PREFILL_M=128 LLAMA_W4A16_DIRECT_A=1 LLAMA_MOE_GPU_SORT=1 LLAMA_W4A16_GPU_TILES=1`. +- Source decision: rejected and reverted. + +Selected gates: + +| env | selected gate | +|-----|---------------| +| Phase112 control, `DIRECT_A=1 MOE_GPU_SORT=1` | `13/13` | +| Phase113 candidate, plus `W4A16_GPU_TILES=1` | `13/13` | +| post-revert Phase112 control | `13/13` | + +Perf: + +| op | shape | Phase112 control us | Phase113 candidate us | candidate change | +|----|-------|--------------------:|----------------------:|-----------------:| +| `MOE_SWIGLU_DOWN` | `n_tokens=128` | `808.130330` | `803.574960` | `+0.56%` | +| `MUL_MAT_ID_RAGGED_MOE` | `n=128` | `1242.206731` | `1239.567308` | `+0.21%` | +| `MOE_SWIGLU_DOWN` | `n_tokens=257` | `1478.156342` | `1476.355457` | `+0.12%` | +| `MUL_MAT_ID_RAGGED_MOE` | `n=257` | `2148.437500` | `2214.230603` | `-3.06%` | + +Canonical gates: + +- Skipped for the candidate because the perf gate failed. +- Post-revert selected gate passed `13/13`, restoring the accepted Phase112 + state on DGX. + +Decision: + +- Reject and revert Phase113. +- Do not spend more time on compact GPU tile descriptors for W4A16 unless the + GEMM itself consumes a vLLM-style padded metadata contract directly. +- The next credible MoE phase should move toward padded aligned metadata + (`sorted_token_ids`, expert-per-block ids, and padded row count) rather than + compact descriptors plus a ragged tile map. + +### Phase112: W4A16 Direct Activation Staging + +- Date: 2026-07-01. +- Plan: + `docs/superpowers/plans/2026-07-01-w4a16-direct-a-phase112.md`. +- Artifact: + `/home/mudler/bench/phase112_w4a16_direct_a/20260701_231749_direct_a`. +- Env under test: + `LLAMA_W4A16_PREFILL_M=128 LLAMA_W4A16_DIRECT_A=1 LLAMA_MOE_GPU_SORT=1`. +- Source decision: keep default-off. 
+ +Selected gates: + +| env | selected gate | +|-----|---------------| +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `13/13` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_W4A16_DIRECT_A=1` | `13/13` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_W4A16_DIRECT_A=1 LLAMA_MOE_GPU_SORT=1` | `13/13` | + +Perf: + +| op | shape | W4A16+GPU-sort us | direct-A us | direct-A+GPU-sort us | direct-A+GPU-sort change vs W4A16+GPU-sort | +|----|-------|------------------:|------------:|---------------------:|-----------------------:| +| `MOE_SWIGLU_DOWN` | `n_tokens=128` | `807.219630` | `805.847949` | `809.409493` | `-0.27%` | +| `MUL_MAT_ID_RAGGED_MOE` | `n=128` | `1242.664663` | `1245.671875` | `1247.674279` | `-0.40%` | +| `MOE_SWIGLU_DOWN` | `n_tokens=257` | `1551.081790` | `1576.045597` | `1477.738938` | `+4.73%` | +| `MUL_MAT_ID_RAGGED_MOE` | `n=257` | `2278.504464` | `2347.164352` | `2166.224138` | `+4.93%` | + +Canonical gates for direct-A+GPU-sort: + +| gate | result | +|------|--------| +| README MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| README dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| `SSM_CONV` | `45/45` | +| `SSM_CONV_SPLIT` | `6/6` | +| `GET_ROWS` | `49/49` supported rows | +| `GATED_DELTA_NET` | `48/48` | +| `MUL_MAT` | `1146/1146` supported rows | +| `MUL_MAT_ID` | `806/806` | + +Note: the older handoff snippet with `-no-cnv -c 4096` produced stable but +non-canonical md5s (`18a4e85031694388bab85e5f5b03effc` and +`0764361176d94719ab94f82da12eed65`) for both the direct-A candidate and the +W4A16+GPU-sort control. Treat that as a harness mismatch, not a sanctioned +gate. The patch-series README gate without `-no-cnv` and without explicit +`-c 4096` is the canonical md5 gate used above. + +Decision: + +- Carry Phase112 as default-off only. +- The improvement is real for the larger Phase108 MoE rows, but it only narrows + the fallback path. W4A16 fallback is still not the default grouped-MMQ parity + path. 
+- Next target: either remove another W4A16 fallback boundary that remains after + direct-A, or shift to a fused routed-MoE kernel that avoids fallback entirely + while preserving the same md5/op gates. + +## Current Serving Record + +Phase72 broader serving snapshot, MoE `PTOK=128`, `GEN=64`, `PARALLEL=128`. + +Artifact: + +- `/home/mudler/bench/phase72_ttft_min32_serving/20260701_160730` + +| arm | n | agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | ttft_mean_ms | wall_s | +|-----|--:|--------:|---------------:|------------------:|------------:|-------------:|-------:| +| llama default | `8` | `170.4` | `231.3` | `28.42` | `1693.4` | `786.4` | `3.004` | +| llama min32 | `8` | `158.5` | `218.4` | `26.27` | `1547.8` | `816.2` | `3.230` | +| vLLM | `8` | `260.0` | `305.9` | `37.32` | `4659.7` | `266.4` | `1.915` | +| llama default | `32` | `257.8` | `430.2` | `12.09` | `1720.4` | `2625.2` | `7.943` | +| llama min32 | `32` | `242.7` | `411.7` | `11.58` | `1617.4` | `2881.6` | `8.439` | +| vLLM | `32` | `463.6` | `601.0` | `17.60` | `5496.2` | `773.7` | `4.357` | +| llama default | `128` | `325.8` | `714.0` | `3.92` | `1628.8` | `7822.5` | `25.148` | +| llama min32 | `128` | `316.0` | `697.9` | `3.81` | `1606.0` | `8056.9` | `25.926` | +| vLLM | `128` | `666.4` | `1029.5` | `6.81` | `5292.5` | `2511.7` | `11.933` | + +Ratios: + +| n | min32/default agg | min32/default decode | min32/default TTFT | default decode/vLLM | min32 decode/vLLM | +|--:|------------------:|---------------------:|-------------------:|--------------------:|----------------:| +| `8` | `0.9302` | `0.9442` | `1.0379` | `0.7561` | `0.7140` | +| `32` | `0.9414` | `0.9570` | `1.0977` | `0.7158` | `0.6850` | +| `128` | `0.9699` | `0.9775` | `1.0300` | `0.6935` | `0.6779` | + +Decision: + +- Reject default-on for `LLAMA_TTFT_PREFILL_FIRST=1` + `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32`. +- Keep min32 as opt-in only. 
+- The opt-in regressed aggregate, decode, TTFT, and wall time at every tested + concurrency and widened the vLLM decode gap. + +## Attempt Log + +### Phase111: W4A16 GPU Tile Descriptor Probe + +- Date: 2026-07-01. +- Plan: + `docs/superpowers/plans/2026-07-01-w4a16-gpu-tile-descriptors-phase111.md`. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: rejected and reverted. + - Probe added default-off `LLAMA_W4A16_GPU_TILES=1`. + - It built W4A16 tile descriptors on GPU from Phase110 `expert_bounds_dev` + with an atomic tile counter, then copied back one `n_tiles` integer for the + grouped W4A16 launch dimension. + - The final source returned to the Phase110 `LLAMA_MOE_GPU_SORT=1` state. +- Failed build/runtime artifact: + `/home/mudler/bench/phase111_w4a16_gpu_tiles/20260701_230216`. +- Measured artifact: + `/home/mudler/bench/phase111_w4a16_gpu_tiles/20260701_230400_fix1`. + +Failure/fix notes: + +| attempt | result | cause | +|---------|--------|-------| +| initial DGX compile | failed | `expert_bounds_for_w4a16` was typed `const int32_t *` but `mm_ids_helper` writes expert bounds | +| first runtime artifact `20260701_230216` | aborted | CUDA pool LIFO assert: outer `expert_bounds_dev` was allocated after inner `ids_dst_dev` but freed later | +| fix1 artifact `20260701_230400_fix1` | selected gates passed | allocation order corrected; `LLAMA_W4A16_GPU_TILES=1` branch traced | +| post-revert gate | `13/13` | source restored to Phase110 behavior | + +Selected gates: + +| env | selected gate result | +|-----|----------------------| +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `13/13` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1 LLAMA_W4A16_GPU_TILES=1` | `13/13` | +| post-revert `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `13/13` | + +Clean perf A/B: + +| env | case | `n_tokens` | time_us | n_runs | vs Phase110 GPU-sort | +|-----|------|-----------:|--------:|-------:|---------------------:| +| 
`LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `MOE_SWIGLU_DOWN` | `128` | `807.037812` | `1243` | `1.000` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `MOE_SWIGLU_DOWN` | `257` | `1531.958716` | `654` | `1.000` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1 LLAMA_W4A16_GPU_TILES=1` | `MOE_SWIGLU_DOWN` | `128` | `802.969697` | `1254` | `0.995` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1 LLAMA_W4A16_GPU_TILES=1` | `MOE_SWIGLU_DOWN` | `257` | `1538.542813` | `654` | `1.004` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `MUL_MAT_ID_RAGGED_MOE` | `128` | `1244.568510` | `832` | `1.000` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `MUL_MAT_ID_RAGGED_MOE` | `257` | `2250.435268` | `448` | `1.000` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1 LLAMA_W4A16_GPU_TILES=1` | `MUL_MAT_ID_RAGGED_MOE` | `128` | `1243.544471` | `832` | `0.999` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1 LLAMA_W4A16_GPU_TILES=1` | `MUL_MAT_ID_RAGGED_MOE` | `257` | `2295.743304` | `448` | `1.020` | + +Trace facts: + +- `MOE_SWIGLU_DOWN n=257` built `128` W4A16 tiles for `2056` rows. +- `MUL_MAT_ID_RAGGED_MOE n=257` built `288` W4A16 tiles for `2056` rows. +- The clean perf rerun omitted `LLAMA_W4A16_GPU_TILES_TRACE=1`; the earlier + traced perf leg is preserved in the artifact but should not be used for timing. + +Decision: + +- Reject and revert Phase111 source. Moving only the W4A16 tile descriptor build + to GPU is correctness-clean after fixes, but it does not improve the parity + row and slightly regresses the most relevant 257-token ragged row. +- Do not spend another phase on a one-piece W4A16 host-metadata cleanup. The + next W4A16 attempt must remove a larger boundary, such as direct activation + consumption plus GPU descriptors in one path, or avoid the host-sync fallback + path entirely. + +### Phase110: GPU MoE Routing Metadata for Fallback/W4A16 + +- Date: 2026-07-01. 
+- Plan: + `docs/superpowers/plans/2026-07-01-gpu-moe-routing-metadata-phase110.md`. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: new default-off CUDA source change in + `ggml/src/ggml-cuda/ggml-cuda.cu`. + - Add `LLAMA_MOE_GPU_SORT=1` to route fallback `ggml_cuda_mul_mat_id` + metadata construction through existing `ggml_cuda_launch_mm_ids_helper()`. + - Add a local inverse-permutation kernel because `mm_ids_helper` returns + sorted-to-original `ids_dst`, while fallback `get_rows_cuda()` needs + original-to-sorted `ids_from_sorted`. + - Leave graph-safe grouped-MMQ untouched. +- Failed first artifact: + `/home/mudler/bench/phase110_gpu_moe_sort/20260701_224103`. +- Accepted artifact: + `/home/mudler/bench/phase110_gpu_moe_sort/20260701_224446_fix1`. + +Initial failure and fix: + +| artifact | env | selected gate result | reason | +|----------|-----|----------------------|--------| +| `20260701_224103` | default | `13/13` | baseline clean | +| `20260701_224103` | `LLAMA_W4A16_PREFILL_M=128` | `13/13` | fallback baseline clean | +| `20260701_224103` | `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `10/13` | wrong permutation direction for fallback `get_rows` | +| `20260701_224446_fix1` | default | `13/13` | accepted fix | +| `20260701_224446_fix1` | `LLAMA_W4A16_PREFILL_M=128` | `13/13` | accepted fix | +| `20260701_224446_fix1` | `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `13/13` | accepted fix; trace showed branch execution | + +Canonical gates: + +| env | MoE md5 | dense md5 | `SSM_CONV` | `SSM_CONV_SPLIT` | `GET_ROWS` | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|-----|---------|-----------|------------|------------------|------------|-------------------|-----------|--------------| +| default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | 
`8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | + +Perf A/B: + +| env | case | `n_tokens` | time_us | n_runs | vs W4A16 | vs default | +|-----|------|-----------:|--------:|-------:|---------:|-----------:| +| default | `MOE_SWIGLU_DOWN` | `128` | `806.724859` | `1243` | n/a | `1.000` | +| default | `MOE_SWIGLU_DOWN` | `257` | `1022.161585` | `984` | n/a | `1.000` | +| `LLAMA_W4A16_PREFILL_M=128` | `MOE_SWIGLU_DOWN` | `128` | `809.339501` | `1243` | `1.000` | `1.003` | +| `LLAMA_W4A16_PREFILL_M=128` | `MOE_SWIGLU_DOWN` | `257` | `1656.102310` | `606` | `1.000` | `1.620` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `MOE_SWIGLU_DOWN` | `128` | `807.311344` | `1243` | `0.997` | `1.001` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `MOE_SWIGLU_DOWN` | `257` | `1536.868502` | `654` | `0.928` | `1.504` | +| default | `MUL_MAT_ID_RAGGED_MOE` | `128` | `1242.343750` | `832` | n/a | `1.000` | +| default | `MUL_MAT_ID_RAGGED_MOE` | `257` | `1453.979651` | `688` | n/a | `1.000` | +| `LLAMA_W4A16_PREFILL_M=128` | `MUL_MAT_ID_RAGGED_MOE` | `128` | `1248.412260` | `832` | `1.000` | `1.005` | +| `LLAMA_W4A16_PREFILL_M=128` | `MUL_MAT_ID_RAGGED_MOE` | `257` | `2428.586538` | `416` | `1.000` | `1.670` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `MUL_MAT_ID_RAGGED_MOE` | `128` | `1247.145433` | `832` | `0.999` | `1.004` | +| `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1` | `MUL_MAT_ID_RAGGED_MOE` | `257` | `2237.145089` | `448` | `0.921` | `1.539` | + +Decision: + +- Keep Phase110 as a default-off structural base. It is md5/op clean after the + inverse-permutation fix and confirms vLLM-style GPU route metadata can replace + the CPU id scan for the host-sync fallback path. +- Do not promote it as a speed parity lever by itself. 
The W4A16 fallback + improves by `7.2%` on `MOE_SWIGLU_DOWN n=257` and `7.9%` on + `MUL_MAT_ID_RAGGED_MOE n=257`, but still remains about `1.5x` slower than + the default grouped-MMQ path. +- Phase111 should only build on this if it removes another fallback bottleneck: + either the remaining `expert_bounds` host copy / host tile descriptor build, + or a grouped W4A16 path that can consume GPU expert bounds directly. + +### Phase109: Existing MoE Prefill and Tile-Policy A/B + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: no new source changes. This was an env-only benchmark + attempt using the Phase108 perf CSV harness. +- Artifact: + `/home/mudler/bench/phase109_existing_moe_prefill_ab/20260701_222559`. + +Perf A/B: + +| env | case | `n_tokens` | time_us | n_runs | vs default | +|-----|------|-----------:|--------:|-------:|-----------:| +| default | `MOE_SWIGLU_DOWN` | `128` | `800.802233` | `1254` | `1.000` | +| default | `MOE_SWIGLU_DOWN` | `257` | `1008.593373` | `996` | `1.000` | +| `LLAMA_W4A16_PREFILL_M=128` | `MOE_SWIGLU_DOWN` | `128` | `805.747385` | `1243` | `1.006` | +| `LLAMA_W4A16_PREFILL_M=128` | `MOE_SWIGLU_DOWN` | `257` | `1646.679739` | `612` | `1.633` | +| `LLAMA_FP4_PREFILL_M=128` | `MOE_SWIGLU_DOWN` | `128` | `806.103781` | `1243` | `1.007` | +| `LLAMA_FP4_PREFILL_M=128` | `MOE_SWIGLU_DOWN` | `257` | `4070.191057` | `246` | `4.035` | +| `LLAMA_MOE_DENSITY_MAX=9` | `MOE_SWIGLU_DOWN` | `128` | `810.080451` | `1243` | `1.012` | +| `LLAMA_MOE_DENSITY_MAX=9` | `MOE_SWIGLU_DOWN` | `257` | `1024.869121` | `978` | `1.016` | +| `LLAMA_MOE_MMQ_X=64` | `MOE_SWIGLU_DOWN` | `128` | `806.358005` | `1243` | `1.007` | +| `LLAMA_MOE_MMQ_X=64` | `MOE_SWIGLU_DOWN` | `257` | `1008.191767` | `996` | `1.000` | +| default | `MUL_MAT_ID_RAGGED_MOE` | `128` | `1241.417067` | `832` | `1.000` | +| default | `MUL_MAT_ID_RAGGED_MOE` | `257` | `1445.333807` | `704` | `1.000` | +| `LLAMA_W4A16_PREFILL_M=128` | 
`MUL_MAT_ID_RAGGED_MOE` | `128` | `1242.049279` | `832` | `1.001` | +| `LLAMA_W4A16_PREFILL_M=128` | `MUL_MAT_ID_RAGGED_MOE` | `257` | `2518.852500` | `400` | `1.743` | +| `LLAMA_FP4_PREFILL_M=128` | `MUL_MAT_ID_RAGGED_MOE` | `128` | `1244.775240` | `832` | `1.003` | +| `LLAMA_FP4_PREFILL_M=128` | `MUL_MAT_ID_RAGGED_MOE` | `257` | `2898.838068` | `352` | `2.006` | +| `LLAMA_MOE_DENSITY_MAX=9` | `MUL_MAT_ID_RAGGED_MOE` | `128` | `1247.564904` | `832` | `1.005` | +| `LLAMA_MOE_DENSITY_MAX=9` | `MUL_MAT_ID_RAGGED_MOE` | `257` | `1438.245739` | `704` | `0.995` | +| `LLAMA_MOE_MMQ_X=64` | `MUL_MAT_ID_RAGGED_MOE` | `128` | `1246.139423` | `832` | `1.004` | +| `LLAMA_MOE_MMQ_X=64` | `MUL_MAT_ID_RAGGED_MOE` | `257` | `1434.058239` | `704` | `0.992` | + +`MOE_WEIGHTED_COMBINE` spot rows: + +| env | `n_tokens=128` | `n_tokens=257` | +|-----|---------------:|---------------:| +| default | `27.695333` | `67.423746` | +| `LLAMA_W4A16_PREFILL_M=128` | `27.502254` | `95.550477` | +| `LLAMA_FP4_PREFILL_M=128` | `27.687500` | `229.421474` | + +Correctness gates: + +| env | selected gate result | +|-----|----------------------| +| default | `13/13` | +| `LLAMA_W4A16_PREFILL_M=128` | `13/13` | +| `LLAMA_FP4_PREFILL_M=128` | `13/13` | +| `LLAMA_MOE_DENSITY_MAX=9` | `13/13` | +| `LLAMA_MOE_MMQ_X=64` | `13/13` | + +Trace notes: + +- The default/density route remained CUDA-graph-safe grouped MMQ: + `route=mmq host_sync=0`. +- For the 257-token ragged row the traced launch uses + `ncols_dst=2056`, `ncols_max=257`, `mmq_x=96`, `stream_k_blocks == ntiles_dst`, + and `fixup=0`. +- For 128-token rows the current default already selects `mmq_x=64`; raising + density or forcing 64 does not open a new path. + +Decision: + +- Reject existing W4A16 and FP4 large-M env routes for these Phase108 MoE + sentinel rows. They are correctness-clean but slower, especially at + `n_tokens=257`. +- Reject `LLAMA_MOE_DENSITY_MAX=9` and `LLAMA_MOE_MMQ_X=64` as parity levers. 
+ The best `MUL_MAT_ID_RAGGED_MOE` improvement is only `0.5-0.8%` and + `MOE_SWIGLU_DOWN` is flat or worse. +- Do not spend Phase110 on another MMQ tile-policy shortcut. +- Next implementation should target the structural gap identified by the vLLM + audit: build routed-MoE sorted token/expert metadata on GPU and remove the + host ID readback/sync path from the grouped fallback/W4A16 path, while keeping + the graph-safe MMQ path untouched. + +### Phase108: MoE Whole-Graph Perf CSV Harness + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: measurement-only source change in + `tests/test-backend-ops.cpp`. + - Add existing `MOE_SWIGLU_DOWN`, `MOE_WEIGHTED_COMBINE`, and + `MUL_MAT_ID_RAGGED_MOE` whole-graph cases to `make_test_cases_perf()` for + `n_tokens=128` and `257`. + - Expand `--output csv` to use `test_result::get_fields()`, which includes + `time_us`, `flops`, `bandwidth_gb_s`, `memory_kb`, and `n_runs`. +- Artifact: + `/home/mudler/bench/phase108_moe_perf_csv/20260701_221559`. 
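
The `vs default` columns in the Phase109 and Phase110 tables are ratios of `time_us` between two CSV runs. A minimal sketch of that comparison follows. It assumes the CSV carries `op_name` and `op_params` key columns (hypothetical names here; only `time_us`, `flops`, `bandwidth_gb_s`, `memory_kb`, and `n_runs` are named in this log). The sample rows are trimmed extracts built from the Phase109 `MOE_SWIGLU_DOWN` `n_tokens=257` timings, not real harness output.

```python
import csv
import io


def load_time_us(csv_text, op_col="op_name", params_col="op_params"):
    """Map (op, params) -> time_us for every row that carries a timing.

    The key column names are assumptions for illustration.
    """
    times = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        if not row.get("time_us"):
            continue  # skip rows that carry no timing value
        times[(row[op_col], row[params_col])] = float(row["time_us"])
    return times


def slowdown_vs_default(default_csv, candidate_csv):
    """Return candidate/default time ratios for cases present in both runs."""
    base = load_time_us(default_csv)
    cand = load_time_us(candidate_csv)
    return {key: cand[key] / base[key] for key in base if key in cand}


# Trimmed, illustrative extracts (timings copied from the Phase109 table).
DEFAULT_CSV = """op_name,op_params,time_us,n_runs
MOE_SWIGLU_DOWN,n_tokens=257,1008.593373,996
"""
W4A16_CSV = """op_name,op_params,time_us,n_runs
MOE_SWIGLU_DOWN,n_tokens=257,1646.679739,612
"""

ratios = slowdown_vs_default(DEFAULT_CSV, W4A16_CSV)
for (op, params), ratio in ratios.items():
    print(f"{op} {params}: {ratio:.3f}x vs default")
```

A ratio above `1.000` means the candidate env is slower than default for that case, which is how the W4A16 and FP4 large-M routes were rejected.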
+ +RED condition from Phase107: + +| command | Phase107 result | +|---------|-----------------| +| `test-backend-ops perf -b CUDA0 -o MOE_SWIGLU_DOWN --output csv` | zero rows | +| `test-backend-ops perf -b CUDA0 -o MOE_WEIGHTED_COMBINE --output csv` | zero rows | +| `test-backend-ops perf -b CUDA0 -o MUL_MAT_ID_RAGGED_MOE --output csv` | zero rows | + +Perf rows after patch: + +| case | params | time_us | n_runs | flops | +|------|--------|--------:|-------:|------:| +| `MOE_SWIGLU_DOWN` | `type_a=nvfp4,n_mats=128,n_used=8,n_ff=768,n_tokens=128,n_embd=2048` | `801.764753` | `1254` | `12053007297164.449219` | +| `MOE_SWIGLU_DOWN` | `type_a=nvfp4,n_mats=128,n_used=8,n_ff=768,n_tokens=257,n_embd=2048` | `1019.953252` | `984` | `19023274120980.359375` | +| `MOE_WEIGHTED_COMBINE` | `type_a=nvfp4,n_mats=128,n_used=8,n_ff=768,n_tokens=128,n_embd=2048` | `27.550055` | `36320` | `117074893979840.453125` | +| `MOE_WEIGHTED_COMBINE` | `type_a=nvfp4,n_mats=128,n_used=8,n_ff=768,n_tokens=257,n_embd=2048` | `67.593041` | `14800` | `95809244446043.828125` | +| `MUL_MAT_ID_RAGGED_MOE` | `type_a=nvfp4,n_mats=256,n_used=8,m=768,n=128,k=2048` | `1239.103365` | `832` | `2599642259062.170898` | +| `MUL_MAT_ID_RAGGED_MOE` | `type_a=nvfp4,n_mats=256,n_used=8,m=768,n=257,k=2048` | `1445.950284` | `704` | `4472917803025.495117` | + +Safety gates: + +| gate | result | +|------|--------| +| MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| `MOE_SWIGLU_DOWN` | `7/7` | +| `MOE_WEIGHTED_COMBINE` | `7/7` | +| `MUL_MAT_ID_RAGGED_MOE` | `6/6` | +| `SSM_CONV` | `45/45` | +| `SSM_CONV_SPLIT` | `6/6` | +| `GET_ROWS` | `49/49` | +| `GATED_DELTA_NET` | `48/48` | +| `MUL_MAT` | `1146/1146` | +| `MUL_MAT_ID` | `806/806` | + +Notes: + +- The first md5 attempt in `gates/` used `-no-cnv` and intentionally failed + against the canonical chat-template hashes. The corrected historical gate is + in `gates_chat/` and passed. 
+- CSV output is now a usable perf ledger for these cases; the schema includes + timing columns instead of support metadata only. + +Decision: + +- Phase108 closes the Phase107 measurement gap; it is not a parity-improving + runtime patch by itself. +- The dominant focused rows are `MUL_MAT_ID_RAGGED_MOE` (`1239-1446 us/run`) and + `MOE_SWIGLU_DOWN` (`802-1020 us/run`), not `MOE_WEIGHTED_COMBINE` + (`28-68 us/run`). +- Next fused-MoE work should target the routed matmul/SWIGLU/down chain and + must report deltas against these Phase108 rows plus the same md5/op gates. + +### Phase107: Fused-MoE Structural Guardrail + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: no new source changes. This was a correctness and + measurement-surface attempt for the next structural fused routed-MoE path. +- Artifact: + `/home/mudler/bench/phase107_moe_fusion_guardrail/20260701_220227`. + +Correctness guardrails: + +| guard | result | +|-------|--------| +| `MOE_SWIGLU_DOWN` | `7/7` | +| `MOE_WEIGHTED_COMBINE` | `7/7` | +| `MUL_MAT_ID_RAGGED_MOE` | `6/6` | + +Perf-output check: + +| command | result | +|---------|--------| +| `test-backend-ops perf -b CUDA0 -o MOE_SWIGLU_DOWN --output csv` | zero rows | +| `test-backend-ops perf -b CUDA0 -o MOE_WEIGHTED_COMBINE --output csv` | zero rows | +| `test-backend-ops perf -b CUDA0 -o MUL_MAT_ID_RAGGED_MOE --output csv` | zero rows | +| `test-backend-ops perf -b CUDA0 -o MUL_MAT_ID --output csv` | `116` support rows, `63` relevant rows, but no timing columns | + +Decision: + +- Existing correctness guardrails are sufficient to protect the three structural + MoE surfaces before a future source change. +- Existing `test-backend-ops perf` output is not sufficient as a performance + guard for these custom whole-graph cases because it emits support metadata, + not timings.
+- The next source patch should be measurement-only: a narrow MoE fusion timing + harness that emits `case,iterations,total_ms,mean_ms` for the selected + `MOE_SWIGLU_DOWN`, `MOE_WEIGHTED_COMBINE`, and `MUL_MAT_ID_RAGGED_MOE` + shapes. +- Do not start fused routed-MoE kernel implementation until that timing harness + proves which sub-surface is large enough to move Phase104/106 serving. + +### Phase106: Max-Concurrency Current-Stack Serving + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: no new source changes. This was a measurement-only + serving-contract attempt on top of the carried Phase101/102 default-off + cleanup candidates. +- Harness: streamed `paged-current-serving-snapshot.sh` with: + - source-log workaround for the non-git DGX mirror, + - paged env + `LLAMA_SSM_CONV_SPLIT=1 LLAMA_PAGED_KV_GET_ROWS_F16=1`, + - expanded gate ops: + `SSM_CONV,SSM_CONV_SPLIT,GET_ROWS,GATED_DELTA_NET,MUL_MAT,MUL_MAT_ID`, + - `NPL=128 192 256`, `PTOK=128`, `GEN=64`, `PARALLEL=256`, + `CTX=131072`, `BATCH=2048`, `UBATCH=512`, `VLLM_MAX_NUM_SEQS=256`. +- Artifacts: + - dry-run: + `/home/mudler/bench/phase106_max_concurrency_current_stack/20260701_214839_dryrun`, + - full sweep: + `/home/mudler/bench/phase106_max_concurrency_current_stack/20260701_214907`. 
+ +Safety gates: + +| phase | env | MoE md5 | dense md5 | `SSM_CONV` | `SSM_CONV_SPLIT` | `GET_ROWS` | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|-------|-----|---------|-----------|------------|------------------|------------|-------------------|-----------|--------------| +| pre | split + F16 K/V rows | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | +| post | split + F16 K/V rows | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | + +Serving snapshot: + +| arm | n | agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | ttft_mean_ms | wall_s | +|-----|--:|--------:|---------------:|------------------:|------------:|-------------:|-------:| +| paged combined | `128` | `331.8` | `678.9` | `3.90` | `1734.1` | `7392.5` | `24.689` | +| paged combined | `192` | `318.4` | `681.8` | `2.50` | `1602.4` | `11058.0` | `38.595` | +| paged combined | `256` | `338.4` | `824.6` | `2.10` | `1542.8` | `14933.5` | `48.410` | +| vLLM | `128` | `663.4` | `1029.8` | `6.78` | `5228.9` | `2514.6` | `11.970` | +| vLLM | `192` | `709.8` | `1202.4` | `4.98` | `4881.5` | `3674.8` | `16.769` | +| vLLM | `256` | `723.8` | `1320.4` | `3.94` | `4520.9` | `4999.0` | `21.931` | + +Ratios: + +| n | paged decode/vLLM | paged perseq/vLLM | paged agg/vLLM | paged TTFT/vLLM | +|--:|------------------:|------------------:|---------------:|----------------:| +| `128` | `0.6593` | `0.5752` | `0.5002` | `2.9398` | +| `192` | `0.5670` | `0.5020` | `0.4486` | `3.0091` | +| `256` | `0.6245` | `0.5330` | `0.4675` | `2.9873` | + +Decision: + +- Reject C1 as a GB10 parity lever for the current stack. +- llama.cpp completed `N=256`, but vLLM also completed `N=256` under the same + harness cap and remained materially faster. 
+- Higher concurrency did not reveal an aggregate operating point where llama.cpp + catches vLLM: paged aggregate stayed around `318-338 t/s`, while vLLM rose to + `724 t/s`. +- TTFT widened with higher concurrency on llama.cpp (`7392.5 -> 14933.5 ms`) + and stayed much lower on vLLM (`2514.6 -> 4999.0 ms`). +- The next phase should not be another scheduler or MMQ micro-policy. The + remaining plausible source work is structural: persistent batch state, fused + routed-MoE dispatch, or a larger GDN/packed-decode design with new guardrails. + +### Phase105: Current-Stack MoE MMQ Shape Refresh + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: no new source changes. This was a measurement-only + attempt on top of the carried Phase101/102 default-off cleanup candidates. +- Env for trace legs: + `LLAMA_SSM_CONV_SPLIT=1 LLAMA_PAGED_KV_GET_ROWS_F16=1`. +- Artifacts: + - gates: + `/home/mudler/bench/phase105_mmq_current_shape/20260701_213927`, + - serving trace retry: + `/home/mudler/bench/phase105_mmq_current_shape/20260701_214129_serving_retry`. + +Safety gates: + +| gate | env | result | +|------|-----|--------| +| `MUL_MAT_ID_RAGGED_MOE` | default | `6/6` | +| `MUL_MAT_ID_RAGGED_MOE` | split + F16 K/V rows + shape traces | `6/6` | +| `MUL_MAT_ID` | split + F16 K/V rows | `806/806` | + +Trace refresh: + +| source | shape lines | launch lines | small-M lines | shape summary | launch summary | +|--------|------------:|-------------:|--------------:|---------------|----------------| +| ragged gate | `3` | `3` | `2` | density `2/4/9`, `mmq_x_best 40/64/96` | `fixup=0`, `stream_k_blocks == ntiles_dst` | +| one live serving request | `120` | `120` | `0` | `ncols_max=317`, density `10`, `mmq_x_best=112`, `stream_k=1` | `fixup=0`, `stream_k_blocks == ntiles_dst` (`120/120`), efficiency `100` | + +Notes: + +- The first live-serving trace leg used the wrong model path and exited before + loading the model. 
It is preserved in the gate artifact as a harness hiccup, + not an inference failure. +- The serving retry used `~/bench/q36-35b-a3b-nvfp4.gguf`; the request returned + a non-empty response (`3648` bytes), and the wrapper's nonzero exit was from + `grep` under `pipefail` when there were zero `SMALL_M` lines. + +Decision: + +- The current Phase104 stack did not create a new cheap grouped-MMQ lever. +- The trace reconfirms that no-fixup/no-stream-k shortcuts are closed for this + workload, and the live sampled shape is prefill-like rather than a new + small-M decode class. +- Do not pursue another host-side MMQ tile policy. Any next MMQ work must be a + structural kernel or serving-contract change with a clear path to reducing + the dominant `mmq_nvfp4` bucket. +- Given prior GDN micro-kernel rejections, the next high-value phase should be + a larger serving contract or a new structural design, not more isolated + micro-knobs. + +### Phase104: Combined Cleanup Normal Serving Snapshot vs vLLM + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: no new source changes beyond the carried Phase101/102 + default-off runtime candidates. +- Harness: streamed `paged-current-serving-snapshot.sh` with: + - source-log workaround for the non-git DGX mirror, + - paged env + `LLAMA_SSM_CONV_SPLIT=1 LLAMA_PAGED_KV_GET_ROWS_F16=1`, + - expanded gate ops: + `SSM_CONV,SSM_CONV_SPLIT,GET_ROWS,GATED_DELTA_NET,MUL_MAT,MUL_MAT_ID`, + - `NPL=128`, `PTOK=128`, `GEN=64`, `PARALLEL=128`, `CTX=131072`, + `BATCH=2048`, `UBATCH=512`. +- Artifact: + `/home/mudler/bench/phase104_combined_serving_snapshot/20260701_212551`. 
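
The ratio rows in the serving snapshots are each paged metric divided by the matching vLLM metric, so throughput ratios below `1.0` and TTFT ratios above `1.0` both mean paged is behind. A minimal sketch of that arithmetic, using the Phase104 `N=128` snapshot values from this log (the function and dictionary names are illustrative, not part of the harness):

```python
def paged_over_vllm(paged, vllm):
    """Divide each paged metric by the matching vLLM metric."""
    return {name: paged[name] / vllm[name] for name in paged}


# Phase104 serving snapshot, N=128 (values copied from the log).
paged = {
    "decode_agg_tps": 675.8,
    "decode_perseq_tps": 3.93,
    "agg_tps": 338.6,
    "ttft_mean_ms": 7121.6,
}
vllm = {
    "decode_agg_tps": 1028.0,
    "decode_perseq_tps": 6.80,
    "agg_tps": 661.1,
    "ttft_mean_ms": 2572.3,
}

ratios = paged_over_vllm(paged, vllm)
for name, value in ratios.items():
    print(f"{name}: {value:.4f}")

# Inverting a throughput ratio gives the "vLLM is Nx faster" phrasing.
vllm_decode_advantage = 1.0 / ratios["decode_agg_tps"]
vllm_agg_advantage = 1.0 / ratios["agg_tps"]
print(f"vLLM decode advantage: {vllm_decode_advantage:.2f}x")
print(f"vLLM aggregate advantage: {vllm_agg_advantage:.2f}x")
```

Reporting both the ratio and its inverse keeps the tables (paged/vLLM) and the decision text (vLLM-over-paged multipliers) traceable to the same four measurements.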
+ +Safety gates: + +| phase | env | MoE md5 | dense md5 | `SSM_CONV` | `SSM_CONV_SPLIT` | `GET_ROWS` | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|-------|-----|---------|-----------|------------|------------------|------------|-------------------|-----------|--------------| +| pre | split + F16 K/V rows | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | +| post | split + F16 K/V rows | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | + +Serving snapshot, MoE `PTOK=128`, `GEN=64`, `PARALLEL=128`, `N=128`: + +| arm | n | agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | ttft_mean_ms | wall_s | +|-----|--:|--------:|---------------:|------------------:|------------:|-------------:|-------:| +| paged combined | `128` | `338.6` | `675.8` | `3.93` | `1813.0` | `7121.6` | `24.196` | +| vLLM | `128` | `661.1` | `1028.0` | `6.80` | `5208.7` | `2572.3` | `11.980` | + +Ratios: + +| n | paged decode/vLLM | paged perseq/vLLM | paged agg/vLLM | paged TTFT/vLLM | +|--:|------------------:|------------------:|---------------:|----------------:| +| `128` | `0.6574` | `0.5779` | `0.5122` | `2.7686` | + +Comparison to Phase97 Phase93-only normal serving: + +| metric | Phase97 | Phase104 combined | change | +|--------|--------:|------------------:|-------:| +| `agg_tps` | `329.6` | `338.6` | `+2.73%` | +| `decode_agg_tps` | `669.8` | `675.8` | `+0.90%` | +| `prefill_tps` | `1734.5` | `1813.0` | `+4.53%` | +| `ttft_mean_ms` | `7415.4` | `7121.6` | `-3.96%` | +| `wall_s` | `24.851` | `24.196` | `-2.64%` | +| `paged_decode_over_vllm` | `0.6507` | `0.6574` | `+0.0067` | +| `paged_agg_over_vllm` | `0.4958` | `0.5122` | `+0.0164` | + +Decision: + +- The combined cleanup stack has a small real serving benefit outside `nsys`. 
+- It does not change the parity conclusion: vLLM is still about `1.52x` faster + on decode aggregate and `1.95x` faster on aggregate throughput at this shape. +- Carry the combined cleanup env as the best current comparison baseline. +- Next source work should target the remaining high-impact gap, not another + isolated layout cleanup. The current evidence points to larger serving + contracts or the dominant GDN/MMQ buckets. + +### Phase103: Combined Layout Cleanup Stack + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: no new source changes beyond the Phase101 and Phase102 + default-off runtime candidates. +- Env: + `LLAMA_SSM_CONV_SPLIT=1 LLAMA_PAGED_KV_GET_ROWS_F16=1`. +- Artifacts: + - standalone combined gates: + `/home/mudler/bench/phase103_combined_layout_cleanups/20260701_211632/gates_combined`, + - combined serving profile: + `/home/mudler/bench/phase103_combined_layout_cleanups/20260701_211821/serving_profile`. + +Safety gates: + +| gate | env | MoE md5 | dense md5 | `SSM_CONV` | `SSM_CONV_SPLIT` | `GET_ROWS` | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|------|-----|---------|-----------|------------|------------------|------------|-------------------|-----------|--------------| +| standalone combined | split + F16 K/V rows | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | +| serving pre combined | split + F16 K/V rows | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | +| serving post combined | split + F16 K/V rows | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | + +Serving under combined graph-node profiling: + +| metric | value | +|--------|------:| +| aggregate t/s | `212.3` | +| decode aggregate t/s | `331.5` | +| decode 
per-seq t/s | `2.13` | +| prefill t/s | `1569.1` | +| TTFT mean ms | `7858.5` | +| wall s | `38.575` | +| total kernel time | `19.5519 s` | + +Fine bucket comparison: + +| bucket | Phase101 opt-in | Phase102 opt-in | Phase103 combined | Phase103 vs Phase102 | +|--------|----------------:|----------------:|------------------:|---------------------:| +| `convert_dtype` | `661.35 ms` | `663.99 ms` | `662.36 ms` | `-1.63 ms` | +| `copy_layout` | `80.32 ms` | `112.53 ms` | `78.22 ms` | `-34.31 ms` | +| `concat_layout` | `433.13 ms` | `4.59 ms` | `12.51 ms` | `+7.92 ms` | +| `layout-copy` macro | `1220.30 ms` | `826.87 ms` | `798.52 ms` | `-28.35 ms` | +| `get_rows` | `277.67 ms` | `278.61 ms` | `278.61 ms` | `0.00 ms` | +| `gdn_conv` | `453.54 ms` | `383.90 ms` | `390.08 ms` | `+6.18 ms` | +| `gdn_core` | `5886.76 ms` | `5940.33 ms` | `5930.47 ms` | `-9.86 ms` | +| `mmq_nvfp4` | `6193.70 ms` | `5987.09 ms` | `6001.77 ms` | `+14.68 ms` | + +Decision: + +- Correctness-clean combined stack. The two cleanup candidates are compatible. +- The combination improves traced serving over Phase102 and recovers the + Phase101 `copy_layout` reduction while preserving the Phase102 concat removal. +- It is still not a parity-closing lever. Dominant buckets remain + `gdn_core 5930.47 ms` and `mmq_nvfp4 6001.77 ms`, far larger than the + residual layout buckets. +- Carry Phase101+Phase102 as a combined default-off cleanup stack for future + comparisons. Next source work should not spend more time on isolated + layout-copy cleanup unless it also changes a serving-critical contract. + +### Phase102: Split-Input `SSM_CONV` Prefill Path + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. 
+- Local patch status: default-off runtime candidate: + - adds `ggml_ssm_conv_split(ctx, conv_states, x_cur, conv_kernel)` while + reusing `GGML_OP_SSM_CONV`, + - adds CPU and CUDA split-input implementations plus `SSM_CONV_SPLIT` tests, + - wires Qwen3Next/Qwen35/Qwen35MoE through + `LLAMA_SSM_CONV_SPLIT=1` only for `n_seq_tokens > 1`, + `n_seq_tokens >= K-1`, and `cparams.n_rs_seq == 0`, + - keeps decode fused and rollback/short-prefill cases on the existing path. +- Local build: `cmake --build build --target test-backend-ops -j $(nproc)`. +- DGX build: + `cmake --build /home/mudler/llama-phase93-qwen3next-gqa-bcast/build --target llama-server llama-completion test-backend-ops -j $(nproc)`. +- Debug note: the first split-minus-base test used the default normalized-MSE + metric and failed with `ERR = inf` for `d_conv=4` because the CPU reference is + exactly zero. A direct split CUDA-vs-CPU diagnostic passed `6/6`; the final + semantic test keeps `split - base` and uses absolute max error. +- Artifacts: + - default/opt-in standalone gates: + `/home/mudler/bench/phase102_ssm_conv_split/20260701_210559`, + - opt-in serving profile: + `/home/mudler/bench/phase102_ssm_conv_split/20260701_210907/serving_profile`. 
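
The debug note above turns on a property of normalized error metrics: when the reference is exactly zero, any nonzero device error divides by zero. The sketch below is a simplified stand-in for that behaviour, not the `test-backend-ops` implementation; the `nmse` and `max_abs_err` helpers and the sample values are hypothetical.

```python
import math


def nmse(out, ref):
    """Normalized MSE: sum((out - ref)^2) / sum(ref^2).

    Mirrors IEEE division semantics for a zero reference: inf when there
    is any error, NaN when both numerator and denominator are zero.
    """
    err = sum((o - r) ** 2 for o, r in zip(out, ref))
    norm = sum(r ** 2 for r in ref)
    if norm == 0.0:
        return math.inf if err > 0.0 else math.nan
    return err / norm


def max_abs_err(out, ref):
    """Absolute max error: well defined even for an all-zero reference."""
    return max(abs(o - r) for o, r in zip(out, ref))


# A `split - base` style check: the reference difference is exactly zero,
# while the device result carries tiny nonzero rounding noise.
reference = [0.0, 0.0, 0.0, 0.0]
device = [0.0, 1e-7, -2e-7, 0.0]

print("nmse:", nmse(device, reference))
print("max abs:", max_abs_err(device, reference))
```

With an absolute metric the test can apply a small tolerance to the difference itself, which is why the final semantic test keeps `split - base` but changes the error measure.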
+ +Safety gates: + +| gate | env | MoE md5 | dense md5 | `SSM_CONV` | `SSM_CONV_SPLIT` | `GET_ROWS` | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|------|-----|---------|-----------|------------|------------------|------------|-------------------|-----------|--------------| +| default | none | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | +| standalone opt-in | `LLAMA_SSM_CONV_SPLIT=1` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | +| serving pre opt-in | `LLAMA_SSM_CONV_SPLIT=1` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | +| serving post opt-in | `LLAMA_SSM_CONV_SPLIT=1` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `45/45` | `6/6` | `49/49` | `48/48` | `1146/1146` | `806/806` | + +Serving under opt-in graph-node profiling: + +| metric | value | +|--------|------:| +| aggregate t/s | `206.1` | +| decode aggregate t/s | `320.0` | +| decode per-seq t/s | `2.06` | +| prefill t/s | `1538.0` | +| TTFT mean ms | `7928.4` | +| wall s | `39.743` | +| total kernel time | `19.5482 s` | + +Fine bucket comparison: + +| bucket | Phase100 | Phase101 opt-in | Phase102 opt-in | Phase102 vs Phase101 | +|--------|---------:|----------------:|----------------:|---------------------:| +| `convert_dtype` | `661.73 ms` | `661.35 ms` | `663.99 ms` | `+2.64 ms` | +| `copy_layout` | `116.25 ms` | `80.32 ms` | `112.53 ms` | `+32.21 ms` | +| `concat_layout` | `438.15 ms` | `433.13 ms` | `4.59 ms` | `-428.54 ms` | +| `layout-copy` macro | `1262.58 ms` | `1220.30 ms` | `826.87 ms` | `-393.43 ms` | +| `get_rows` | `283.47 ms` | `277.67 ms` | `278.61 ms` | `+0.94 ms` | +| `gdn_conv` | `458.13 ms` | `453.54 ms` | `383.90 ms` | `-69.64 ms` | +| `gdn_core` | `5919.48 ms` | `5886.76 ms` | 
`5940.33 ms` | `+53.57 ms` | +| `mmq_nvfp4` | `6127.44 ms` | `6193.70 ms` | `5987.09 ms` | `-206.61 ms` | + +Decision: + +- Correctness-clean and structurally useful: the split op removes the large + concat materialization from the eligible prefill/microbatch path. +- It does not improve live serving throughput in the profiled `N=128`, + `PTOK=128`, `GEN=64`, `PARALLEL=128` window; aggregate and decode are below + Phase100/101 traced profiles despite lower total kernel time. +- Carry as a default-off cleanup candidate pending repeat A/B or a follow-up + that fuses the remaining state update/copy work. Do not promote as a parity + lever by itself. +- Next higher-value work should target the still-dominant buckets: + `gdn_core` and `mmq_nvfp4`, or a larger serving scheduler/packed-decode + contract. + +### Phase101: Paged K/V F16 `GET_ROWS` A/B + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: default-off runtime candidate: + - `ggml_get_rows_type(ctx, a, b, type)` helper added while preserving stock + `ggml_get_rows` widening semantics, + - CPU reference supports F16 source -> F16 output row copy, + - CUDA already supports F16 `GET_ROWS` output through `get_rows_cuda`, + - paged attention K/V gather calls typed F16 `GET_ROWS` only when + `LLAMA_PAGED_KV_GET_ROWS_F16=1` and the K/V cache tensor is F16, + - tests add F16-output `GET_ROWS` cases. +- Local build: `cmake --build build --target test-backend-ops -j $(nproc)`. +- DGX build: + `cmake --build /home/mudler/llama-phase93-qwen3next-gqa-bcast/build --target llama-server llama-completion test-backend-ops -j $(nproc)`. +- Artifacts: + - default gates: + `/home/mudler/bench/phase101_kv_get_rows_f16/20260701_203621/gates_default`, + - opt-in gates: + `/home/mudler/bench/phase101_kv_get_rows_f16/20260701_203754/gates_optin`, + - opt-in serving profile: + `/home/mudler/bench/phase101_kv_get_rows_f16/20260701_203930/serving_profile`. 
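
One way to see why a typed F16 gather can stay md5-clean: widening F16 to F32 is exact, so narrowing the widened value back to F16 reproduces the original bits for finite values. The sketch below models only that dtype round trip with Python's `struct` half-precision format; it is not ggml code, the helper names are hypothetical, and NaN payloads are deliberately left out.

```python
import struct


def gather_rows(cache, row_bytes, row_ids):
    """Typed gather: copy the selected F16 rows byte for byte."""
    return b"".join(cache[i * row_bytes:(i + 1) * row_bytes] for i in row_ids)


def widen_f16_to_f32(raw):
    """Decode little-endian F16 bytes and pass each value through F32."""
    count = len(raw) // 2
    halves = struct.unpack(f"<{count}e", raw)
    return [struct.unpack("<f", struct.pack("<f", v))[0] for v in halves]


def narrow_to_f16(values):
    """Encode floats back to little-endian F16 bytes."""
    return struct.pack(f"<{len(values)}e", *values)


# Two rows of four F16 values each, including signed zero and a subnormal.
values = [0.5, -1.25, 3.140625, 65504.0, 0.0, -0.0, 2.0 ** -24, 1.0]
cache = struct.pack("<8e", *values)
row_bytes = 4 * 2  # four halves per row, two bytes each

typed = gather_rows(cache, row_bytes, [1, 0])
via_f32 = narrow_to_f16(widen_f16_to_f32(typed))

print("typed gather bytes:", typed.hex())
print("round trip matches:", typed == via_f32)
```

The model says nothing about kernel cost; per the serving profile in this phase, removing the round trip did not materially move `convert_dtype`.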
+ +Safety gates: + +| gate | env | MoE md5 | dense md5 | `GET_ROWS` | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|------|-----|---------|-----------|------------|-------------------|-----------|--------------| +| default | none | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `49/49` | `48/48` | `1146/1146` | `806/806` | +| standalone opt-in | `LLAMA_PAGED_KV_GET_ROWS_F16=1` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `49/49` | `48/48` | `1146/1146` | `806/806` | +| serving pre opt-in raw log | `LLAMA_PAGED_KV_GET_ROWS_F16=1` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `49/49` | `48/48` | `1146/1146` | `806/806` | +| serving post opt-in raw log | `LLAMA_PAGED_KV_GET_ROWS_F16=1` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `49/49` | `48/48` | `1146/1146` | `806/806` | + +Serving under opt-in graph-node profiling: + +| metric | value | +|--------|------:| +| aggregate t/s | `206.4` | +| decode aggregate t/s | `328.0` | +| decode per-seq t/s | `2.08` | +| prefill t/s | `1479.6` | +| TTFT mean ms | `8211.1` | +| wall s | `39.678` | +| total kernel time | `20.1989 s` | + +Fine bucket comparison against Phase100: + +| bucket | Phase100 | Phase101 opt-in | change | +|--------|---------:|----------------:|-------:| +| `convert_dtype` | `661.73 ms` | `661.35 ms` | `-0.38 ms` | +| `copy_layout` | `116.25 ms` | `80.32 ms` | `-35.93 ms` | +| `concat_layout` | `438.15 ms` | `433.13 ms` | `-5.02 ms` | +| `layout-copy` macro | `1262.58 ms` | `1220.30 ms` | `-42.28 ms` | +| `get_rows` | `283.47 ms` | `277.67 ms` | `-5.80 ms` | +| `gdn_core` | `5919.48 ms` | `5886.76 ms` | `-32.72 ms` | +| `mmq_nvfp4` | `6127.44 ms` | `6193.70 ms` | `+66.26 ms` | + +Decision: + +- Correctness-clean but not parity-closing. 
+- The hypothesis that K/V F16 typed gather would materially reduce + `convert_dtype` is mostly false for this serving window; `convert_dtype` + stayed flat. +- The patch does remove some `copy_layout` work and keeps md5/op gates green, + so it can remain as a small default-off cleanup candidate, but it should not + be promoted or treated as the main parity path without a repeat serving A/B. +- Next higher-value runtime work remains either the two-source `SSM_CONV` + contract for `conv_input` or a larger GDN/MMQ serving lever. + +### Phase100: Layout Trace View-Source Attribution + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: trace-only source change in + `ggml/src/ggml-cuda/ggml-cuda.cu`; `LLAMA_LAYOUT_TRACE` now prints + `dst_view`, `src0_view`, and `src1_view`. Default execution is unchanged. +- Local build: `cmake --build build --target test-backend-ops -j $(nproc)`. +- DGX build: + `cmake --build /home/mudler/llama-phase93-qwen3next-gqa-bcast/build --target llama-server llama-completion test-backend-ops -j $(nproc)`. +- Harness: + - trace gate: + `EXTRA_ENV=LLAMA_LAYOUT_TRACE=128 OPS=GATED_DELTA_NET,MUL_MAT,MUL_MAT_ID`, + - serving profile: streamed `/home/mudler/bench/phase76_current_moe_profile.sh` + with source logging fixed for the mirror, `GATED_DELTA_NET` gates, and + `LLAMA_LAYOUT_TRACE=30000` on `llama-server`, + - `N=128`, `PTOK=128`, `GEN=64`, `PARALLEL=128`, `CTX=131072`. +- Artifacts: + - trace gate: + `/home/mudler/bench/phase100_layout_view_trace/20260701_201635/trace_gates`, + - serving profile: + `/home/mudler/bench/phase100_layout_view_trace/20260701_201800/serving_profile`. 
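
The `share` column in this phase's fine-bucket table is consistent with each bucket's time divided by the reported total kernel time. That reading is an inference from the numbers, not something the profile script documents. A minimal sketch that reproduces the column from the Phase100 values (the function name is illustrative):

```python
def bucket_shares(bucket_ms, total_kernel_s):
    """Return each bucket's percentage of total kernel time."""
    total_ms = total_kernel_s * 1000.0
    return {name: 100.0 * ms / total_ms for name, ms in bucket_ms.items()}


# Phase100 fine buckets and total kernel time (values copied from the log).
phase100_ms = {
    "mmq_nvfp4": 6127.44,
    "gdn_core": 5919.48,
    "convert_dtype": 661.73,
    "gdn_conv": 458.13,
    "concat_layout": 438.15,
    "copy_layout": 116.25,
    "ew_repeat": 46.45,
}

shares = bucket_shares(phase100_ms, total_kernel_s=20.3464)
for name, pct in shares.items():
    print(f"{name}: {pct:.2f}%")

dominant = shares["mmq_nvfp4"] + shares["gdn_core"]
print(f"mmq_nvfp4 + gdn_core: {dominant:.2f}%")
```

Summing the two largest buckets shows why layout-only cleanups are bounded: `mmq_nvfp4` and `gdn_core` together account for well over half of kernel time in this profile.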
+ +Safety gates: + +| gate | MoE md5 | dense md5 | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|------|---------|-----------|-------------------|-----------|--------------| +| trace-enabled standalone | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | +| serving pre raw log | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | +| serving post raw log | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | + +Serving under graph-node profiling plus view-source layout trace: + +| metric | value | +|--------|------:| +| aggregate t/s | `207.0` | +| decode aggregate t/s | `327.9` | +| decode per-seq t/s | `2.10` | +| prefill t/s | `1490.9` | +| TTFT mean ms | `8302.7` | +| wall s | `39.578` | +| total kernel time | `20.3464 s` | + +Fine buckets: + +| bucket | time | share | launches | +|--------|-----:|------:|---------:| +| `mmq_nvfp4` | `6127.44 ms` | `30.12%` | `33682` | +| `gdn_core` | `5919.48 ms` | `29.09%` | `4680` | +| `convert_dtype` | `661.73 ms` | `3.25%` | `52060` | +| `gdn_conv` | `458.13 ms` | `2.25%` | `7230` | +| `concat_layout` | `438.15 ms` | `2.15%` | `2130` | +| `copy_layout` | `116.25 ms` | `0.57%` | `8090` | +| `ew_repeat` | `46.45 ms` | `0.23%` | `18720` | + +View-source trace findings: + +| finding | evidence | +|---------|----------| +| K/V cache reads feed F32->F16 converts | For attention layers, `GET_ROWS` outputs F32 `node_*` from F16 `cache_k_l*` / `cache_v_l*`, then a `CPY` downcasts a view of that node to F16. Examples: `node_358 <- cache_k_l3` and `node_365 <- cache_v_l3`, followed by `cpy` rows with `src0_view=node_358` / `node_365`, `src0_type=f32`, `src1_type=f16`, and shapes like `256x64x2x8`, `256x128x2x8`, `256x162x2x8`. 
| +| The pattern repeats across attention layers | The same pair pattern appears for `cache_k_l7/cache_v_l7` (`node_798/node_805`), `cache_k_l11/cache_v_l11` (`node_1238/node_1245`), and later attention layers. | +| Some converts remain anonymous | `959` F32->F16 `CPY` trace rows still had no tensor or view names; do not assume the K/V path accounts for the full `convert_dtype` bucket without a targeted A/B. | +| Phase99 conv attribution is confirmed | `concat` rows show `conv_input-*` from `conv_states_reshaped-*` and `qkv_mixed_transposed-*`; the new view fields map `qkv_mixed_transposed-*` back to layer-local `node_*` producers. | + +Decision: + +- Carry the trace-only Phase100 patch as default-off instrumentation. +- The next runtime source candidate should target the attention K/V cache gather + dtype path: avoid `GET_ROWS` producing F32 only to downcast to F16 when the + consumer wants F16. This is more directly connected to the `convert_dtype` + bucket than a generic copy/layout tweak. +- Keep the two-source `SSM_CONV` contract as a separate later phase for + `concat_layout`; do not mix it with the K/V dtype experiment. + +### Phase99: Serving Layout Trace Attribution + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: no source change; the default-off `LLAMA_LAYOUT_TRACE` + hook was already present in the fork and DGX mirror. +- Harness: + - trace gate: + `EXTRA_ENV=LLAMA_LAYOUT_TRACE=128 OPS=GATED_DELTA_NET,MUL_MAT,MUL_MAT_ID`, + - serving profile: streamed `/home/mudler/bench/phase76_current_moe_profile.sh` + with measurement-only edits for source logging, `GATED_DELTA_NET` gates, + and `LLAMA_LAYOUT_TRACE=30000` on `llama-server`, + - `N=128`, `PTOK=128`, `GEN=64`, `PARALLEL=128`, `CTX=131072`. +- Artifacts: + - trace gate: + `/home/mudler/bench/phase99_layout_trace/20260701_200637/trace_gates`, + - serving profile: + `/home/mudler/bench/phase99_layout_trace/20260701_200835/serving_profile`. 
+ +Safety gates: + +| gate | MoE md5 | dense md5 | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|------|---------|-----------|-------------------|-----------|--------------| +| trace-enabled standalone | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | +| serving pre raw log | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | +| serving post raw log | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | + +Serving under graph-node profiling plus layout trace: + +| metric | value | +|--------|------:| +| aggregate t/s | `208.2` | +| decode aggregate t/s | `332.9` | +| decode per-seq t/s | `2.12` | +| prefill t/s | `1476.8` | +| TTFT mean ms | `8466.3` | +| wall s | `39.341` | +| total kernel time | `20.2408 s` | + +Macro buckets: + +| bucket | time | share | +|--------|-----:|------:| +| GDN | `6709.45 ms` | `33.15%` | +| MoE/FFN-GEMM | `6158.11 ms` | `30.42%` | +| bf16/fp8-proj | `2786.81 ms` | `13.77%` | +| layout-copy | `1269.35 ms` | `6.27%` | +| ew-mul(weight/norm/GDN) | `729.08 ms` | `3.60%` | +| act-quant | `686.52 ms` | `3.39%` | +| FA | `268.04 ms` | `1.32%` | + +Fine buckets: + +| bucket | time | share | launches | +|--------|-----:|------:|---------:| +| `mmq_nvfp4` | `5936.34 ms` | `29.33%` | `34162` | +| `gdn_core` | `5920.40 ms` | `29.25%` | `4710` | +| `convert_dtype` | `662.34 ms` | `3.27%` | `52440` | +| `gdn_conv` | `457.47 ms` | `2.26%` | `7290` | +| `concat_layout` | `440.01 ms` | `2.17%` | `2130` | +| `copy_layout` | `119.16 ms` | `0.59%` | `8110` | +| `ew_repeat` | `47.83 ms` | `0.24%` | `18840` | + +Layout trace summary: + +| route | trace lines | +|-------|------------:| +| `get_rows` | `18779` | +| `cpy` | `4638` | +| `cont` | `4384` | +| `concat` | `2199` | + +Top attribution: + +| finding | evidence | +|---------|----------| +| `concat_layout` is conv input 
materialization | `conv_input-* = concat(conv_states_reshaped-*, qkv_mixed_transposed-*)`; top shapes include `45x8192x12x1 = 3x8192x12x1 + 42x8192x12x1` (`450` trace lines) and `49x8192x11x1 = 3x8192x11x1 + 46x8192x11x1` (`180` trace lines). | +| `copy_layout` includes conv state writeback | `conv_state_update-* = cpy(conv_state_last-*, conv_state_update-*)`; top grouped shapes include `24576x12x1x1 <- 3x8192x12x1` (`780` trace lines), `24576x11x1x1` (`420`), and `24576x13x1x1` (`270`). | +| `convert_dtype` needs stronger attribution | the trace sees many unnamed `CPY` rows with F32 source and F16 destination, e.g. `256x166x2x11`, `256x166x2x12`, and similar attention/KV-shaped tensors; names are not preserved by the current dispatch trace. | + +Decision: + +- Phase99 is a measurement-only phase; no runtime patch was carried or reverted. +- Do not spend more time on the Phase96-style conv-state identity shortcut. + The serving hot layout path is the prefill/microbatch `conv_input` concat + feeding `SSM_CONV`, not just decode update writeback. +- A conv-side source phase must be a larger two-source `SSM_CONV` contract that + reads `(conv_states, qkv_mixed)` as a logical concatenation, or it is too small + to fund. If not coding that, first extend trace attribution for the larger + unnamed F32->F16 `convert_dtype` bucket. + +### Phase98: Phase93 Serving Graph-Node Profile + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: no source change; this measured the carried Phase93 stack + after Phase95 and Phase96 reverts. 
+- Harness: + - streamed `/home/mudler/bench/phase76_current_moe_profile.sh` with two + measurement-only edits: + - source logging does not call `git` because the DGX Phase93 mirror is a + source copy without `.git`, + - pre/post gate ops include `GATED_DELTA_NET,MUL_MAT,MUL_MAT_ID`, + - `SRC=/home/mudler/llama-phase93-qwen3next-gqa-bcast`, + - `BIN=/home/mudler/llama-phase93-qwen3next-gqa-bcast/build/bin`, + - `N=128`, `PTOK=128`, `GEN=64`, `PARALLEL=128`, `CTX=131072`. +- Artifact: + `/home/mudler/bench/phase98_phase93_serving_profile/20260701_215715`. + +Safety gates: + +| phase | MoE md5 | dense md5 | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-------------------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | + +Serving under graph-node profiling, MoE `N=128`, `PTOK=128`, `GEN=64`, +`PARALLEL=128`: + +| metric | value | +|--------|------:| +| aggregate t/s | `208.4` | +| decode aggregate t/s | `332.0` | +| decode per-seq t/s | `2.12` | +| prefill t/s | `1488.1` | +| TTFT mean ms | `8315.5` | +| wall s | `39.296` | +| total kernel time | `20.0411 s` | + +Macro buckets: + +| bucket | time | share | +|--------|-----:|------:| +| GDN | `6679.96 ms` | `33.33%` | +| MoE/FFN-GEMM | `6034.52 ms` | `30.11%` | +| bf16/fp8-proj | `2766.06 ms` | `13.80%` | +| layout-copy | `1257.60 ms` | `6.28%` | +| ew-mul(weight/norm/GDN) | `726.03 ms` | `3.62%` | +| act-quant | `686.69 ms` | `3.43%` | +| FA | `265.00 ms` | `1.32%` | + +Fine buckets: + +| bucket | time | share | launches | +|--------|-----:|------:|---------:| +| `gdn_core` | `5892.99 ms` | `29.40%` | `4680` | +| `mmq_nvfp4` | `5809.55 ms` | `28.99%` | `33442` | +| `cublas_bf16_gemm` | `1745.83 ms` | `8.71%` | `22200` | +| `cutlass_bf16_gemm` | `740.22 ms` | `3.69%` | 
`26190` | +| `ew_mul` | `720.94 ms` | `3.60%` | `48326` | +| `act_quant` | `686.69 ms` | `3.43%` | `37526` | +| `convert_dtype` | `663.45 ms` | `3.31%` | `51300` | +| `gdn_conv` | `457.11 ms` | `2.28%` | `7260` | +| `concat_layout` | `430.25 ms` | `2.15%` | `2100` | +| `get_rows` | `283.56 ms` | `1.41%` | `27978` | +| `gdn_gather` | `231.32 ms` | `1.15%` | `360` | +| `mm_ids` | `119.93 ms` | `0.60%` | `16680` | +| `gdn_l2norm` | `98.54 ms` | `0.49%` | `9360` | +| `gemv_moe_q` | `81.77 ms` | `0.41%` | `1560` | + +Decision: + +- Phase98 confirms the serving hot path is still a two-bucket problem: + `gdn_core` and `mmq_nvfp4` together account for `58.39%` of kernel time. +- The repeated negative GDN micro-tries (Phase91, Phase92, Phase95, Phase96) + argue against more scalar/launch/gather shortcuts. A credible GDN follow-up + needs a larger recurrence design with a measured PoC, not another local tweak. +- `layout-copy` is now large enough (`6.28%`, led by `convert_dtype` and + `concat_layout`) to deserve attribution before code changes, but it is not + parity-closing by itself. +- Next phase should either: + - attribute `convert_dtype`/`concat_layout` to exact graph nodes and remove a + proven material copy, or + - pursue a larger `gdn_core`/`mmq_nvfp4` serving lever with a strict PoC gate. + +### Phase97: Phase93 Serving Snapshot, N=128 + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: no source change; this measured the carried Phase93 stack + after Phase95 and Phase96 reverts. 
+- Harness: + - streamed `paged-current-serving-snapshot.sh` with a one-line source-log + workaround because the DGX Phase93 mirror is a source copy without `.git`, + - `SRC=/home/mudler/llama-phase93-qwen3next-gqa-bcast`, + - `BUILD_DIR=/home/mudler/llama-phase93-qwen3next-gqa-bcast/build`, + - `BIN=/home/mudler/llama-phase93-qwen3next-gqa-bcast/build/bin`, + - `NPL=128`, `PTOK=128`, `GEN=64`, `PARALLEL=128`, `CTX=131072`, + - gate ops: `GATED_DELTA_NET,MUL_MAT,MUL_MAT_ID`. +- Artifact: + `/home/mudler/bench/phase97_phase93_serving_snapshot/20260701_214648`. + +Safety gates: + +| phase | MoE md5 | dense md5 | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-------------------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `48/48` | `1146/1146` | `806/806` | + +Serving snapshot, MoE `PTOK=128`, `GEN=64`, `PARALLEL=128`, `N=128`: + +| arm | n | agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | ttft_mean_ms | wall_s | +|-----|--:|--------:|---------------:|------------------:|------------:|-------------:|-------:| +| paged Phase93 | `128` | `329.6` | `669.8` | `3.85` | `1734.5` | `7415.4` | `24.851` | +| vLLM | `128` | `664.8` | `1029.4` | `6.79` | `5271.8` | `2519.5` | `11.929` | + +Ratios: + +| n | paged decode/vLLM | paged perseq/vLLM | paged agg/vLLM | paged TTFT/vLLM | +|--:|------------------:|------------------:|---------------:|----------------:| +| `128` | `0.6507` | `0.5670` | `0.4958` | `2.9432` | + +Decision: + +- Phase93 remains a valid decode-profile improvement, but it is not + serving-parity at `n=128`. 
+- The Phase97 paged aggregate is slightly above the Phase72 default snapshot + (`329.6` vs `325.8`), and TTFT improves (`7415.4 ms` vs `7822.5 ms`), but + decode aggregate is lower than Phase72 (`669.8` vs `714.0`) while vLLM stays + essentially unchanged (`1029.4` vs `1029.5`). +- Treat Phase93 as worth carrying for source quality and decode-profile gain, + but the next parity phase needs a larger serving-impact lever. More isolated + GDN/conv micro-optimizations are unlikely to close the live serving gap. + +### Phase96: Conv-State Identity Fast Path + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: runtime model-graph change reverted after profiling; + Phase93 is still the current carried source. +- Rationale: + - The Phase93 decode profile showed `ssm_conv_update_ids_f32`/`gdn_conv` + around the 66-72 ms range, larger than the cleanly attributable remaining + GDN producer math. + - The recurrent GDN path already uses a direct in-place op when + `s_copy_main` is identity. This trial added the same shape of branch to + `build_conv_state_fused`: when `inp->s_copy_main_identity` was true, it + viewed the active conv-state cache slots directly and called + `ggml_ssm_conv_update_inplace` instead of the ids variant. + - The existing `build_rs` zero/extra-state maintenance stayed around the + lambda, and the CUDA update kernel loads the conv window before writing the + same slot, so the identity aliasing was expected to be safe. +- Gate and profile artifacts: + - canonical gates: + `/home/mudler/bench/phase96_conv_identity_fastpath/20260701_214023/canonical_gates`, + - decode-only profile: + `/home/mudler/bench/phase96_conv_identity_fastpath/20260701_214141/decode_profile`. 
+ +Safety gates: + +| check | result | +|-------|--------| +| local build | `cmake --build build --target test-backend-ops -j $(nproc)` OK | +| local CPU `SSM_CONV` | `45/45` | +| DGX CUDA `SSM_CONV` | `45/45`, `Backend CUDA0: OK` | +| DGX CUDA `GATED_DELTA_NET_INPLACE_IDS` | `6/6`, `Backend CUDA0: OK` | +| canonical MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| canonical dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| canonical `SSM_CONV` | `45/45`, `Backend CUDA0: OK` | +| canonical `GATED_DELTA_NET` | `48/48`, `Backend CUDA0: OK` | +| canonical `MUL_MAT` | `1146/1146`, `Backend CUDA0: OK` | +| canonical `MUL_MAT_ID` | `806/806`, `Backend CUDA0: OK` | +| profile pre/post md5/op gates | all OK | + +Decode-only profile, MoE `N=128`, `N_PREDICT=2048`, capture after median +depth `74 -> 96`, default env: + +| arm | total kernel s | GDN ms | `gdn_core` ms | `gdn_core` launches | `gdn_conv` ms | `mmq_nvfp4` ms | +|-----|---------------:|-------:|--------------:|--------------------:|--------------:|---------------:| +| Phase93 default | `3.5476` | `1409.19` | `1333.48` | `570` | about `66.40` to `72.26` | `1421.63` | +| Phase96 conv identity | `3.6723` | `1486.12` | `1406.57` | `600` | `70.42` | `1433.84` | + +Decision: + +- Reject the conv-state identity fast path. It is inference-safe, but it did + not improve `gdn_conv` and worsened total kernel time and `gdn_core` versus + Phase93. +- Revert the runtime model-graph change and keep Phase93 as the current carried + candidate. +- Do not retry the conv identity branch as a speed lever unless a same-window + trace shows the ids variant itself is materially slower than the direct + variant independent of launch-count/capture variance. + +### Phase95: GDN Warp Scalar-Gate Broadcast + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: runtime CUDA change reverted after profiling; Phase93 is + still the current carried source. 
+- Env: + - `GDN_WARP_SCALAR_GATE=1` +- Rationale: + - After Phase93, the remaining GDN producer buckets are small while + `gdn_core` remains the largest target. + - The scalar non-KDA decode path loads one scalar gate value per + `(head, seq, token)`, but every lane computes `expf(*g_t)`. This + default-off trial computed the scalar gate on lane 0 and broadcast it within + the warp for the one-token `S_v=128`, non-KDA, default `16x8` decode path. + - The recurrence order, reductions, state update, and stores were unchanged. +- Gate and profile artifacts: + - canonical gates: + `/home/mudler/bench/phase95_gdn_warp_scalar_gate/20260701_213150/canonical_gates`, + - decode-only profile: + `/home/mudler/bench/phase95_gdn_warp_scalar_gate/20260701_213311/decode_profile`. + +Safety gates: + +| check | result | +|-------|--------| +| local build | `cmake --build build --target test-backend-ops -j $(nproc)` OK | +| local CPU `GATED_DELTA_NET` | `48/48` | +| local CPU `GATED_DELTA_NET_INPLACE_IDS` | `6/6` | +| DGX CUDA `GATED_DELTA_NET`, `GDN_WARP_SCALAR_GATE=1` | `48/48`, `Backend CUDA0: OK` | +| DGX CUDA `GATED_DELTA_NET_INPLACE_IDS`, `GDN_WARP_SCALAR_GATE=1` | `6/6`, `Backend CUDA0: OK` | +| canonical MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| canonical dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| canonical `GATED_DELTA_NET` | `48/48`, `Backend CUDA0: OK` | +| canonical `MUL_MAT` | `1146/1146`, `Backend CUDA0: OK` | +| canonical `MUL_MAT_ID` | `806/806`, `Backend CUDA0: OK` | +| profile pre/post md5/op gates | all OK | + +Decode-only profile, MoE `N=128`, `N_PREDICT=2048`, capture after median +depth `65 -> 87`, `PROFILE_ENV=GDN_WARP_SCALAR_GATE=1`: + +| arm | total kernel s | GDN ms | GDN % | `gdn_core` ms | `gdn_core` launches | `mmq_nvfp4` ms | +|-----|---------------:|-------:|------:|--------------:|--------------------:|---------------:| +| Phase93 default | `3.5476` | `1409.19` | `39.72%` | `1333.48` | `570` | `1421.63` | +| Phase95 warp scalar gate 
| `3.6317` | `1483.44` | `40.85%` | `1402.40` | `599` | `1402.88` | + +Decision: + +- Reject `GDN_WARP_SCALAR_GATE=1`. It is inference-safe, but worsens the target + `gdn_core` bucket by `+68.92 ms` and total kernel time by `+84.1 ms` versus + Phase93. +- Revert the runtime CUDA change and keep Phase93 as the current carried + candidate. +- Do not retry scalar-gate warp broadcast unless a future profile shows SFU + pressure, rather than recurrent state traffic/reductions, dominating the + decode GDN core. + +### Phase94: Phase93 GDN Geometry Reprobe, 8x8 + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: no source change; env-only geometry probe rejected. +- Env: + - `GDN_NW=8` + - `GDN_CPW=8` +- Rationale: + - Phase93 changed the active GDN launch mix and dropped `gdn_core` to the + current best `1333.48 ms`. + - The 8x8 geometry keeps a single S_v=128 column tile (`grid.z=1`) like the + default 16x8 path, but halves threads per block. This tested whether lower + block occupancy pressure helped after grouped Q/K broadcast. +- Gate and profile artifacts: + - canonical gates: + `/home/mudler/bench/phase94_gdn_geometry_phase93/20260701_211730/canonical_gates_8x8`, + - decode-only profile: + `/home/mudler/bench/phase94_gdn_geometry_phase93/20260701_211855/decode_profile_8x8`. 
+ +Safety gates: + +| check | result | +|-------|--------| +| DGX CUDA `GATED_DELTA_NET`, `GDN_NW=8 GDN_CPW=8` | `48/48`, `Backend CUDA0: OK` | +| canonical MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| canonical dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| canonical `GATED_DELTA_NET` | `48/48`, `Backend CUDA0: OK` | +| canonical `MUL_MAT` | `1146/1146`, `Backend CUDA0: OK` | +| canonical `MUL_MAT_ID` | `806/806`, `Backend CUDA0: OK` | +| profile pre/post md5/op gates | all OK | + +Decode-only profile, MoE `N=128`, `N_PREDICT=2048`, capture after median +depth `74 -> 96`, `PROFILE_ENV=GDN_NW=8 GDN_CPW=8`: + +| arm | total kernel s | GDN ms | GDN % | `gdn_core` ms | `gdn_core` launches | `mmq_nvfp4` ms | +|-----|---------------:|-------:|------:|--------------:|--------------------:|---------------:| +| Phase93 default geometry | `3.5476` | `1409.19` | `39.72%` | `1333.48` | `570` | `1421.63` | +| Phase94 8x8 geometry | `3.6223` | `1522.02` | `42.02%` | `1440.79` | `600` | `1352.68` | + +Decision: + +- Reject `GDN_NW=8 GDN_CPW=8` for Phase93. It is inference-safe, but worsens + the target `gdn_core` bucket by `+107.31 ms` and total kernel time by + `+74.7 ms`. +- Keep the Phase93 default `16x8` geometry. +- The profile also shows remaining producer-side GDN work is small compared with + recurrence core: `l2_norm_f32 8.65 ms`, GDN gate/sigmoid kernels about + `12.75 ms`, and remaining repeat `5.34 ms` in the Phase93 default trace. The + next candidate should target recurrence work or a larger packed decode + contract, not another small producer-only fusion. + +### Phase93: Qwen3Next Grouped Q/K Broadcast for Fused GDN + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase93-qwen3next-gqa-bcast`. +- Local patch status: carried as a positive candidate. 
+- Patch scope: + - added `ggml_gated_delta_net_set_bcast(tensor, grouped)` using + `op_params[2]`, + - kept default GDN Q/K head mapping as the existing tiled/modulo behavior, + - added grouped mapping for opt-in GDN calls: + `qk_head = value_head / (H_v / H_k)`, + - threaded the grouped flag through CPU GDN, CUDA sequential decode, and CUDA + chunked prefill kernels, + - changed Qwen3Next to skip the explicit q/k repeat only when the GDN op path + can consume grouped broadcast, + - added grouped broadcast backend-op coverage for one-token and prompt-sized + `GATED_DELTA_NET`. +- Build artifact: + `/home/mudler/llama-phase93-qwen3next-gqa-bcast/build`. +- Gate and profile artifacts: + - canonical gates: + `/home/mudler/bench/phase93_qwen3next_gqa_bcast/20260701_210857/canonical_gates`, + - decode-only profile: + `/home/mudler/bench/phase93_qwen3next_gqa_bcast/20260701_211019/decode_profile`. + +Safety gates: + +| check | result | +|-------|--------| +| local build | `cmake --build build --target test-backend-ops -j $(nproc)` OK | +| local CPU `GATED_DELTA_NET` | `48/48`, includes grouped AR and PP cases | +| local CPU `GATED_DELTA_NET_INPLACE_IDS` | `6/6` | +| DGX CUDA `GATED_DELTA_NET` | `48/48`, includes grouped AR and PP cases | +| DGX CUDA `GATED_DELTA_NET_INPLACE_IDS` | `6/6` | +| canonical MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| canonical dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| canonical `GATED_DELTA_NET` | `48/48`, `Backend CUDA0: OK` | +| canonical `MUL_MAT` | `1146/1146`, `Backend CUDA0: OK` | +| canonical `MUL_MAT_ID` | `806/806`, `Backend CUDA0: OK` | +| profile pre/post md5/op gates | all OK | + +Decode-only profile, MoE `N=128`, `N_PREDICT=2048`, capture after median +depth `73 -> 94`, default env: + +| arm | total kernel s | GDN ms | GDN % | `gdn_core` ms | `gdn_core` launches | `mmq_nvfp4` ms | +|-----|---------------:|-------:|------:|--------------:|--------------------:|---------------:| +| Phase87 same-source default | 
`3.6310` | `1471.27` | `40.52%` | `1390.56` | `598` | `1416.46` | +| Phase91 pack2 PDL-fix | `3.5813` | `1505.91` | `42.05%` | `1425.44` | `598` | `1333.39` | +| Phase92 store-fused | `3.7419` | `1609.81` | `43.02%` | `1529.72` | `600` | `1383.82` | +| Phase93 Qwen3Next grouped broadcast | `3.5476` | `1409.19` | `39.72%` | `1333.48` | `570` | `1421.63` | + +Decision: + +- Carry Phase93. It is md5/op clean and improves the target `gdn_core` bucket by + `-57.08 ms` vs Phase87 same-source default, `-66.86 ms` vs Phase85 + identity-state (`1400.34 ms`), and `-92.0 ms` vs the rejected Phase91 pack2 + trial. +- The win is consistent with the intended work reduction: Qwen3Next stops + materializing repeated q/k heads for fused GDN and lets the op map value heads + to grouped q/k heads directly. +- Next follow-up should profile/count node-level repeat/layout buckets around + Qwen3Next GDN to confirm whether more vLLM-style packed decode producer work + remains worth porting. + +### Phase92: Scalar Decode Store-Fused GDN Trial + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase92-gdn-store-fused`, default-off CUDA + experiment on top of the Phase90/91 guardrail stack. +- Local patch status: runtime CUDA changes reverted after profiling; guardrail + stack remains. +- Patch scope: + - added a `STORE_FUSED` CUDA kernel instantiation behind + `GDN_SCALAR_DECODE_STORE_FUSED=1`, + - gated it to S_v=128, scalar-gate, final-state, one-token, in-place decode + with default geometry, + - wrote `state_dst` inside the scalar update loop and skipped the final + post-token register-store loop for that instantiation. +- Build artifact: + `/home/mudler/llama-phase92-gdn-store-fused/build`. +- Guardrail and gate artifacts: + - canonical gates: + `/home/mudler/bench/phase92_gdn_scalar_store_fused/20260701_204550/canonical_gates`, + - decode-only profile: + `/home/mudler/bench/phase92_gdn_scalar_store_fused/20260701_204718/decode_profile`. 
+ +Safety gates: + +| check | result | +|-------|--------| +| local build | `cmake --build build --target test-backend-ops -j $(nproc)` OK | +| local CPU guardrail | `GATED_DELTA_NET_INPLACE_IDS` `6/6`, `Backend CPU: OK` | +| DGX CUDA guardrail, `GDN_SCALAR_DECODE_STORE_FUSED=1` | `6/6`, `Backend CUDA0: OK` | +| canonical MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| canonical dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| canonical `GATED_DELTA_NET` | `46/46`, `Backend CUDA0: OK` | +| canonical `MUL_MAT` | `1146/1146`, `Backend CUDA0: OK` | +| canonical `MUL_MAT_ID` | `806/806`, `Backend CUDA0: OK` | +| profile pre/post md5/op gates | all OK | + +Decode-only profile, MoE `N=128`, `N_PREDICT=2048`, capture after median +depth `72 -> 94`, `PROFILE_ENV=GDN_SCALAR_DECODE_STORE_FUSED=1`: + +| arm | total kernel s | GDN ms | GDN % | `gdn_core` ms | `gdn_core` launches | `mmq_nvfp4` ms | +|-----|---------------:|-------:|------:|--------------:|--------------------:|---------------:| +| Phase87 same-source default | `3.6310` | `1471.27` | `40.52%` | `1390.56` | `598` | `1416.46` | +| Phase91 pack2 PDL-fix | `3.5813` | `1505.91` | `42.05%` | `1425.44` | `598` | `1333.39` | +| Phase92 store-fused | `3.7419` | `1609.81` | `43.02%` | `1529.72` | `600` | `1383.82` | + +Decision: + +- Reject and revert the store-fused runtime patch. It is inference-safe under + the current md5/op gates, but it worsens the target `gdn_core` bucket by + `+139.16 ms` vs Phase87 same-source default and `+104.28 ms` vs the already + rejected Phase91 pack2 trial. +- The extra in-loop global stores likely increase pressure/ordering cost enough + to outweigh removing the final register pass. Do not retry this shape unless + a profile shows the final store loop as independently dominant. +- Next higher-value direction from the vLLM code audit is not another + recurrence micro-loop tweak; scope the larger packed decode contract or the + Qwen3Next GQA-repeat removal as separate, guarded phases. 
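
For the Qwen3Next GQA-repeat removal named in the last bullet, the Phase93 entry records the grouped head mapping as `qk_head = value_head / (H_v / H_k)`. The sketch below contrasts it with a tiled mapping. It is illustrative only: the function names and head counts are hypothetical, the reading of "tiled/modulo" as `value_head % H_k` is an assumption, and `H_v` is assumed to be a multiple of `H_k`.

```python
def tiled_qk_head(value_head: int, h_k: int) -> int:
    # ASSUMPTION: "tiled/modulo" means value heads cycle through the q/k heads.
    return value_head % h_k


def grouped_qk_head(value_head: int, h_v: int, h_k: int) -> int:
    # Grouped mapping as written in the Phase93 patch scope:
    # qk_head = value_head / (H_v / H_k), using integer division.
    if h_v % h_k != 0:
        raise ValueError("sketch assumes H_v is a multiple of H_k")
    return value_head // (h_v // h_k)


if __name__ == "__main__":
    h_v, h_k = 8, 2  # hypothetical head counts, not taken from any model
    print([tiled_qk_head(v, h_k) for v in range(h_v)])         # [0, 1, 0, 1, 0, 1, 0, 1]
    print([grouped_qk_head(v, h_v, h_k) for v in range(h_v)])  # [0, 0, 0, 0, 1, 1, 1, 1]
```

Under these assumptions both mappings read every q/k head the same number of times; only the order differs, which is why the grouped form can replace an explicit repeat when the producer lays heads out in groups.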
+ +### Phase91: Default-off PACK=2 Decode Kernel, Guarded Retry + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase91-gdn-pack2-guarded-source`, default-off + CUDA experiment on top of the Phase90 guardrail stack. +- Local patch status: runtime CUDA changes reverted after profiling; Phase90 + test guardrail remains. +- Patch scope: + - reintroduced a `GDN_DECODE_PACK2=1` F32 scalar-gate, one-token, + in-place decode kernel that packs two sequences into one CTA, + - added a PDL-safety fix after the first canonical md5 failure: inactive + odd/single sequence lanes now call `ggml_cuda_pdl_sync()` before returning, + - extended the guardrail with F32 `n_seqs=1` and `n_seqs=3` + output-plus-state cases. +- Build artifact: + `/home/mudler/llama-phase91-gdn-pack2-guarded-source/build`. +- Guardrail artifacts: + - initial `n_seqs=2` guardrail pass: + `/home/mudler/bench/phase91_gdn_pack2_guarded/20260701_201943/guardrail`, + - initial canonical md5 failure: + `/home/mudler/bench/phase91_gdn_pack2_guarded/20260701_202024/canonical_gates`, + - PDL-fix expanded guardrail pass: + `/home/mudler/bench/phase91_gdn_pack2_guarded/20260701_202140/guardrail_pdl_fix`, + - PDL-fix canonical gates with `GATED_DELTA_NET,MUL_MAT,MUL_MAT_ID`: + `/home/mudler/bench/phase91_gdn_pack2_guarded/20260701_202154/canonical_gates_pdl_fix`, + - decode-only profile: + `/home/mudler/bench/phase91_gdn_pack2_guarded/20260701_202425/decode_profile_pdl_fix`. 
+ +Safety gates: + +| check | result | +|-------|--------| +| initial Phase90 guardrail, `GDN_DECODE_PACK2=1` | `4/4`, `Backend CUDA0: OK` | +| initial canonical MoE md5 | failed: `b93724e88460d90379c5009df0e1f2b6` vs `8cb0ce23777bf55f92f63d0292c756b0` | +| expanded guardrail after PDL fix | `6/6`, covers F32 `n_seqs=1,2,3` output-plus-state | +| PDL-fix MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| PDL-fix dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| PDL-fix `GATED_DELTA_NET` | `46/46`, `Backend CUDA0: OK` | +| PDL-fix `MUL_MAT` | `1146/1146`, `Backend CUDA0: OK` | +| PDL-fix `MUL_MAT_ID` | `806/806`, `Backend CUDA0: OK` | + +Decode-only profile, MoE `N=128`, `N_PREDICT=2048`, capture after +median depth `66 -> 88`, `PROFILE_ENV=GDN_DECODE_PACK2=1`: + +| arm | total kernel s | GDN ms | GDN % | `gdn_core` ms | `gdn_core` launches | `mmq_nvfp4` ms | +|-----|---------------:|-------:|------:|--------------:|--------------------:|---------------:| +| Phase87 same-source default | `3.6310` | `1471.27` | `40.52%` | `1390.56` | `598` | `1416.46` | +| Phase85 identity state | `3.6622` | `1480.21` | `40.42%` | `1400.34` | `596` | `1437.53` | +| Phase91 pack2 PDL-fix | `3.5813` | `1505.91` | `42.05%` | `1425.44` | `598` | `1333.39` | + +Decision: + +- Reject and revert the pack2 runtime patch. It is inference-safe after the PDL + fix, but it worsens the target `gdn_core` bucket by `+34.88 ms` vs the + Phase87 same-source default and `+25.10 ms` vs Phase85. +- Keep the expanded Phase90/91 `GATED_DELTA_NET_INPLACE_IDS` guardrail cases + because they caught the missing odd/single sequence coverage. +- Do not retry CTA-level sequence packing without a different per-sequence work + reduction; packing alone raises GDN's share of total kernel time. + +### Phase90: In-place GDN Decode State Guardrail + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase90-gdn-inplace-ids-guardrail-source`, + test-only experiment on top of the current Phase85 carry-forward stack. 
+- Local patch status: kept as a guardrail candidate in + `tests/test-backend-ops.cpp`. +- Patch scope: + - fixes the in-place ids fixture initialization by mirroring the identity + source cache bytes into `state_dst` after random tensor initialization, + - adds F32 serving-shape cases: `head_count=4`, `head_size=128`, + `n_seqs=2`, scalar gate and KDA, + - makes those F32 cases return `concat(flatten(out), flatten(state_dst))`, + so the normal backend comparator validates both attention output and the + recurrent-state side effect. +- Build artifact: + `/home/mudler/llama-phase90-gdn-inplace-ids-guardrail-source/build`. +- Gate artifacts: + - stale-source assertion: + `/home/mudler/bench/phase90_gdn_inplace_ids_guardrail/20260701_200946/direct`, + - output-only corrected pass: + `/home/mudler/bench/phase90_gdn_inplace_ids_guardrail/20260701_201058/direct`, + - output-plus-state corrected pass: + `/home/mudler/bench/phase90_gdn_inplace_ids_guardrail/20260701_201257/direct`. + +DGX verification: + +| check | result | +|-------|--------| +| local build | `cmake --build build --target test-backend-ops -j $(nproc)` completed | +| local CPU selected op | `4/4`, including F32 `check_state=1` cases | +| DGX CUDA selected op, stale source | failed before comparison on BF16 `state_dst` F32-only assert | +| DGX CUDA selected op, corrected output-only source | `4/4`, `Backend CUDA0: OK` | +| DGX CUDA selected op, output plus state | `4/4`, `Backend CUDA0: OK` | + +Decision: + +- Keep this as the minimum guardrail for the next packed decode attempt. It + covers the Phase88 target shape (`S_v=128`, one-token decode, two sequences) + and observes the side-effect `state_dst` update for F32 scalar-gate and KDA + cases. +- BF16 in-place ids cases remain output-only in this fixture; use canonical md5 + gates for full-model BF16 inference safety. +- Do not profile Phase90: it is a test harness/guardrail attempt, not a runtime + performance candidate. 
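
The output-plus-state idea in the Phase90 patch scope (`concat(flatten(out), flatten(state_dst))`) can be sketched outside ggml as follows. This is a minimal illustration, not the `test-backend-ops` implementation: the helper names, the max-abs-diff comparator, the tolerance, and the sample values are all hypothetical stand-ins.

```python
from typing import List, Sequence


def flatten_concat(out: Sequence[float], state_dst: Sequence[float]) -> List[float]:
    """Mimic concat(flatten(out), flatten(state_dst)) for already-flat inputs."""
    return list(out) + list(state_dst)


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Stand-in comparator (ASSUMPTION: not the real test-backend-ops metric)."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


if __name__ == "__main__":
    tol = 1e-6  # hypothetical tolerance
    ref_out, ref_state = [0.5, -1.0], [2.0, 3.0]
    # Hypothetical buggy backend: attention output matches, state write is wrong.
    bad_out, bad_state = [0.5, -1.0], [2.0, 0.0]

    output_only_ok = max_abs_diff(ref_out, bad_out) <= tol
    combined_ok = max_abs_diff(
        flatten_concat(ref_out, ref_state),
        flatten_concat(bad_out, bad_state),
    ) <= tol
    print(output_only_ok, combined_ok)  # True False
```

The point of the design is visible in the last line: an output-only check passes a backend that corrupts the recurrent state, while the concatenated check does not.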
+ +### Phase89: In-place GDN Decode Test Guardrail Attempt + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase89-gdn-decode-gate-source`, test-only + experiment on top of the reverted Phase88 source. +- Local patch status: reverted after the targeted test filter failed. +- Patch scope: + - temporarily added two `test_gated_delta_net_inplace_ids` cases in + `tests/test-backend-ops.cpp`: + - F32, `head_count=4`, `head_size=128`, `n_seqs=2`, scalar gate, + - F32, `head_count=4`, `head_size=128`, `n_seqs=2`, KDA. +- Build artifact: + `/home/mudler/llama-phase89-gdn-decode-gate-source/build-cuda`. +- Build logs: + - `/home/mudler/llama-phase89-gdn-decode-gate-source/configure.phase89.log` + - `/home/mudler/llama-phase89-gdn-decode-gate-source/build.phase89.log` +- Gate artifact: + `/home/mudler/bench/phase89_gdn_decode_gate/20260701_175903/direct`. + +DGX verification: + +| check | result | +|-------|--------| +| local build | `cmake --build build --target test-backend-ops -j 8` completed | +| local run | local CPU backend skipped for this op set | +| CUDA `GATED_DELTA_NET` filter | `46/46`, `Backend CUDA0: OK` | +| CUDA `GATED_DELTA_NET_INPLACE_IDS` filter | failed `0/4`, including both newly added F32 cases and the two pre-existing BF16 cases | + +Decision: + +- Reject and revert the test-only change. The direct + `GATED_DELTA_NET_INPLACE_IDS` filter is not currently a reliable green + guardrail, because the existing BF16 cases fail when selected directly. +- Do not add more packed decode source until there is a focused harness for the + serving decode shape that compares both attention output and the side-effect + `state_dst` update against the existing sequential kernel. + +### Phase88: Default-off PACK=2 Decode CTA Kernel + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase88-gdn-pack2-source`, one-file CUDA + experiment on top of Phase85. +- Local patch status: reverted after md5 failure. 
+- Patch scope: + - added `gated_delta_net_decode_pack2_cuda` in + `ggml/src/ggml-cuda/gated_delta_net.cu`, + - gated it behind `GDN_DECODE_PACK2=1`, + - limited it to F32 state, scalar-gate, `S_v == 128`, `n_tokens == 1`, + in-place decode, with no `GDN_NW/GDN_CPW` override, + - attempted to preserve the existing `(16,8)` per-column math order while + packing two independent sequences into one CTA. +- Build artifact: + `/home/mudler/llama-phase88-gdn-pack2-source/build-cuda`. +- Build logs: + - `/home/mudler/llama-phase88-gdn-pack2-source/configure.phase88.log` + - `/home/mudler/llama-phase88-gdn-pack2-source/build.phase88.log` +- Gate artifact: + `/home/mudler/bench/phase88_gdn_pack2_gates/20260701_175059/direct`. +- Profile artifact: none. Profiling was skipped because the md5 gate failed. + +DGX gates with `GDN_DECODE_PACK2=1`: + +| check | result | +|-------|--------| +| MoE md5 | failed, got `320b5ed679844cbfd6f18d85d7ae32b0`, expected `8cb0ce23777bf55f92f63d0292c756b0` | +| dense md5 | failed, got `6a65e9d9e47321ebce9e461c8abf036c`, expected `5951a5b4d624ce891e22ab5fca9bc439` | +| `GATED_DELTA_NET` | `Backend CUDA0: OK` | +| `MUL_MAT` | `Backend CUDA0: OK` | +| `MUL_MAT_ID` | `Backend CUDA0: OK` | + +Observed output symptom: + +- MoE output duplicated the opening `` marker. +- Dense output degenerated into repeated `/` characters immediately after the + opening `` marker. + +Decision: + +- Reject and revert. The sacred greedy md5 gate failed, so no profile was run. +- The existing `test-backend-ops -o GATED_DELTA_NET` set did not catch this + because it does not cover the exact serving decode shape that triggers the + pack2 path. Before another packed decode attempt, add or script a focused + `n_seq_tokens=1`, `n_seqs > 1`, in-place F32 state equivalence gate against + the existing sequential kernel. +- Do not carry the pack2 kernel in the patch stack. + +### Phase87: Decode Geometry Probe `(GDN_NW=4, GDN_CPW=8)` + +- Date: 2026-07-01. 
+- Source: `/home/mudler/llama-phase87-gdn-4x8-source`, one-line CUDA + dispatcher experiment on top of Phase85: + expose `launch_gdn_variant<128, ..., NUM_WARPS=4, COLS_PER_WARP=8>` through + the existing `GDN_NW/GDN_CPW` env sweep. +- Local patch status: reverted after profiling. The attempt was env-gated and + never made default. +- Build artifact: + `/home/mudler/llama-phase87-gdn-4x8-source/build-cuda`. +- Build logs: + - `/home/mudler/llama-phase87-gdn-4x8-source/configure.phase87.log` + - `/home/mudler/llama-phase87-gdn-4x8-source/build.phase87.log` +- Gate artifact: + `/home/mudler/bench/phase87_gdn_4x8_gates/20260701_174014/direct`. +- Profile artifact: + `/home/mudler/bench/phase87_gdn_4x8_profile/20260701_174310`. +- Result type: source geometry probe. The hypothesis was that a `4*8 = 32` + column tile would be closer to vLLM's `BV=32` decode program shape while + preserving the existing per-column reduction order. + +DGX gates with `GDN_NW=4 GDN_CPW=8`: + +| check | result | +|-------|--------| +| MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| `GATED_DELTA_NET` | `Backend CUDA0: OK` | +| `MUL_MAT` | `Backend CUDA0: OK` | +| `MUL_MAT_ID` | `Backend CUDA0: OK` | + +Same-source decode-only profile: + +| arm | source | env | active slots | depth start | depth mid | total kernel s | GDN ms | GDN share | `gdn_core` ms | `gdn_core` launches | `mmq_nvfp4` ms | +|-----|--------|-----|-------------:|------------:|----------:|---------------:|-------:|----------:|--------------:|--------------------:|---------------:| +| default geometry | `/home/mudler/llama-phase87-gdn-4x8-source` | default `(16,8)` | `128` | `74` | `96` | `3.6310` | `1471.27` | `40.52%` | `1390.56` | `598` | `1416.46` | +| Phase87 4x8 | `/home/mudler/llama-phase87-gdn-4x8-source` | `GDN_NW=4 GDN_CPW=8` | `128` | `71` | `92` | `3.5988` | `1493.66` | `41.50%` | `1417.13` | `569` | `1396.11` | + +Decision: + +- Reject. 
The target bucket regressed by `+26.57 ms` (`+1.91%`) despite lower + total kernel time from unrelated `mmq_nvfp4` variance. +- Reverted the one-line dispatcher addition. Do not carry this in the patch + stack. +- The subagent/code audit points to a different Phase88 shape: keep the current + `(16,8)` per-column math order and pack two independent sequences per CTA, or + implement a fuller vLLM-style packed decode kernel that fuses producer math + and recurrence. + +### Phase86: Producer-fusion Scope Audit + +- Date: 2026-07-01. +- Source: no source patch. This is a profile-backed scope rejection using the + Phase85 node-traced DGX artifact before spending code on a small-ceiling + fusion. +- Input profile artifact: + `/home/mudler/bench/phase85_gdn_identity_state_profile/20260701_171856`. +- Source audit: + - `ggml/src/ggml-cuda/ggml-cuda.cu` already fuses + `{ GGML_OP_UNARY, GGML_OP_MUL }` for `SILU`, `SIGMOID`, and `SOFTPLUS`, + covering the expensive part of `alpha_softplus * ssm_a`. + - Qwen35 and Qwen35MoE still compute beta sigmoid and the alpha bias/softplus + producer as separate graph pieces, but those pieces are small in the + decode-only trace. + - vLLM's Triton producer fusion remains a useful design reference, but its + isolated producer scope is not the main GB10 bottleneck in this llama.cpp + profile. +- Gate artifact: not applicable, no binary changed. +- Result type: no-code benchmark/scope attempt. The benchmark record below is + copied from the Phase85 candidate profile because Phase86 deliberately asks + whether a source patch is worth writing. 
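The scope question in this phase can be framed as an upper bound: removing a bucket entirely cannot improve the captured total by more than that bucket's share. A minimal sketch of that arithmetic, using the total and bucket times from the Phase85 capture recorded for this phase; the helper name `removal_ceiling_pct` is illustrative and not part of any harness.

```python
def removal_ceiling_pct(bucket_ms: float, total_kernel_s: float) -> float:
    """Upper bound on total-time improvement if the bucket cost nothing."""
    return 100.0 * bucket_ms / (total_kernel_s * 1000.0)


# Figures copied from the Phase85 candidate capture referenced by Phase86.
TOTAL_KERNEL_S = 3.6622
BUCKETS_MS = {
    "act/GDN-gate(shared)": 13.57,
    "gdn_sigmoid": 2.73,
    "gdn_core": 1400.34,
}

for name, ms in BUCKETS_MS.items():
    ceiling = removal_ceiling_pct(ms, TOTAL_KERNEL_S)
    print(f"{name}: at most {ceiling:.2f}% of captured kernel time")
```

The ceiling is deliberately optimistic: it treats the bucket as free after the change, so a real fusion would recover less than this.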
+ +Same-window profile evidence: + +| bucket | time | share | launches | interpretation | +|--------|-----:|------:|---------:|----------------| +| total kernel time | `3.6622 s` | `100.00%` | - | Phase85 identity-state candidate capture | +| `GDN` macro | `1480.21 ms` | `40.42%` | `2980` | target family remains dominant | +| `gdn_core` | `1400.34 ms` | `38.24%` | `596` | real parity lever must reduce this bucket | +| `act/GDN-gate(shared)` macro | `13.57 ms` | `0.37%` | `3771` | entire producer/gate-side ceiling is tiny | +| `gated_act_silu_sigmoid` | `10.84 ms` | `0.30%` | `1786` | already includes fused unary-gated kernels | +| `gdn_sigmoid` | `2.73 ms` | `0.07%` | `1985` | beta sigmoid ceiling | +| `unary_op_kernel<&op_softplus>` | about `1.08 ms` | about `0.03%` | `596` | alpha softplus standalone signal from `nsys stats` | + +Decision: + +- Reject a narrow Phase86 producer-only implementation. Even deleting the whole + `act/GDN-gate(shared)` macro would improve the captured total by only + `0.37%`, and deleting only the still-unfused beta sigmoid would be about + `0.07%`. +- Do not modify or gate source for this phase. It would add upstream conflict + surface without meaningful parity upside. +- Phase87 should target a packed decode GDN kernel, inspired by vLLM's decode + path, that reduces launches and memory traffic inside `gdn_core` itself while + preserving the default F32 recurrent S-cache and md5/op gates. + +### Phase85: Identity-contiguous GDN State Fast Path + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase85-gdn-identity-state-source`, local + eight-file experiment on top of fork commit + `237ad9b96 feat(cuda): add BF16 Qwen GDN state cache`. 
+- Local patch scope: + - carry forward Phase84 attention-only in-place GDN output cleanup, + - add a side-effect-free `llama_memory_recurrent_context::s_copy_main_is_identity`, + - store that identity bit in `llm_graph_input_rs`, + - include it in base and hybrid graph reuse checks, + - call `ggml_gated_delta_net_inplace` on a direct state view when active + recurrent rows are identity-contiguous, otherwise keep the ids path. +- Build artifact: + `/home/mudler/llama-phase85-gdn-identity-state-source/build-cuda`. +- Build logs: + - `/home/mudler/llama-phase85-gdn-identity-state-source/configure.phase85.log` + - `/home/mudler/llama-phase85-gdn-identity-state-source/build.phase85.log` +- Gate artifact: + `/home/mudler/bench/phase85_gdn_identity_state_gates/20260701_171733/direct`. +- Profile artifact: + `/home/mudler/bench/phase85_gdn_identity_state_profile/20260701_171856`. +- Result type: source cleanup / small performance experiment. This reuses the + existing F32 recurrent-state CUDA kernel and changes only the source-state + view used for identity-contiguous decode windows. It avoids the ids scratch + allocation and no-op `gdn_gather_nonident_kernel` launch in that graph shape. 
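The fast-path condition can be pictured as a check on the row-index list that the graph would otherwise hand to the ids path. The sketch below is a minimal illustration, assuming that "identity-contiguous" means active row `i` maps to cell `head + i`; the function names and the `head` parameter are hypothetical and do not mirror the actual `s_copy_main_is_identity` implementation.

```python
from typing import Sequence


def rows_are_identity_contiguous(ids: Sequence[int], head: int = 0) -> bool:
    """True when ids == [head, head + 1, ..., head + len(ids) - 1]."""
    return all(row == head + i for i, row in enumerate(ids))


def pick_state_path(ids: Sequence[int], head: int = 0) -> str:
    # "direct-view": operate on a view of the state cache, no gather scratch.
    # "ids": keep the indexed path for any permuted or sparse mapping.
    if rows_are_identity_contiguous(ids, head):
        return "direct-view"
    return "ids"


print(pick_state_path([4, 5, 6], head=4))  # contiguous window
print(pick_state_path([0, 2, 1]))          # permuted rows
```

Keeping the check side-effect-free matters here because the same bit also feeds the graph reuse checks, which presumably prevents a graph built for one path from being reused for the other.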
+ +Local verification: + +| check | result | +|-------|--------| +| local build | `cmake --build build --target test-backend-ops llama-server -j 8` completed | +| local note | `llama-server` build used the UI archive fallback after local npm engine warning; target completed | + +DGX gates: + +| check | result | +|-------|--------| +| MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| `GATED_DELTA_NET` | `46/46`, `Backend CUDA0: OK` | +| `MUL_MAT` | `1146/1146`, `Backend CUDA0: OK` | +| `MUL_MAT_ID` | `806/806`, `Backend CUDA0: OK` | + +Same-window decode-only profile: + +| arm | source | active slots | depth start | depth mid | total kernel s | GDN ms | GDN share | `gdn_core` ms | `gdn_core` launches | `gdn_gather` ms | GDN macro launches | `mmq_nvfp4` ms | +|-----|--------|-------------:|------------:|----------:|---------------:|-------:|----------:|--------------:|--------------------:|----------------:|------------------:|---------------:| +| baseline F32 | `/home/mudler/llama-phase81-bf16-state-source` | `128` | `73` | `95` | `3.7081` | `1493.78` | `40.28%` | `1412.33` | `600` | `0.89` | `3600` | `1473.60` | +| Phase85 identity state | `/home/mudler/llama-phase85-gdn-identity-state-source` | `128` | `72` | `94` | `3.6622` | `1480.21` | `40.42%` | `1400.34` | `596` | not present | `2980` | `1437.53` | + +Server log signal: + +| arm | CUDA free memory at startup | graph reuse | +|-----|----------------------------:|------------:| +| baseline F32 | `116418 MiB` | `105/122 = 86.1%` | +| Phase85 identity state | `117857 MiB` | `105/123 = 85.4%` | + +Decision: + +- Carry forward only as a small cleanup candidate. The patch is md5/op green, + removes the explicit `gdn_gather` bucket, and reduces GDN macro launches. 
+- Do not treat it as a parity-closing speed lever: direct removed work was only + `0.89 ms` over the capture, and `gdn_core` improved by only `0.85%` + (`1412.33 -> 1400.34 ms`) in a noisy same-window run. +- Keep the next speed-focused scope on either producer fusion + (`alpha softplus * A`, beta sigmoid) or a larger packed decode kernel. The + remaining GDN gap is not explained by ids gather overhead. + +### Phase84: Attention-only Outputs for In-place GDN + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase84-attn-only-source`, local three-file + experiment on top of fork commit + `237ad9b96 feat(cuda): add BF16 Qwen GDN state cache`. +- Local patch files: + - `ggml/src/ggml.c` + - `ggml/src/ggml-cpu/ggml-cpu.c` + - `ggml/src/ggml-cpu/ops.cpp` +- Build artifact: `/home/mudler/llama-phase84-attn-only-source/build-cuda`. +- Build logs: + - `/home/mudler/llama-phase84-attn-only-source/configure.phase84.log` + - `/home/mudler/llama-phase84-attn-only-source/build.phase84.log` +- Gate artifact: + `/home/mudler/bench/phase84_attn_only_gates/20260701_165952/direct`. +- Profile artifact: + `/home/mudler/bench/phase84_attn_only_profile/20260701_170131`. +- Result type: source cleanup / memory experiment. `ggml_gated_delta_net_inplace` + and `ggml_gated_delta_net_inplace_ids` now allocate only the attention-score + output tensor because final recurrent state is written as a side effect into + `state_dst`. The CPU `inplace_ids` non-identity fallback was moved from the + old unused output tail to explicit workspace so CPU/CUDA semantics remain + aligned. 
+ +Local verification: + +| check | result | +|-------|--------| +| local build | `cmake --build build --target test-backend-ops -j 8` completed | +| local GDN subset | no non-CPU backend locally, so CPU was skipped by `test-backend-ops` | + +DGX gates: + +| check | result | +|-------|--------| +| MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| `GATED_DELTA_NET` | `46/46`, `Backend CUDA0: OK` | +| `MUL_MAT` | `1146/1146`, `Backend CUDA0: OK` | +| `MUL_MAT_ID` | `806/806`, `Backend CUDA0: OK` | + +Same-window decode-only profile: + +| arm | source | active slots | depth start | depth mid | total kernel s | GDN ms | GDN share | `gdn_core` ms | `gdn_core` launches | `gdn_core`/launch | `mmq_nvfp4` ms | +|-----|--------|-------------:|------------:|----------:|---------------:|-------:|----------:|--------------:|--------------------:|------------------:|---------------:| +| baseline F32 | `/home/mudler/llama-phase81-bf16-state-source` | `128` | `74` | `96` | `3.6464` | `1481.59` | `40.63%` | `1399.72` | `599` | `2.337 ms` | `1418.47` | +| Phase84 attention-only | `/home/mudler/llama-phase84-attn-only-source` | `128` | `65` | `87` | `3.5814` | `1489.33` | `41.59%` | `1407.38` | `598` | `2.354 ms` | `1349.11` | + +Server log memory signal: + +| arm | CUDA free memory at startup | graph reuse | +|-----|----------------------------:|------------:| +| baseline F32 | `117472 MiB` | `107/124 = 86.3%` | +| Phase84 attention-only | `117855 MiB` | `98/115 = 85.2%` | + +Decision: + +- Do not count Phase84 as a speed parity win. The target GDN bucket moved + `1399.72 -> 1407.38 ms` (`+0.55%`), and the lower total kernel time is again + explained by unrelated `mmq_nvfp4` variance (`1418.47 -> 1349.11 ms`). +- Keep as a possible memory-footprint cleanup only if upstream maintainability + is acceptable: gates are green and the server startup memory signal improved + by about `383 MiB` in the same profile window. 
+- Do not regenerate the LocalAI patch series until a follow-up decides whether + this memory-only cleanup belongs in the fork commit stack. + +### Phase83: KDA GDN exp-cache Decode Shortcut + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase83-kda-gexp-source`, local one-file CUDA + experiment on top of fork commit + `237ad9b96 feat(cuda): add BF16 Qwen GDN state cache`. +- Build artifact: `/home/mudler/llama-phase83-kda-gexp-source/build-cuda`. +- Build log: + `/home/mudler/llama-phase83-kda-gexp-source/build.phase83.log`. +- Gate artifact: + `/home/mudler/bench/phase83_kda_gexp_gates/20260701_184237/direct_retry`. +- Profile artifact: + `/home/mudler/bench/phase83_kda_gexp_profile/20260701_164731`. +- Result type: source micro-optimization. Cache the KDA per-row + `expf(g_t[i])` value in a register once per token/thread in + `ggml/src/ggml-cuda/gated_delta_net.cu`, then reuse it in both the KDA + `kv` and S-update loops. This preserves the same recurrence storage, + operation order at the algorithm level, and F32 state path. + +Gate harness notes: + +- First copied-harness attempt used a LocalAI worktree path that was not present + on DGX and failed before running gates. +- Second harness attempt refused to run because this job already owned the GPU + lock. +- First direct gate script had an `awk` quoting bug after producing partial + output. +- Corrected direct retry completed and is the valid gate artifact. 
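One of the failed attempts above came from an `awk` quoting bug rather than from the binary under test. A way to sidestep that class of failure is to do the hash comparison outside the shell pipeline. The sketch below is a hypothetical helper, not the harness used here; it assumes the gate hashes the raw completion output bytes, and it takes the canonical md5 values from the gate tables in this log.

```python
import hashlib

# Canonical greedy md5 values as recorded in the gate tables of this log.
EXPECTED_MD5 = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def md5_gate(model: str, output: bytes) -> tuple[bool, str]:
    """Return (passed, observed_md5) for one completion output."""
    observed = md5_hex(output)
    return observed == EXPECTED_MD5[model], observed


# Example with placeholder bytes; a real gate would read the completion output.
passed, observed = md5_gate("moe", b"placeholder output")
print("moe gate:", "green" if passed else f"failed, got {observed}")
```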
+ +Gates: + +| check | result | +|-------|--------| +| MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| `GATED_DELTA_NET` | `46/46`, `Backend CUDA0: OK` | +| `MUL_MAT` | `1146/1146`, `Backend CUDA0: OK` | +| `MUL_MAT_ID` | `806/806`, `Backend CUDA0: OK` | + +Same-window decode-only profile: + +| arm | source | active slots | depth start | depth mid | total kernel s | GDN ms | GDN share | `gdn_core` ms | `gdn_core` launches | `gdn_core`/launch | `mmq_nvfp4` ms | +|-----|--------|-------------:|------------:|----------:|---------------:|-------:|----------:|--------------:|--------------------:|------------------:|---------------:| +| baseline F32 | `/home/mudler/llama-phase81-bf16-state-source` | `128` | `73` | `95` | `3.6487` | `1481.06` | `40.59%` | `1399.46` | `597` | `2.344 ms` | `1424.65` | +| Phase83 exp-cache | `/home/mudler/llama-phase83-kda-gexp-source` | `128` | `66` | `88` | `3.5501` | `1487.71` | `41.91%` | `1405.62` | `600` | `2.343 ms` | `1317.98` | + +Decision: + +- Reject carry-forward. The target GDN bucket was flat-to-slightly worse: + `gdn_core` changed `1399.46 -> 1405.62 ms` (`+0.44%`), while per-launch cost + stayed effectively unchanged (`2.344 -> 2.343 ms`). +- The lower total kernel time is not credited to the shortcut because the + unrelated `mmq_nvfp4` bucket dropped by `106.67 ms` in the candidate sample. +- Do not regenerate LocalAI patch-series output for this experiment. Next GDN + work should target a structural traffic or launch-shape change, not + single-expression reuse inside the current core loop. + +### Phase82: BF16 Persistent GDN S-Cache f16 KL Gate + +- Date: 2026-07-01. +- Source: `/home/mudler/llama-phase81-bf16-state-source`, fork commit + `237ad9b96 feat(cuda): add BF16 Qwen GDN state cache`. +- Build artifact: `/home/mudler/llama-phase81-bf16-state-source/build-cuda`. +- KL artifact: + `/home/mudler/bench/phase82_bf16_s_cache_f16_kl/20260701_183016`. 
+- Result type: full MoE f16-reference KL gate for the Phase81 default-off + BF16 persistent GDN S-cache candidate. +- Reference base: `/home/mudler/bench/l4gate/klbase_moe.dat`, generated from + `/home/mudler/work/darwin_36b_opus/f16.gguf` at `-c 512 -b 2048 --chunks 16` + with f16 PPL `7.3760 +/- 0.29100`. +- Acceptance reference from `PAGED_BITEXACT_NOTE.md`: paged FP4-MMQ vs f16 + KLD `0.136000 +/- 0.003285`, PPL `7.4009`; non-paged FP4-MMQ vs f16 KLD + `0.136597 +/- 0.003157`. +- Run note: the script metadata hash lines hit an `awk` quoting issue, so + `BASE_SHA256` and `MODEL_SHA256_HEAD` are blank in `meta.txt`; both KL passes + completed and produced full logs. Treat the blank hashes as harness metadata + noise, not a model-output failure. + +Result: + +| arm | env | KLD vs f16 | PPL(Q) | PPL ratio vs f16 | same-top-p | max KLD | +|-----|-----|-----------:|-------:|-----------------:|-----------:|--------:| +| same-source F32 | `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1` | `0.136563 +/- 0.003242` | `7.418401 +/- 0.296694` | `1.006105 +/- 0.008899` | `83.725 +/- 0.578%` | `3.602697` | +| BF16 S-cache | `LLAMA_QWEN35_GDN_S_CACHE_TYPE=bf16` plus same env | `0.137162 +/- 0.003456` | `7.321044 +/- 0.290693` | `0.992902 +/- 0.008714` | `84.240 +/- 0.571%` | `5.973692` | + +Decision: + +- Reject promotion of the BF16 persistent GDN S-cache patch. +- Do not run serving A/B for this candidate under the current rules: the hard + lossy-path gate requires `KLD(new||f16) <= KLD(FP4-MMQ||f16)`, and the BF16 + S-cache mean KLD is above both the documented paged reference (`0.136000`) and + the same-source F32 measurement (`0.136563`). +- Keep the Phase81 source only as a local experimental branch unless the gate is + deliberately re-scoped. The next source attempt should preserve F32 recurrent + S-cache quality or reduce traffic without changing the MoE f16 KL band. + +### Phase81: Qwen35 BF16 Persistent GDN S-Cache + +- Date: 2026-07-01. 
+- Source: `/home/mudler/llama-phase81-bf16-state-source`, local fork patch in + `/home/mudler/_git/llama.cpp` branch `localai-paged`. +- Build artifact: `/home/mudler/llama-phase81-bf16-state-source/build-cuda`. +- Gate artifact: + `/home/mudler/bench/phase81_bf16_s_cache_gates/20260701_161350`. +- Profile artifacts: + - default F32: + `/home/mudler/bench/phase81_bf16_s_cache_profile/default_20260701_162117` + - BF16 S-cache: + `/home/mudler/bench/phase81_bf16_s_cache_profile/bf16_20260701_162028` +- KL smoke artifact: + `/home/mudler/bench/phase81_bf16_s_cache_kl/20260701_162322`. +- Result type: source experiment. `LLAMA_QWEN35_GDN_S_CACHE_TYPE=bf16` + stores Qwen35/Qwen35MoE persistent recurrent S cache in BF16 while keeping GDN + recurrence math, q/k/v/g/beta, and output in F32. Default remains F32. + +Implementation scope: + +- Added BF16 state support for `ggml_gated_delta_net_inplace_ids` only. +- Added CPU/CUDA BF16 state load/store conversion at the persistent cache + boundary. +- Added BF16 CPU/CUDA `SCALE` support because recurrent cache zeroing uses + `ggml_scale_inplace(..., 0)` on the S cache. +- Added tests for BF16 `GATED_DELTA_NET_INPLACE_IDS` and BF16 in-place `SCALE`. 
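The load/store conversion at the cache boundary is where precision is lost, since the recurrence itself stays in F32. The sketch below illustrates a float32 to BF16 round trip in plain Python, assuming round-to-nearest-even on the dropped 16 bits; the patch's CPU/CUDA conversion may differ in detail, NaN handling is omitted, and the function names are illustrative.

```python
import struct


def f32_to_bf16_bits(x: float) -> int:
    """Keep the top 16 bits of the float32 pattern, rounding to nearest even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    rounding = 0x7FFF + ((bits >> 16) & 1)  # NaN inputs are not handled here
    return ((bits + rounding) >> 16) & 0xFFFF


def bf16_bits_to_f32(b: int) -> float:
    """Widen a BF16 pattern back to float32 by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (b & 0xFFFF) << 16))[0]


def roundtrip(x: float) -> float:
    return bf16_bits_to_f32(f32_to_bf16_bits(x))


for value in (1.0, 0.1, 1.0078125):
    stored = roundtrip(value)
    print(f"{value!r} -> {stored!r} (abs err {abs(stored - value):.3e})")
```

BF16 keeps the float32 exponent range but only 8 bits of significand, so each store to the persistent cache can perturb the state slightly even though the math between load and store is unchanged.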
+ +Local verification: + +| check | result | +|-------|--------| +| RED test before implementation | `ggml_gated_delta_net_inplace_ids` rejected BF16 state at `state->type == GGML_TYPE_F32` | +| CPU `SCALE -p bf16` | `1/1` passed | +| CPU `GATED_DELTA_NET_INPLACE_IDS` | `2/2` passed | +| DGX CUDA build | completed for `llama-completion`, `llama-batched-bench`, `test-backend-ops`, `llama-server`, later `llama-perplexity` | + +Gates: + +| mode | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|------|---------|-----------|-----------|--------------| +| default F32 | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| BF16 S-cache | `07db32c2bcb78d17a43ed18bc22705cd` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Profile: + +| arm | env | active slots | depth start | depth mid | total kernel s | GDN ms | GDN share | `gdn_core` ms | `gdn_core` launches | `gdn_core`/launch | `mmq_nvfp4` ms | +|-----|-----|-------------:|------------:|----------:|---------------:|-------:|----------:|--------------:|--------------------:|------------------:|---------------:| +| default F32 | none | `128` | `65` | `87` | `3.6157` | `1480.44` | `40.94%` | `1399.30` | `599` | `2.336 ms` | `1394.28` | +| BF16 S-cache | `LLAMA_QWEN35_GDN_S_CACHE_TYPE=bf16` | `128` | `65` | `91` | `3.5244` | `961.61` | `27.28%` | `863.57` | `720` | `1.199 ms` | `1665.38` | + +KL smoke against same-source F32 base: + +| check | result | +|-------|--------| +| shape | MoE, `-c 256 -b 256 --chunks 32`, Wikitext-2 raw | +| F32 floor KLD vs F32 base | `0.000000 +/- 0.000000`, same-top-p `99.975%` | +| BF16 S-cache KLD vs F32 base | `0.055499 +/- 0.001705`, same-top-p `88.361%` | +| BF16 PPL ratio vs F32 base | `1.010356 +/- 0.005817` | + +Decision: + +- Carry forward as a default-off candidate and run Phase82 full gates. 
+- Do not make it default-on: MoE greedy md5 is not canonical, and the KL smoke is + not the full f16-reference acceptance gate. +- Required Phase82 before patch-series promotion: + full f16-reference KL gate for MoE and dense, same-source serving A/B against + F32 default and vLLM, then regenerate LocalAI patches from the fork only if + serving and KL both hold. + +### Phase80: GDN Identity-Ids Shortcut Source A/B + +- Date: 2026-07-01. +- Artifact root: + `/home/mudler/bench/phase80_gdn_identity_ids_ab/20260701_153927`. +- Arms: + - `A_baseline`: `/home/mudler/llama-phase6-source`, default source + `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`. + - `B_identity`: `/home/mudler/llama-phase80-gdn-identity-source`, one-file + default-off source patch in `ggml/src/ggml-cuda/gated_delta_net.cu`, + enabled with `GDN_ASSUME_IDENTITY_IDS=1`. +- Result type: source A/B of an identity-ids shortcut that skips the + non-identity scratch gather for one-token final-state decode and reads the + in-place state cache directly. +- Shape: same as Phase77 decode-only graph-node profile. +- Build: candidate CUDA build completed for `llama-completion`, + `llama-batched-bench`, `test-backend-ops`, and `llama-server`. 
+ +Gates: + +| arm | phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-----|-------|---------|-----------|-----------|--------------| +| `A_baseline` | pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| `A_baseline` | post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| `B_identity` | pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| `B_identity` | post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Capture: + +| arm | active slots | depth start | depth mid | `gdn_core` launches | +|-----|-------------:|------------:|----------:|--------------------:| +| `A_baseline` | `128` | `74` | `96` | `600` | +| `B_identity` | `128` | `65` | `87` | `600` | + +Result: + +| arm | env | total kernel s | GDN ms | GDN share | `gdn_core` ms | `gdn_gather` ms | GDN macro launches | +|-----|-----|---------------:|-------:|----------:|--------------:|----------------:|------------------:| +| `A_baseline` | none | `3.7132` | `1493.57` | `40.22%` | `1411.65` | `0.79` | `3600` | +| `B_identity` | `GDN_ASSUME_IDENTITY_IDS=1` | `3.5685` | `1489.96` | `41.75%` | `1409.28` | not present | `3000` | + +Decision: + +- Reject carry-forward/default for `GDN_ASSUME_IDENTITY_IDS=1`. +- The shortcut did remove the `gdn_gather` fine bucket and kept all gates + green, but the removed bucket was only `0.79 ms` over the capture and + `gdn_core` was effectively unchanged. +- The identity assumption is too narrow/risky for the size of the measured win. + Do not spend more parity time on gather-only GDN shortcuts unless a future + profile shows gather becoming material. +- Keep the next real GDN source scope on recurrent-state precision/traffic. + +### Phase79: GDN Decode BV32 Source A/B + +- Date: 2026-07-01. 
+- Artifact root: + `/home/mudler/bench/phase79_gdn_decode_bv32_ab/20260701_152530`. +- Arms: + - `A_baseline`: `/home/mudler/llama-phase6-source`, default source + `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`. + - `B_bv32`: `/home/mudler/llama-phase79-gdn-source`, one-file default-off + source patch in `ggml/src/ggml-cuda/gated_delta_net.cu`, enabled with + `GDN_DECODE_BV32=1`. +- Result type: source A/B of a decode-only `S_v=128`, `n_tokens=1`, + scalar-gate smaller-V-tile kernel inspired by vLLM's packed decode topology. +- Shape: same as Phase77 decode-only graph-node profile. +- Build: candidate CUDA build completed for `llama-completion`, + `llama-batched-bench`, `test-backend-ops`, and `llama-server`. + +Gate detail: + +- Candidate default gates before profiling were green: MoE md5 + `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 + `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT 1146/1146`, + `MUL_MAT_ID 806/806`. +- Candidate opt-in gates before the A/B were green with `GDN_DECODE_BV32=1`: + same md5 values, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. +- A/B baseline pre-gates were green. Baseline post-gate first run hit a + transient `MUL_MAT 1145/1146` failure on + `MUL_MAT(type_a=q4_1,type_b=f32,m=16,n=1,k=256,...)`; immediate retry at + `A_baseline/gate_post_retry` was green for md5, `MUL_MAT 1146/1146`, and + `MUL_MAT_ID 806/806`. +- `B_bv32` pre/post gates were green with `GDN_DECODE_BV32=1`. 
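Because the two capture windows in this A/B contain different numbers of `gdn_core` launches, the arms are compared per launch rather than by raw bucket time. A minimal sketch of that normalization, using the bucket times and launch counts recorded for this phase; the helper names are illustrative.

```python
def per_launch_ms(bucket_ms: float, launches: int) -> float:
    return bucket_ms / launches


def pct_change(before: float, after: float) -> float:
    return 100.0 * (after - before) / before


# (gdn_core ms, gdn_core launches) per arm, from the Phase79 capture.
baseline = per_launch_ms(1411.46, 600)
candidate = per_launch_ms(1426.17, 570)

print(f"baseline  {baseline:.3f} ms/launch")
print(f"candidate {candidate:.3f} ms/launch")
print(f"per-launch change {pct_change(baseline, candidate):+.1f}%")
print(f"raw bucket change {pct_change(1411.46, 1426.17):+.1f}%")
```

The raw bucket change understates the regression because the candidate window simply contains fewer launches.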
+ +Capture: + +| arm | active slots | depth start | depth mid | `gdn_core` launches | +|-----|-------------:|------------:|----------:|--------------------:| +| `A_baseline` | `128` | `67` | `89` | `600` | +| `B_bv32` | `128` | `72` | `93` | `570` | + +Result: + +| arm | env | total kernel s | GDN ms | GDN share | `gdn_core` ms | `gdn_core`/launch | `mmq_nvfp4` ms | +|-----|-----|---------------:|-------:|----------:|--------------:|------------------:|---------------:| +| `A_baseline` | none | `3.6274` | `1493.14` | `41.16%` | `1411.46` | `2.352 ms` | `1392.60` | +| `B_bv32` | `GDN_DECODE_BV32=1` | `3.5739` | `1502.89` | `42.05%` | `1426.17` | `2.502 ms` | `1363.65` | + +Decision: + +- Reject the BV32 decode source patch. +- Although all safety gates passed, normalized `gdn_core` worsened by about + `6.4%` per launch (`2.352 -> 2.502 ms`) and the GDN macro bucket increased + (`1493.14 -> 1502.89 ms`). +- Lower total kernel time in the candidate is not accepted as a win because the + capture contains fewer graph-node launches (`570` vs `600` `gdn_core`), while + the per-launch GDN core cost is worse. +- Do not retry smaller V-tile decode topology without a new profile-level + reason. The next GDN source hypothesis should attack recurrent-state + precision/traffic or another structural difference from vLLM. + +### Phase78: GDN Decode Launch-Shape Sweep + +- Date: 2026-07-01. +- Baseline artifact: + `/home/mudler/bench/phase77_moe_decode_only_profile/20260701_150134`. +- Sweep artifacts: + - `/home/mudler/bench/phase78_gdn_launch_sweep/nw8_cpw8_20260701_150654` + - `/home/mudler/bench/phase78_gdn_launch_sweep/nw16_cpw4_20260701_150954` +- Source baseline: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`. +- Result type: env-gated launch-shape sweep only; no source change. +- Shape: same as Phase77 decode-only graph-node profile. 
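The two env knobs in this sweep set the per-CTA geometry. As a rough sketch, assuming `GDN_NW` is warps per CTA and `GDN_CPW` is state columns per warp (so one CTA covers `GDN_NW * GDN_CPW` columns, as in the `4*8 = 32` tile described for Phase87), and assuming the `S_v = 128` columns are split evenly across CTAs, the probed shapes compare as follows. The helper is hypothetical and the CTA count is an inference, not a measured value.

```python
WARP_SIZE = 32  # threads per CUDA warp
S_V = 128       # state width used by the decode path in this log


def geometry(num_warps: int, cols_per_warp: int, s_v: int = S_V) -> dict:
    cols_per_cta = num_warps * cols_per_warp
    if s_v % cols_per_cta != 0:
        raise ValueError("column tile must divide S_v")
    return {
        "threads_per_cta": num_warps * WARP_SIZE,
        "cols_per_cta": cols_per_cta,
        "ctas_per_state": s_v // cols_per_cta,  # assumption: even column split
    }


# Default (16,8), the two Phase78 sweep arms, and the Phase87 4x8 probe.
for nw, cpw in [(16, 8), (8, 8), (16, 4), (4, 8)]:
    print((nw, cpw), geometry(nw, cpw))
```

Under these assumptions only the default shape covers all `S_v` columns in one CTA; every other shape trades that for more, smaller CTAs.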
+ +Result: + +| arm | env | gate status | GDN ms | GDN share | `gdn_core` ms | `gdn_core` share | `mmq_nvfp4` ms | +|-----|-----|-------------|-------:|----------:|--------------:|-----------------:|---------------:| +| Phase77 default | none | pre/post green | `1489.71` | `41.20%` | `1408.33` | `38.95%` | `1383.50` | +| sweep `8x8` | `GDN_NW=8 GDN_CPW=8` | pre/post green | `1525.86` | `41.94%` | `1443.55` | `39.68%` | `1366.33` | +| sweep `16x4` | `GDN_NW=16 GDN_CPW=4` | rejected | not run | not run | not run | not run | not run | + +Gate detail: + +- `8x8`: pre/post MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 + `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT 1146/1146`, + `MUL_MAT_ID 806/806`. +- `16x4`: completion md5 and `MUL_MAT 1146/1146` passed, but + `MUL_MAT_ID` failed `805/806`; rejected before profiling. + +Decision: + +- Keep the current default `GDN_NW=16 GDN_CPW=8`. +- Do not spend more GB10 time on launch-shape retunes without a new hypothesis. +- The funded source path remains a structural default-off GDN decode A/B/PoC + that reduces the Phase77 `gdn_core` bucket, not another existing-env sweep. + +### Phase77: MoE Decode-Only Graph-Node Profile + +- Date: 2026-07-01. +- Artifact: + `/home/mudler/bench/phase77_moe_decode_only_profile/20260701_150134`. +- Setup-hiccup artifact: + `/home/mudler/bench/phase77_moe_decode_only_profile/20260701_145815`. +- Source baseline: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`. +- Result type: current-stack llama.cpp decode-only graph-node profile; no + source change. +- Shape: MoE `q36-35b-a3b-nvfp4`, `N=128`, long-running `/completion` + requests, `N_PREDICT=2048`, capture after active decode. +- Capture window: active slots `128`; median decoded depth `67` at start and + `89` mid-capture; `CAPTURE_SECONDS=4`. +- Profiler: `nsys launch --cuda-graph-trace=node`, bucketed with + `/home/mudler/bench/bucket2.py`. 
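The macro and fine bucket tables in these profiles come from grouping per-kernel durations by kernel name. The sketch below shows the general idea only; it is not the logic of `bucket2.py`, and the substring rules and sample kernel names are invented for illustration.

```python
from collections import defaultdict

# Hypothetical (substring, macro bucket) rules; the first match wins.
RULES = [
    ("gated_delta_net", "GDN"),
    ("ssm_conv", "GDN"),
    ("mul_mat_q", "MoE/FFN-GEMM"),
]


def bucketize(kernels):
    """kernels: iterable of (kernel_name, duration_ms).

    Returns {macro_bucket: (time_ms, share_pct, instances)}.
    """
    time_ms = defaultdict(float)
    instances = defaultdict(int)
    total = 0.0
    for name, ms in kernels:
        total += ms
        macro = next((m for sub, m in RULES if sub in name), "other")
        time_ms[macro] += ms
        instances[macro] += 1
    return {
        m: (t, 100.0 * t / total if total else 0.0, instances[m])
        for m, t in time_ms.items()
    }


sample = [("gated_delta_net_cuda", 3.0), ("mul_mat_q_nvfp4", 1.0)]
for bucket, (ms, share, n) in bucketize(sample).items():
    print(f"{bucket}: {ms:.2f} ms, {share:.2f}%, {n} instances")
```

With `--cuda-graph-trace=node` each graph node is reported as its own kernel instance, which is what makes the per-bucket instance counts meaningful.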
+ +Gates: + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Macro buckets: + +| bucket | time ms | share | instances | +|--------|--------:|------:|----------:| +| GDN | `1489.71` | `41.20%` | `3600` | +| MoE/FFN-GEMM | `1400.77` | `38.74%` | `7220` | +| bf16/fp8-proj | `352.90` | `9.76%` | `7400` | +| layout-copy | `69.85` | `1.93%` | `10400` | +| act-quant | `67.63` | `1.87%` | `4820` | +| FA | `36.74` | `1.02%` | `600` | + +Fine buckets: + +| bucket | macro | time ms | share | instances | +|--------|-------|--------:|------:|----------:| +| `gdn_core` | GDN | `1408.33` | `38.95%` | `600` | +| `mmq_nvfp4` | MoE/FFN-GEMM | `1383.50` | `38.26%` | `4820` | +| `gdn_conv` | GDN | `71.76` | `1.98%` | `1200` | +| `gdn_l2norm` | GDN | `8.81` | `0.24%` | `1200` | +| `gdn_gather` | GDN | `0.80` | `0.02%` | `600` | + +Decision: + +- Phase77 confirms Phase76's GDN bucket is not only prompt/prefill + contamination. In an isolated decode window, `gdn_core` is the largest fine + bucket and is slightly larger than `mmq_nvfp4`. +- This supersedes the Phase75 no-GB10-GDN-source stance. The source-funded path + is no longer C=64 prefill inverse work; it is a narrow default-off GDN decode + A/B or standalone PoC based on the direct recurrent/packed decode structure + found in vLLM. +- Acceptance gate for the next source attempt: + reduce the Phase77 `gdn_core` bucket materially, keep pre/post md5 and + `MUL_MAT`/`MUL_MAT_ID` green, and show no serving/decode throughput + regression under the same decode-only capture shape. + +### Phase76: Current MoE Serving Graph-Node Profile + +- Date: 2026-07-01. +- Artifact: + `/home/mudler/bench/phase76_current_moe_profile/20260701_145116`. 
+- Setup-hiccup artifacts: + `/home/mudler/bench/phase76_current_moe_profile/20260701_144754` and + `/home/mudler/bench/phase76_current_moe_profile/20260701_144929`. +- Source baseline: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`. +- Result type: current-stack llama.cpp graph-node serving profile; no source + change. +- Shape: MoE `q36-35b-a3b-nvfp4`, `n=128`, `PTOK=128`, `GEN=64`, + `PARALLEL=128`, `CTX=131072`, production defaults. +- Profiler: `nsys launch --cuda-graph-trace=node`, bucketed with + `/home/mudler/bench/bucket2.py`. + +Gates: + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Serving result under graph-node profiling: + +| n | agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | ttft_mean_ms | wall_s | +|--:|--------:|---------------:|------------------:|------------:|-------------:|-------:| +| `128` | `204.1` | `320.7` | `2.06` | `1490.1` | `8365.1` | `40.146` | + +Macro buckets: + +| bucket | time ms | share | instances | +|--------|--------:|------:|----------:| +| GDN | `6669.16` | `32.88%` | `25980` | +| MoE/FFN-GEMM | `6264.88` | `30.88%` | `54406` | +| bf16/fp8-proj | `2772.38` | `13.67%` | `53880` | +| layout-copy | `1265.44` | `6.24%` | `81280` | +| ew-mul(weight/norm/GDN) | `734.61` | `3.62%` | `52464` | +| act-quant | `678.95` | `3.35%` | `37526` | +| FA | `264.50` | `1.30%` | `3660` | + +Fine buckets: + +| bucket | macro | time ms | share | instances | +|--------|-------|--------:|------:|----------:| +| `gdn_core` | GDN | `5876.94` | `28.97%` | `4680` | +| `gdn_conv` | GDN | `454.03` | `2.24%` | `7260` | +| `gdn_gather` | GDN | `237.87` | `1.17%` | `4680` | +| `gdn_l2norm` | GDN | `100.32` | `0.49%` | `9360` | +| `mmq_nvfp4` | 
MoE/FFN-GEMM | `6055.03` | `29.85%` | `34162` | + +Decision: + +- Phase76 contradicts the Phase75 assumption that GDN decode is not on the + current critical path. Under graph-node tracing of the current serving stack, + GDN is the largest GPU-kernel macro bucket and `gdn_core` alone is nearly + `29%` of captured kernel time. +- Do not patch `gated_delta_net.cu` yet. This profile is llama-only and + graph-node tracing depresses absolute throughput, so it is a source-funding + signal, not a source patch gate. +- Fund Phase77 as a narrow proof before backend edits: + compare current `gdn_core` against a vLLM-style direct recurrent/packed decode + PoC or an in-backend default-off A/B, with pre/post md5 and op gates, and + require a material reduction in the Phase76 `gdn_core` bucket without + regressing serving throughput or canonical md5. + +### Phase75: Post-PoC GDN/vLLM Audit + +- Date: 2026-07-01. +- Artifact: no new benchmark artifact. +- Source baseline: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`. +- Result type: subagent codebase audit and gate-setting only; no source change. +- Inputs: Phase74 artifact + `/home/mudler/bench/phase74_gdn_blocked_solve_poc/20260701_143711`, + llama.cpp GDN implementation, vLLM FLA/GDN implementation, and parity docs. + +Findings: + +- llama.cpp already has the M5 tensor-core GDN path default-on under paged KV. + It includes `KK/QK` mma, `KS/QS` 3xtf32 mma, `P*U` mma, explicit + `T=A^-1`, `U=T*RHS`, and state carry `Kc^T*DU`. +- The current backend path is fixed at `C=16` for GB10 shared-memory limits. + The remaining C=64/register-state class is not a shortcut patch. +- Phase74 tested a C=64 shared-memory explicit inverse-plus-apply scaffold and + failed its source-work gate: inverse/direct speed was `0.5941x` for weak + decay and `0.5927x` for mixed decay. +- vLLM has a structurally different one-token recurrent decode kernel that + updates state directly without chunk inverse, and a packed decode path that + avoids Q/K/V materialization copies. 
+  This is not currently source-funded in
+  llama.cpp because prior parity profiles showed llama.cpp GDN decode faster
+  than vLLM and decode serving dominated by host/MoE synchronization.
+- vLLM's CuTeDSL GDN prefill path uses SM10x/CUDA-13 Blackwell features
+  including TMA/tcgen05/CUTLASS DSL. Treat it as datacenter-Blackwell reference
+  evidence unless GB10 support is proven in the local toolchain.
+
+Decision:
+
+- Do not start GB10 GDN backend source work after Phase74/75.
+- Do not start a packed/recurrent GDN decode PoC unless a fresh same-session
+  profile shows GDN decode or Q/K/V materialization back on the critical path.
+- Phase75 acceptance gate for the next real parity attempt is a datacenter
+  Blackwell serving rerun with the Phase72 shape:
+  `NPL=8 32 128`, `PTOK=128`, `GEN=64`, `PARALLEL=128`, production defaults.
+- The rerun is valid only if `hardware.txt` records
+  `hardware_class=datacenter_blackwell`, pre/post md5 gates are green
+  (`8cb0ce23777bf55f92f63d0292c756b0`,
+  `5951a5b4d624ce891e22ab5fca9bc439`), `MUL_MAT 1146/1146` and
+  `MUL_MAT_ID 806/806` are green, and decode profiles include
+  `nsys --cuda-graph-trace=node`.
+- If datacenter Blackwell materially lifts llama/vLLM decode ratios above the
+  GB10 Phase72 record (`0.7561`, `0.7158`, `0.6935`), continue parity work on
+  that surface. If not, record the residual gap as engine/kernel architecture
+  rather than GB10 memory bandwidth and keep GB10 GDN stopped.
+
+### Phase74: GDN Blocked-Solve PoC Gate
+
+- Date: 2026-07-01.
+- Plan:
+  `docs/superpowers/plans/2026-07-01-gdn-blocked-solve-poc-phase74.md`.
+- Artifact:
+  `/home/mudler/bench/phase74_gdn_blocked_solve_poc/20260701_143711`.
+- Source baseline: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
+- Result type: standalone CUDA microbenchmark only; no llama.cpp source change.
+- Toolchain: CUDA `13.0.88`, `nvcc -O3 -arch=sm_121a`.
+- Hardware: NVIDIA GB10, `cc=12.1`, `48` SMs, `99 KB` dynamic shared memory.
+- Shape: `C=64`, `DK=128`, `DV=128`, `chunks=4096`, `iters=1000`.
+- Shared memory: direct solve/apply `81920` bytes; inverse-plus-apply
+  `98304` bytes.
+
+Result:
+
+| case | direct ms | inverse+apply ms | inverse/direct speed | direct NMSE | inverse NMSE | direct max abs | inverse max abs | max lower row sum |
+|------|----------:|-----------------:|---------------------:|------------:|-------------:|---------------:|----------------:|------------------:|
+| weak decay | `3.263936` | `5.493515` | `0.5941x` | `2.081e-14` | `2.755e-15` | `8.890e-07` | `2.415e-07` | `4.072` |
+| mixed decay | `3.275959` | `5.527584` | `0.5927x` | `1.981e-14` | `7.541e-16` | `8.115e-07` | `7.888e-08` | `1.635` |
+
+Decision:
+
+- Reject this explicit inverse-plus-apply shape as a backend source candidate on
+  GB10. It is numerically clean but materially slower than direct solve/apply.
+- Do not touch `ggml/src/ggml-cuda/gated_delta_net.cu` for the larger C=64 path
+  based on this attempt.
+- A future GDN source-work gate would need a substantially different
+  tensor-core blocked solve/register-state design, not this shared-memory
+  inverse scaffold.
+
+### Phase73: Datacenter Blackwell Rerun Readiness
+
+- Date: 2026-07-01.
+- Plan:
+  `docs/superpowers/plans/2026-07-01-datacenter-blackwell-rerun-readiness-phase73.md`.
+- Artifact: no new benchmark artifact.
+- Source baseline: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
+- Result type: harness/spec audit only.
+
+Evidence:
+
+- Phase72 is the current GB10 serving baseline. Default llama decode/vLLM
+  ratios remain `0.7561`, `0.7158`, and `0.6935` at `n=8/32/128`.
+- Grouped-MMQ/W4A16: Phase61 direct activation was the last structurally
+  distinct W4A16 shortcut; it failed its keep gate and stayed far behind
+  default FP4-MMQ. Phase66 quantize plus gather was only `5.10%`, below the
+  source-funding threshold.
+- GDN: Phase71 kept shipped M5 as default.
+  The remaining GDN gap is a larger
+  FLA/CuTeDSL-class C=64 blocked-solve/register-state implementation, not
+  another C32/QS/global-Ai/local reorder.
+- Harness: `paged-current-serving-snapshot.sh` already records
+  `hardware_class=datacenter_blackwell` for B200/B100/GB200, supports
+  `DRY_RUN=1`, `SERVED_MODEL_NAME`, and vLLM deployment overrides.
+
+Decision:
+
+- Do not start more GB10 grouped-MMQ/W4A16 source work.
+- Do not start GDN backend source work until a standalone C=64 blocked-solve
+  PoC records timing, numerical error, and resource estimates.
+- The next parity run should be on datacenter Blackwell hardware with the
+  existing same-session serving harness plus graph-node decode profiles.
+- No parity claim is made by this phase.
+
+### Phase72: TTFT Min32 Broader Serving
+
+- Date: 2026-07-01.
+- Plan: `docs/superpowers/plans/2026-07-01-ttft-min32-serving-phase72.md`.
+- Artifact:
+  `/home/mudler/bench/phase72_ttft_min32_serving/20260701_160730`.
+- Source: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
+- Shape: MoE serving, `NPL=8 32 128`, prompt `128`, generation `64`,
+  `PARALLEL=128`, `CTX=131072`.
+- Env gate: `LLAMA_TTFT_PREFILL_FIRST=1`
+  `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32`.
+
+Gates:
+
+| gate | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|------|---------|-----------|-----------|--------------|
+| pre default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| pre min32 | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | not run | not run |
+| post default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | not run | not run |
+| post min32 | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | not run | not run |
+
+Result:
+
+- Reject default-on for min32 in the broader serving shape.
+- Keep the scheduler knob opt-in only.
+- min32 regressed aggregate, decode, TTFT, and wall time for every tested
+  concurrency.
+
+### Phase71: GDN Tensor-Core Revalidation
+
+- Date: 2026-07-01.
+- Plan: `docs/superpowers/plans/2026-07-01-gdn-tc-revalidation-phase71.md`.
+- Artifact:
+  `/home/mudler/bench/phase71_gdn_tc_revalidation/20260701_153425`.
+- Source: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
+- Shape: MoE prefill, `PP=512,2048`, `TG=4`, `B=32`, `CTX=131072`.
+
+Canonical gates:
+
+| gate | env | MoE md5 | dense md5 | `GATED_DELTA_NET` | `MUL_MAT` | `MUL_MAT_ID` |
+|------|-----|---------|-----------|-------------------|-----------|--------------|
+| default | none | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | `1146/1146` | `806/806` |
+| sequential-disabled | `GDN_CHUNK_MIN=2147483647` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | not run | not run |
+| serial-chunked | `GDN_TC=0 GDN_CHUNK_MIN=64` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | not run | not run |
+| forced M5 | `GDN_TC=4 GDN_CHUNK_MIN=64` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `46/46` | not run | not run |
+
+MoE prefill:
+
+| arm | npp | S_PP t/s | T_PP s | S_TG t/s | total S t/s |
+|-----|----:|---------:|-------:|---------:|------------:|
+| default | `512` | `2313.57` | `7.082` | `401.82` | `2231.28` |
+| sequential-disabled | `512` | `2198.28` | `7.453` | `392.50` | `2122.58` |
+| serial-chunked | `512` | `1787.49` | `9.166` | `396.23` | `1740.12` |
+| forced M5 | `512` | `2323.18` | `7.052` | `393.62` | `2238.13` |
+| default | `2048` | `2422.88` | `27.049` | `389.91` | `2398.50` |
+| sequential-disabled | `2048` | `2361.22` | `27.755` | `386.08` | `2337.91` |
+| serial-chunked | `2048` | `1699.77` | `38.556` | `389.48` | `1688.69` |
+| forced M5 | `2048` | `2420.52` | `27.075` | `388.72` | `2396.11` |
+
+Ratios:
+
+| npp | default/sequential S_PP | default/serial S_PP | forced/default S_PP |
+|-----|------------------------:|---------------------:|--------------------:|
+| `512` | `1.0524` | `1.2943` | `1.0042` |
+| `2048` | `1.0261` | `1.4254` | `0.9990` |
+
+Decision:
+
+- Keep shipped GDN M5 default behavior.
+- Do not reopen smaller GDN C32/QS/global-Ai32/kernel-reorder work on GB10.
+- The stale "two-Gram PoC before M5 exists" framing is superseded by the
+  existing `0047` M5 implementation and this revalidation.
+
+### Phase70: BF16 F32 Output Broader Serving
+
+- Date: 2026-07-01.
+- Plan: `docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md`.
+- Artifact: `/home/mudler/bench/phase70_bf16_broader_serving/20260701_151500`.
+- Source: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`.
+- Shape: MoE serving, `NPL=8 32 128`, prompt `128`, generation `64`,
+  `PARALLEL=128`, `CTX=131072`.
+
+Gates:
+
+| gate | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|------|---------|-----------|-----------|--------------|
+| pre default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| pre opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | not run |
+| post default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| post opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | not run |
+
+Result:
+
+- Default-on rejected.
+- Opt-in remains correctness-clean, but broad serving is mixed-to-negative.
+
+### Phase69: Patch Series Mirror Readiness
+
+- Date: 2026-07-01.
+- Plan: `docs/superpowers/plans/2026-07-01-patch-series-mirror-readiness-phase69.md`.
+- Artifact: local dry-run only.
+- Result: current `0001..0063` series matched Phase37 tree
+  `dedb1182910eafe9f6875588dc8285bfb544cce5`; projected `0064..0073`
+  matched fork HEAD tree `fcf5720b659c5e1e2b487ccf3c8f7289bb12b9c4`.
+- Decision: patch regeneration is technically ready but blocked on explicit
+  push approval by policy.
+
+### Phase68: BF16 F32 Output Dense Serving
+
+- Date: 2026-07-01.
+- Plan: `docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md`.
+- Artifact: `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710`.
+- Serving artifact:
+  `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710/serving_ab_20260701_150249`.
+
+Dense prefill:
+
+| npp | default S_PP | opt-in S_PP | change |
+|-----|-------------:|------------:|-------:|
+| `512` | `973.13` | `975.52` | `+0.25%` |
+| `2048` | `1019.88` | `1021.39` | `+0.15%` |
+
+MoE serving `N=128`, prompt `128`, generation `128`:
+
+| metric | default | opt-in | change |
+|--------|--------:|-------:|-------:|
+| `agg_tps` | `409.8` | `415.0` | `+1.27%` |
+| `decode_agg_tps` | `615.3` | `627.2` | `+1.93%` |
+| `prefill_tps` | `1630.2` | `1648.0` | `+1.09%` |
+| `ttft_mean_ms` | `8574.7` | `8085.9` | `-5.70%` |
+| `wall_s` | `39.978` | `39.480` | `-1.25%` |
+
+Decision:
+
+- Carry as default-off opt-in candidate pending broader serving evidence.
+
+### Phase67: BF16 cuBLAS F32 Output
+
+- Date: 2026-07-01.
+- Plan: `docs/superpowers/plans/2026-07-01-bf16-cublas-f32-output-phase67.md`.
+- Artifact: `/home/mudler/bench/phase67_bf16_f32_out/20260701_144909`.
+- Fork commit: `ea0875d14 feat(cuda): gate BF16 cuBLAS F32 output`.
+- DGX mirror commit: `14fd69f1e`.
+- Env gate: `LLAMA_BF16_CUBLAS_F32_OUT=1`.
+
+Gates:
+
+| mode | MoE md5 | dense md5 | `MUL_MAT` |
+|------|---------|-----------|-----------|
+| default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` |
+| opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` |
+
+MoE prefill:
+
+| npp | default S_PP | opt-in S_PP | change |
+|-----|-------------:|------------:|-------:|
+| `512` | `2347.41` | `2402.34` | `+2.34%` |
+| `2048` | `2440.18` | `2456.54` | `+0.67%` |
+
+Decision:
+
+- Keep default-off pending dense and serving A/B.
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/DECODE_SERVING_SCOPE.md b/backend/cpp/llama-cpp-localai-paged/docs/DECODE_SERVING_SCOPE.md
new file mode 100644
index 000000000000..cd51d2d5d0f5
--- /dev/null
+++ b/backend/cpp/llama-cpp-localai-paged/docs/DECODE_SERVING_SCOPE.md
@@ -0,0 +1,422 @@
+# DECODE_SERVING_SCOPE - the continuous-serving decode gap
+
+**Status: S1 + S3 IMPLEMENTED, GPU-validated, bit-exact, shipped as patches
+0040 (S1) + 0041 (S3). S2 DROPPED (measured non-target). See the results block
+below; the rest of this doc is the design/rationale those patches implement.**
+
+## Results (GB10, measured)
+
+Phase 0 confirmed host-bound: serving graph reuse **0% over ~5k steps** (layer-A
+rebuilds every step), `hostproc` 3.44 ms/step vs 1.59 static - the +1.85 ms IS the
+graph rebuild; `set_inputs` 0.047 ms and block-table 0.002 ms are negligible.
+
+- **S1 (patch 0040)** - root cause: the paged decode inputs never overrode
+  `can_reuse` (defaults false), so the graph could never be reused. Fixed with a
+  256-bucketed-shape `can_reuse` + live-mctx refresh. Static batched-bench A/B:
+  paged decode reuse **0% -> 95.5%**, bit-exact (md5 byte-identical reuse on/off).
+  Necessary but **not** sufficient in serving (13.8% reuse alone - prefill
+  co-batching churns the shape).
+- **S3 (patch 0041)** - keeps prefill out of decode steps so the scheduler emits
+  reuse-stable pure-decode steps.
+  **S1+S3 together (128-client staggered serving,
+  MoE Qwen3.6-35B-A3B-NVFP4): reuse 0% -> 72.2%, `hostproc` 15.98 -> 6.31 ms/step,
+  decode 4.05 -> 5.52 tok/s/seq median (4.24 -> 5.96 mean, at vLLM's ~5.9).**
+- **S2 (double-buffer set_inputs) - DROPPED.** Phase 0 put `set_inputs` at
+  ~0.05 ms/step: it is not the cost (the rebuild is), so S2 has nothing to recover.
+- **Follow-up to ~100% reuse - PADDED/FIXED-SLOT DECODE SHAPE: IMPLEMENTED,
+  GPU-TESTED, REJECTED (not shipped).** See the "Padded-shape lever - rejected"
+  block below. Summary: it does NOT close the serving gap. Padding holds the
+  pure-decode width constant by emitting masked-inert dummy decodes for idle
+  slots, and it is provably inert (single-seq md5 bit-exact + per-stream
+  noise-floor determinism), but it **regresses throughput at every concurrency**
+  (catastrophically at low load) because the serving decode here is
+  **GPU-compute-bound, not host-rebuild-bound** - so the dummy-row compute it adds
+  costs more than the graph-reuse it recovers. The original "remaining ~28% is
+  request-boundary churn -> pad it" hypothesis stands mechanically, but the payoff
+  premise (closing reuse pulls decode toward vLLM) is **not supported by
+  measurement**.
+
+---
+
+## Padded-shape lever - rejected (implemented + GPU-tested, 2026-06-28)
+
+The S1 section-(a) **padded / fixed-slot decode shape** was implemented in an
+isolated worktree off the committed S1/S3/tail base (paged HEAD `05eceb4`), built
+CUDA-only, and benched on GB10. **Verdict: REJECTED - it regresses serving
+throughput and does not close the vLLM gap.** Recorded here so it is not re-tried.
+
+**Implementation** (default-off, `LLAMA_PAGED_PAD_DECODE=1`; `LLAMA_PAGED_PAD_WIDTH`
+caps the slot range): at the end of `pre_decode()`, on any step where no prompt
+tokens were admitted (`n_prompt_budgeted == 0`) and there is decode load, emit a
+masked-inert dummy decode for **every IDLE slot**
+(`batch.add(slot.id, 0, pos_max+1, /*output=*/true)`; cold slot -> fresh pos-0).
+This holds `n_tokens`, `n_seqs`, `n_seqs_unq`, `n_outputs` and the participating
+seq-id SET constant across arrivals/completions. A `release()`-side guard keeps a
+finished slot warm under padding (else patch 0024's reclaim-on-idle frees its KV
+and the next-step pos-0 re-warm churns paged-block allocation, destroying reuse).
+Each dummy is its OWN sequence, so its recurrent (gated-DeltaNet) state is private
+and its paged attention reads only its own cells; its logits are computed but
+never read (`post_decode()` only consumes `slot.i_batch` of GENERATING slots).
+
+**Gates.** (1) Single-seq greedy md5 **bit-exact PASS** - dense
+`5951a5b4d624ce891e22ab5fca9bc439`, paged-MoE `8cb0ce23777bf55f92f63d0292c756b0`
+(the lever lives only in `llama-server`'s `update_slots()`, never in
+`llama-completion`). (2) **Per-stream serving determinism**: the literal
+"ON-vs-OFF token sequences identical" gate is **unachievable** - concurrent
+cuBLAS/FA decode is **not bit-reproducible run-to-run** even with padding OFF
+(OFF-vs-OFF diverging streams: dense 3/16, MoE 8/16, lockstep K=16). The
+**achievable inertness gate PASSED**: per-stream prefix-agreement ON-vs-OFF equals
+the OFF-vs-OFF noise floor exactly (MoE 0.940/0.940, dense 0.812/0.812), i.e. the
+dummy slots inject no systematic divergence beyond the pre-existing concurrent FP
+noise. So padding is provably inert; it just does not help.
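
The inertness gate above compares ON-vs-OFF prefix agreement against an OFF-vs-OFF noise floor. A minimal sketch of such a check follows. The source does not define the metric, so this assumes prefix agreement is the length of the common token prefix divided by the reference stream length, averaged over streams. `prefix_agreement`, `mean_agreement`, `padding_is_inert`, and the tolerance are hypothetical names chosen for illustration, not part of the backend or its harness.

```python
from typing import Sequence


def prefix_agreement(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of stream `a` matched by `b` before the first divergence."""
    if not a:
        return 1.0
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / len(a)


def mean_agreement(run_a, run_b) -> float:
    """Average per-stream prefix agreement across two runs (same stream order)."""
    scores = [prefix_agreement(a, b) for a, b in zip(run_a, run_b)]
    return sum(scores) / len(scores)


def padding_is_inert(off_1, off_2, on, tol: float = 1e-3) -> bool:
    """Assumed gate: ON-vs-OFF agreement must not fall below the OFF-vs-OFF floor."""
    noise_floor = mean_agreement(off_1, off_2)  # run-to-run FP noise, padding OFF
    on_vs_off = mean_agreement(off_1, on)       # padding ON against the same reference
    return on_vs_off >= noise_floor - tol
```

Comparing against a measured noise floor rather than demanding identical tokens matches the finding that concurrent decode is not bit-reproducible even with padding OFF.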
+
+**Bench (MoE Qwen3.6-35B-A3B-NVFP4, GB10).** Burst h2h, decode tok/s/seq:
+
+| n   | S1+S3 | PAD  | vLLM  |
+|-----|-------|------|-------|
+| 8   | 28.16 | 6.05 | 44.8  |
+| 32  | 11.66 | 4.84 | 17.45 |
+| 64  | 7.16  | 4.33 | 11.07 |
+| 128 | 4.53  | 4.32 | 6.87  |
+
+Staggered (`serve_bench.py` k=128 n=160 stagger0.25), aggregate decode tok/s and
+graph-reuse: baseline (reuse 0%) **757.6**; S1+S3 (reuse 72%) **763.3**; **PAD
+(reuse 38%) 558.0**.
+
+**Why it fails (four independent reasons):**
+
+1. **Serving decode is GPU-compute-bound, not host-rebuild-bound (this run).**
+   Baseline reuse 0% (757.6 agg) is statistically equal to S1+S3 reuse 72% (763.3
+   agg): `hostproc` is only ~4-8% of the per-step wall, so eliminating the host
+   graph rebuild buys ~nothing. (This **corrects the host-bound hypothesis** above
+   for this hardware: the earlier 542->762 host-bound delta did **not** reproduce -
+   it was GPU-state/contention variance, not a stable reuse effect.)
+2. **Padding ADDS dummy-row compute** (full-width decode), costing throughput in
+   direct proportion to `pad_width - real_load`: catastrophic at low concurrency
+   (n=8: 28.16 -> 6.05, ~4.6x slower, because 8 real streams pay for a 128-wide
+   step).
+3. **In continuous serving padding can't even hold the width constant**: arrivals
+   are perpetually mid-prefill, so the idle-slot count varies and reuse DROPS
+   72% -> 38% (the opposite of the goal). It only stabilises the pure-decode
+   *tail* of a burst (verified: width pinned at 64 as real decoders fell 49->5),
+   which is exactly where the dummy compute is most wasteful.
+4. **The completion-driven batch shrink that padding prevents is itself a
+   throughput WIN** in a compute-bound regime (fewer real streams -> cheaper
+   steps -> survivors finish faster); forcing constant width forfeits it.
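
Reason 2 above can be checked against the burst table with a few lines of arithmetic. The sketch below copies the table values and assumes a pad width of 128, taken from the "128-wide step" remark. `pad_slowdown` and `dummy_fraction` are hypothetical helper names, and the dummy-row share is a simple row count, not a measured compute share.

```python
# Burst head-to-head decode rates (tok/s/seq), copied from the table above.
BURST = {
    8: {"s1s3": 28.16, "pad": 6.05, "vllm": 44.8},
    32: {"s1s3": 11.66, "pad": 4.84, "vllm": 17.45},
    64: {"s1s3": 7.16, "pad": 4.33, "vllm": 11.07},
    128: {"s1s3": 4.53, "pad": 4.32, "vllm": 6.87},
}

PAD_WIDTH = 128  # assumption: padded to the 128-wide step mentioned in reason 2


def pad_slowdown(n: int) -> float:
    """How many times slower PAD is than S1+S3 at concurrency n."""
    row = BURST[n]
    return row["s1s3"] / row["pad"]


def dummy_fraction(n: int, width: int = PAD_WIDTH) -> float:
    """Share of rows in a padded step that are dummies when n streams are real."""
    return (width - n) / width


for n in sorted(BURST):
    print(f"n={n:>3}  slowdown={pad_slowdown(n):.2f}x  dummy rows={dummy_fraction(n):.0%}")
```

The slowdown falls as the dummy-row share falls, which is consistent with the padded step costing roughly the same regardless of how many real streams it carries.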
+
+**Conclusion.** The residual burst gap (paged 4.53 vs vLLM 6.87 at n=128 ~= 66%)
+is a **GPU-compute** gap (vLLM's MoE decode kernel + scheduler are ~1.3x faster on
+aggregate), not a host-loop gap. A host-side graph-reuse lever cannot close it.
+Do not re-pursue padded/fixed-slot shapes for throughput; if the host loop is ever
+re-confirmed dominant on other hardware (re-run reason 1's baseline-vs-S1+S3 A/B
+first), revisit - but only with an *adaptive* width matched to live load, never a
+fixed pad-to-`--parallel`.
+
+---
+
+Per the "profile-don't-assume" rule in
+[`.agents/vllm-parity-methodology.md`](../../../../.agents/vllm-parity-methodology.md),
+**Phase 0 (section 5) is to confirm the bottleneck on GPU before touching any
+code.** Everything below the Phase-0 line is a hypothesis ranked by
+value/effort/risk, not a measured result.
+
+> **Regime warning (read first).** Every "decode is at the BW floor / ties vLLM"
+> and "host scheduling loop is the structural residual" conclusion in
+> [`README.md`](../README.md) section 5 was measured with **`llama-batched-bench`**:
+> a STATIC serving width (fixed `npl`, all sequences in lockstep, constant
+> batch shape every step). That is the **decode KERNEL** regime, and there the
+> patch series is at parity (paged ~6.1 tok/s/seq vs vLLM ~5.9 at npl128). This
+> document is about a **different regime**: real **continuous SERVING** through
+> `llama-server`'s `update_slots()` loop, where requests arrive and complete
+> asynchronously, the batch shape churns every step, and paged drops to ~3.7
+> tok/s/seq (-39%) while vLLM sustains ~5.9. The gap is the **scheduler / host
+> loop**, not the kernel. This is the serving analogue of the prefill-GEMM regime
+> split called out in [`PREFILL_GEMM_SCOPE.md`](PREFILL_GEMM_SCOPE.md).
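
The percentages quoted in the conclusion and the regime warning above follow from simple ratios. A small sketch using only the figures stated in the text is given here. Reading the -39% as "serving paged relative to static paged" is an inference from the arithmetic, not something the source states, and `rel_change` is a hypothetical helper.

```python
def rel_change(new: float, base: float) -> float:
    """Relative change of `new` against `base`, as a fraction."""
    return (new - base) / base


STATIC_PAGED = 6.1   # tok/s/seq, llama-batched-bench at npl128
SERVING_PAGED = 3.7  # tok/s/seq, llama-server continuous serving
VLLM = 5.9           # tok/s/seq, vLLM figure quoted for both regimes

print(f"serving vs static paged: {rel_change(SERVING_PAGED, STATIC_PAGED):+.1%}")
print(f"serving vs vLLM:         {rel_change(SERVING_PAGED, VLLM):+.1%}")
print(f"burst n=128 paged/vLLM:  {4.53 / 6.87:.0%}")
```

Under these inputs the -39% matches the static-paged baseline, while the same serving figure measured against vLLM comes out slightly smaller in magnitude.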
+
+Cross-links: [`README.md`](../README.md) sections 2 (scheduler), 3 (patches
+0008/0013/0016/0024/0025/0029), 5 (rejected levers - lever 2 graph coverage was
+FLAT *in the static regime*; this doc reopens it for the *serving* regime);
+[`.agents/llama-cpp-localai-paged-backend.md`](../../../../.agents/llama-cpp-localai-paged-backend.md)
+(bit-exact gate);
+[`.agents/vllm-parity-methodology.md`](../../../../.agents/vllm-parity-methodology.md)
+(both-engine ground-truth, per-lever A/B, record-rejected-levers).
+
+---
+
+## 1. The two regimes, and why the kernel-parity result does not carry over
+
+`llama-batched-bench` and a real serving workload exercise the **same decode
+kernels** but **different host loops**:
+
+| | `llama-batched-bench` (kernel regime) | `llama-server` continuous serving |
+|---|---|---|
+| batch shape per step | **constant** (fixed `npl`, lockstep) | **churns** (arrivals/completions, interleaved prefill) |
+| participating seq-set | **fixed** for the whole run | **changes** as requests start/finish |
+| graph reuse (see s.2) | holds after warmup -> 1 capture, replayed | breaks nearly every step -> rebuild + re-capture |
+| measured | paged ~6.1 tok/s/seq ~ vLLM ~5.9 | paged ~3.7 vs vLLM ~5.9 (-39%) |
+
+The README's decode parity, BW-floor, and "host loop is the irreducible
+residual" findings are all **kernel-regime** findings. They prove the *kernels*
+are not the serving gap. They do **not** prove the host loop is irreducible *in
+serving* - the static bench holds the batch shape constant, which is exactly the
+condition that lets both graph-reuse layers (section 2) stay hot. Serving
+violates that condition. So the serving gap is reopened here as a host /
+scheduler problem, orthogonal to the kernel.
+
+---
+
+## 2. Root-cause hypothesis (from source, pin `9d5d882d` + the dev tree)
+
+There are **two independent graph-reuse layers**, and continuous batching breaks
+**both** on nearly every step.
+This is the leading hypothesis for the -39%.
+
+### 2a. Layer A - llama-context graph reuse (`can_reuse` / `allow_reuse`)
+
+`llama_context::process_ubatch` (`src/llama-context.cpp` ~L1366) only **reuses
+the built ggml graph** when `res->can_reuse(gparams)` holds. `allow_reuse`
+(`src/llama-graph.h` ~L631) requires, among others:
+
+```
+ubatch.n_tokens == other.ubatch.n_tokens &&
+ubatch.n_seqs == other.ubatch.n_seqs &&
+ubatch.n_seqs_unq == other.ubatch.n_seqs_unq &&
+ubatch.equal_seqs() == other.ubatch.equal_seqs()
+// + (when equal_seqs) the participating sequence-id SET must match
+```
+
+In serving, `n_tokens` changes whenever the decode load `D` changes or a prefill
+chunk is co-batched, and the **sequence-id set** changes whenever a request
+starts or finishes. Either makes `can_reuse` return false, so `process_ubatch`
+falls into the `else` branch: **rebuild the graph** (`model.build_graph`) +
+`ggml_backend_sched_reset` + `ggml_backend_sched_alloc_graph` - full host-side
+graph construction + allocation, **every step**. In batched-bench all sequences
+are lockstep so `n_tokens`/seq-set are constant and `can_reuse` is true after
+warmup (the `graphs reused = N` perf line is ~all steps).
+
+### 2b. Layer B - CUDA graph capture (`ggml_cuda_graph_*`)
+
+Even when layer A reuses, the CUDA backend re-checks
+`ggml_cuda_graph_update_required` (`ggml-cuda.cu` ~L3367): it `memcmp`s every
+node's `ne`, `nb`, and `src[]->data` pointers against the captured graph. Any
+shape change -> `cudaGraphExecUpdate` / re-instantiate. Two serving-specific
+triggers:
+
+- **shape churn** (same root cause as layer A): different `n_tokens` -> different
+  node `ne` -> update required.
+- **paged data-pointer churn**: when a co-batched prefill allocates new KV blocks
+  (or a finished sequence frees them), the per-step KV view tensors' `data`
+  pointers move, so even a constant-shape decode step can trip the `memcmp`.
+  (The block-table *contents* live in a fixed device buffer filled by
+  `set_inputs`, so the table tensor pointer itself is stable - 0029 keeps that
+  cheap - but the K/V cache views are not.)
+
+Net: under serving, the GPU sits idle between launches while the host rebuilds
+the graph (layer A) and re-instantiates the CUDA graph (layer B), then runs an
+un-graphed `set_inputs` (H2D input copies) before each launch. vLLM avoids this
+with **padded/bucketed decode batch shapes + piecewise CUDA graphs**: it pads the
+decode batch to a fixed set of sizes and captures one persistent graph per
+bucket, so the steady-state decode step is a single `cudaGraphLaunch` with no
+host rebuild. Its scheduler is also a tight C++ loop with chunked-prefill
+interleave that keeps the GPU fed.
+
+### 2c. Per-step host work that runs un-graphed regardless (already instrumented)
+
+The dev tree carries a built-in `[L5INSTR]` profiler (`src/paged-attn.cpp`,
+hooks in `src/llama-context.cpp` and `src/llama-kv-cache.cpp`) that already
+isolates the host buckets we care about, printed at process exit:
+
+```
+[L5INSTR] get_block_table n=.. sum=..ms mean=..ms | set_inputs n=.. mean=..ms | hostproc n=.. mean=..ms
+```
+
+- `hostproc` = `mctx->apply()` + graph reuse-check/rebuild + `set_inputs`, i.e.
+  the whole host window **before** `graph_compute` (it does NOT include the GPU
+  launch). Prior profiles put this near ~1.4 ms/step.
+- `set_inputs` = the H2D input fills (positions, masks, block table, idxs).
+- `get_block_table` = the paged block-table host build (0029 caches it
+  within-step; `LLAMA_PAGED_NO_BT_CACHE` A/B-toggles that).
+
+If `hostproc` per step is a large fraction of the serving per-step wall time
+(and the `graphs reused` count is low), the gap is host-bound, not kernel-bound.
+
+### 2d. The serial-SSM host loop (named in README s.5, secondary here)
+
+The gated-DeltaNet decode advances recurrent state per step; sampling cannot
+start until logits land.
+The README already names this as a structural floor in
+the *kernel* regime. It is the same in serving but is the *smaller* term - the
+graph-rebuild/re-capture overhead (2a/2b) is the new, serving-specific cost the
+static bench hides, and it is the one to attack first.
+
+---
+
+## 3. What the already-shipped scheduler patches do (and do NOT do)
+
+These exist; understand them before proposing anything. **None of them touch the
+two graph-reuse layers** - they target prefill freezing and burst collapse, not
+steady-state decode-step host overhead. That is why the serving gap survives them.
+
+| Patch | What it does | What it does NOT do |
+|---|---|---|
+| 0008 cross-request prefix-share (server loop) | Concurrent shared-prefix requests prefill only the divergent suffix (fewer prefill tokens). | Does not stabilise decode batch shape; does not graph-reuse. |
+| 0013 `LLAMA_PREFILL_BUDGET` | Static per-step prefill-token cap (vLLM `--max-num-batched-tokens` analogue); flattens the ITL spike a long prefill inflicts on co-batched decode. | Ignores decode load; per-workload tuning; no effect on decode-step graph reuse. |
+| 0016 dynamic decode-first budget | `max(n_ubatch, T-D)` leftover-after-decode budget + per-slot chunk cap; decode claimed first, auto-shrinks as `D` rises. Stops a prefill chunk from inflating the step past `T`. | **Still lets the per-step decode `n_tokens` and seq-set vary**, so it does not make the decode step graph-reusable; it shapes prefill admission, not decode-shape stability. |
+| 0024 paged-pool burst-reclaim | Truncate/defrag/release KV blocks; fixes long-server prefill burst collapse (488->44->532 t/s). | Host accounting only; nothing about decode-step graph capture. |
+| 0025 `LLAMA_MOE_FORCE_GRAPHS` | Keeps CUDA graphs ON for the grouped-MMQ MoE decode step (lifts the conservative `MUL_MAT_ID` graph-disable). | Helps the CUDA-graph *eligibility* of one op; does **not** make layer-A/B *reuse* hold across churning steps. It is necessary-not-sufficient: a step that rebuilds anyway gets recaptured regardless. |
+| 0029 block-table within-step cache | `get_block_table` computed once per step, memcpy'd to other full-attn layers (-87/-91%). | Shrinks one `set_inputs`/`hostproc` sub-term; does not address rebuild/re-capture. |
+
+**README s.5 "lever 2 (graph/stream coverage): FLAT"** was concluded **in the
+static batched-bench regime**, where graphs already reuse - so more graph
+coverage was correctly a no-op there. That conclusion does **not** apply to the
+serving regime, where graphs do **not** reuse. This doc reopens graph coverage
+**for serving only**; record it as a regime-scoped reopening, not a contradiction.
+
+---
+
+## 4. Ranked lever plan (hypotheses - gate on Phase 0 first)
+
+Ranked by value/effort with bit-exactness/risk called out. All are **host-side /
+scheduler** levers (no decode-kernel changes), so all are *bit-exact-safe by
+construction* provided padding tokens are masked-inert and verified against the
+per-path md5 gate.
+
+### Lever S1 (TOP) - bucketed/padded decode-step shape for graph reuse
+
+**Value: high (targets the dominant -39% mechanism). Effort: medium-high. Risk:
+medium (correctness of padding inertness; seq-set churn is harder than n_tokens).**
+
+Make the steady-state decode step present a **stable, bucketed shape** to both
+reuse layers, mirroring vLLM's padded decode batch + piecewise CUDA graphs:
+
+- Pad the per-step decode `n_tokens` (and the stream/seq count the graph sees) up
+  to the next bucket in a small fixed set (e.g. {power-of-two or fixed grid}), so
+  `allow_reuse` (layer A) and `update_required` (layer B) hold across steps with
+  the same bucket. Padding tokens are dummy, masked positions that contribute
+  nothing to any real sequence's logits.
+- Bound the number of distinct live buckets so a handful of persistent CUDA
+  graphs cover steady decode (vLLM captures ~tens).
+- Handle the seq-set component of `allow_reuse`: bucketing `n_tokens` alone is
+  insufficient because the *participating sequence-id set* must also match. Either
+  (a) pad to a fixed stream-slot layout so the seq-set is stable across
+  arrivals/completions, or (b) relax/extend the reuse key so a pure-decode step
+  keyed on bucket+slot-layout reuses regardless of which slots are occupied. (b)
+  is the higher-leverage but more invasive option.
+
+Bit-exact gate: greedy md5 per path with padding ON must equal the recorded
+references (`5951a5b4` dense, `8cb0ce23` paged-MoE); `test-backend-ops`
+unaffected (no op changes). The risk is that masked/padded positions leak into a
+real logit (off-by-one in the mask) - the md5 gate catches it.
+
+### Lever S2 - overlap per-step host work with GPU decode (double-buffer inputs)
+
+**Value: medium-high (recovers the `hostproc` window even when S1 partial).
+Effort: medium. Risk: low (host-side reordering only, bit-exact-safe).**
+
+Even with graphs reused, `set_inputs` (+ the pre-`set_inputs` sync) runs
+un-graphed and serially *before* each launch (`hostproc` ~1.4 ms/step in prior
+profiles). Overlap the host scheduling + input build of step N+1 with the GPU
+decode of step N: double-buffer the input device tensors so the host can fill
+N+1's inputs while N's graph is in flight, and prepare the next ubatch / block
+table on the host concurrently. This is the llama.cpp analogue of vLLM keeping
+the GPU fed. Strictly host-side, no numeric change -> bit-exact. (0029 already
+banks part of this for the block table within a step; S2 extends it across
+steps.)
+
+### Lever S3 - graph-shape-stable scheduling (bridge from 0016)
+
+**Value: medium (multiplies S1; low marginal value without S1). Effort: low-medium
+(extends the existing 0016 policy).
+Risk: low (scheduler policy, bit-exact when
+the decode result is unchanged).**
+
+Extend the existing decode-first budget (0016) so the scheduler actively *prefers
+graph-reusable steps*: keep prefill chunks out of the decode step (run prefill in
+its own steps, or at a fixed chunk size) so the decode batch shape stays on a
+bucket rather than being perturbed by interleaved prefill tokens every step. This
+is the policy half of S1 - S1 makes a bucketed step reusable; S3 makes the
+scheduler emit bucketed steps. Pair them.
+
+**Rejected/deferred (record so they are not re-tried):**
+
+- **More CUDA-graph *coverage* alone (the README lever-2 redo): still FLAT
+  without S1.** Forcing more ops graph-eligible (beyond 0025) does nothing while
+  layer A rebuilds the graph every step - the recapture dominates. Only valuable
+  *after* S1 makes reuse hold.
+- **`GGML_CUDA_DISABLE_GRAPHS` / disabling graphs in serving: REJECTED a priori
+  as a fix** (it is an A/B *probe* for Phase 0, not a lever) - it removes capture
+  cost but also removes replay benefit; expected net-negative.
+- **Precision levers (W4A16, bf16-SSM): out of scope** - this gap is host-bound,
+  not GEMM/BW-bound (see README s.5 rejections; do not reopen).
+
+---
+
+## 5. Phase 0 - confirm it is host-bound BEFORE building (run when the GPU frees)
+
+Do NOT build any lever until this confirms host-bound. The dev tree already has
+all the instrumentation; this is a measurement, not a code change. **One GPU
+bencher at a time** (GPU-contention rule).
+
+**Workload.** Real continuous serving, not batched-bench: run `llama-server`
+(paged build) with the paged config and drive it with a steady concurrent
+streaming load (e.g. a K-client async generator hitting `/completion` with
+staggered arrivals so requests start/finish asynchronously - the regime
+batched-bench cannot produce).
Use the same models/flags as README s.4: +`-fa on -ngl 99`, `LLAMA_KV_PAGED=1` (+ `LLAMA_MOE_FORCE_GRAPHS=1` for MoE), +dense Qwen3.6-27B-NVFP4 and MoE Qwen3.6-35B-A3B-NVFP4. Pick K so the *effective +decode width* matches a static `npl` you have a kernel-regime number for (e.g. +~128) - that gives the apples comparison: static 6.1 vs serving 3.7 tok/s/seq. + +**Signals to capture (all already exist):** + +1. **Graph reuse rate.** The `graphs reused = N` perf line (`llama-context.cpp` + ~L4146, from `data.n_reused`) over total decode steps. Hypothesis: ~100% in + batched-bench, near 0% in serving. This is the single most decisive number. + A/B with `LLAMA_GRAPH_REUSE_DISABLE=1` (forces the rebuild path) - if serving + is already near that floor, layer-A reuse is the gap. +2. **`[L5INSTR]` host buckets** (printed at exit): `hostproc`, `set_inputs`, + `get_block_table` mean ms/step. Compare serving vs batched-bench. A/B the + block-table cache with `LLAMA_PAGED_NO_BT_CACHE`. +3. **GPU-busy %** in a steady-state serving window via nsys (sum of kernel + durations / wall) and the **inter-launch host gap** (time between consecutive + `cudaGraphLaunch`/kernel launches). Hypothesis: batched-bench ~96-99% busy + (README/methodology note the early "low util" was a window artifact); serving + materially lower, with the gap ~= `hostproc`/step. *Watch the same window + artifact* the methodology warns about - measure a clean steady-state span. +4. **CUDA-graph re-instantiation count** - confirm layer B is also re-capturing + (nsys shows `cudaGraphInstantiate`/`cudaGraphExecUpdate` per step, or add a + host-side counter print - host-side only, no kernel code). + +**Decision rule.** Host-bound (proceed with S1/S2/S3) if: serving `graphs reused` +is low AND `hostproc`/step is a large fraction of serving per-step wall AND +GPU-busy% drops vs batched-bench by ~the observed throughput ratio (~3.7/6.1). 
+If instead GPU-busy% stays high and per-kernel time grows, the cause is +elsewhere (e.g. serving runs a worse effective batch shape into the kernels) - +re-scope before building. + +**Ground-truth vLLM (both-engine rule).** Capture vLLM at the same concurrency: +GPU-busy% / step cadence (nsys) and its scheduler step time. Confirm vLLM stays +GPU-bound (persistent graphs) where paged goes host-bound - that is the +direct evidence the gap is the host loop, and it sizes the achievable win. + +--- + +## 6. Summary + +- The serving gap (paged 3.7 vs vLLM 5.9 tok/s/seq, -39%) is a **host/scheduler** + problem, distinct from the decode **kernel** (at parity in batched-bench). The + README's BW-floor/host-loop-residual findings are kernel-regime and do not + bound the serving regime. +- Leading mechanism: continuous batching's **batch-shape + seq-set churn breaks + both graph-reuse layers** (llama-context `can_reuse`, CUDA `update_required`) + every step, so the GPU idles while the host rebuilds + re-captures + runs + un-graphed `set_inputs`. vLLM avoids this with padded/bucketed decode shapes + + piecewise CUDA graphs. +- The shipped scheduler patches (0008/0013/0016/0024/0025/0029) target prefill + freezing + burst collapse, **not** decode-step graph reuse - which is why the + serving gap survives them. +- Top levers (all host-side, bit-exact-safe): **S1** bucketed/padded decode-step + shape for graph reuse, **S2** double-buffer/overlap per-step host work, **S3** + graph-shape-stable scheduling (extend 0016). Gate everything on **Phase 0**: + the `graphs reused` rate + `[L5INSTR]` host buckets + nsys GPU-busy% in real + `llama-server` serving vs batched-bench, with vLLM ground-truthed at the same + concurrency. 
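As a rough illustration of why lever S1 matters, the toy model below compares a graph-reuse key built from the exact decode width with one built from a padded bucket. It is a sketch under stated assumptions, not the llama.cpp `can_reuse` logic: the bucket ladder, the helper names `bucket_for` and `reuse_rate`, and the width trace are all hypothetical, and the model deliberately ignores the sequence-id-set component of the reuse key that the text says must also match.

```python
from bisect import bisect_left

# Hypothetical bucket ladder; the document does not specify bucket sizes.
BUCKETS = (8, 16, 32, 64, 128, 256)


def bucket_for(n_tokens, buckets=BUCKETS):
    """Return the smallest bucket >= n_tokens (the padding target for a decode step)."""
    if n_tokens < 1:
        raise ValueError("n_tokens must be positive")
    i = bisect_left(buckets, n_tokens)
    if i == len(buckets):
        raise ValueError("n_tokens exceeds the largest bucket")
    return buckets[i]


def reuse_rate(keys):
    """Fraction of steps whose reuse key equals the previous step's key."""
    if len(keys) < 2:
        return 0.0
    hits = sum(1 for prev, cur in zip(keys, keys[1:]) if prev == cur)
    return hits / (len(keys) - 1)


# Toy serving trace: the decode width churns by a few sequences every step
# as requests arrive and complete (illustrative values, not measurements).
widths = [120, 123, 121, 126, 124, 127, 122, 125]

exact = reuse_rate(widths)                              # key = exact n_tokens
bucketed = reuse_rate([bucket_for(w) for w in widths])  # key = padded bucket

print(f"exact-shape reuse: {exact:.0%}, bucketed reuse: {bucketed:.0%}")
```

In this toy trace every step changes the exact width, so an exact-shape key never repeats, while every width pads to the same bucket. A real implementation would additionally have to mask the padded positions and stabilise the slot layout, which is where the bit-exact md5 gate applies.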
+ + diff --git a/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md b/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md new file mode 100644 index 000000000000..25ada49024e4 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/docs/EXECUTION_REARCH_SCOPE.md @@ -0,0 +1,1272 @@ +# EXECUTION_REARCH_SCOPE: porting vLLM's execution architecture into the paged fork (additive program) + +Status: scope, not a result. This document reopens the GB10 vLLM-parity work on a +new thesis and lays out a phased, additive, falsifiable program. It supersedes the +per-lever "hardware floor" framing of [`VLLM_PARITY_FINAL.md`](VLLM_PARITY_FINAL.md) +*where that framing was wrong*, and keeps it *where it was right*. Read +[`VLLM_PARITY_FINAL.md`](VLLM_PARITY_FINAL.md), +[`VLLM_PARITY_LEVER_MAP.md`](VLLM_PARITY_LEVER_MAP.md), +[`PARITY_HANDOFF.md`](PARITY_HANDOFF.md) and +[`PREFILL_GEMM_RESULTS.md`](PREFILL_GEMM_RESULTS.md) before acting on anything here. + +Target model + hardware are unchanged: Qwen3.6 NVFP4 (dense 27B + MoE 35B-A3B hybrid +GDN-SSM) on GB10 / DGX Spark (sm_121a, mma.sync only, LPDDR5x ~273 GB/s). Reference +engine is vLLM v1 on the same GB10. + +--- + +## 1. Reframing: the 2-3x is software architecture, not silicon + +The prior two campaigns (June, then a 141-phase reopened one) A/B'd every single kernel +and every single execution-model boundary in isolation and rejected them, and concluded +"hardware floor". That conclusion is a **per-lever** verdict and it conflated two +different kinds of floor. On the *same silicon* vLLM is 2-3x faster at prefill and +serving; a same-silicon multiple is by definition a software-architecture delta, not a +hardware limit. The correct reframe: + +**Truly shared-hardware floors (bind vLLM too; not engineering debt, do not re-litigate):** + +1. 
**The high-N GDN recurrent-scan bandwidth plateau.** The scan moves ~32 GB/step of + f32 recurrent state, is 51% of decode and LINEAR in batch; both engines show the + same sublinearity (1.17-1.18x throughput for a 2x batch). Paged runs it at **83% of + the 273 GB/s LPDDR5x peak vs vLLM's 79%** - on this one floor paged already **leads**. + Lifts ~30x on B200 HBM, not on GB10. +2. **bf16 tensor-core peak = ~half FP4 peak on sm_121**, with no tcgen05 / CUTLASS + grouped-FP4 on consumer Blackwell (CUTLASS #3096). This is why vLLM itself runs a + bf16-Marlin fallback here and why native FP4-MMQ is optimal; it caps any + dequant-to-bf16 alternative for **both** engines. +3. **The GDN O(C^2) intra-chunk triangular solve under the 99 KB smem cap forcing C=16.** + Occupancy is not the bound (block-vote A/B: -1.04%); dtype is not the bound + (bf16-C64: -18.75%; explicit blocked-inverse: 0.59x of direct solve, Phase74). Joint + algorithm-plus-hardware ceiling. + +**ggml-architecture-conditional floors (the real "same-silicon 2-3x"; this program's target):** + +1. **The per-cgraph-node materialize-everything executor.** Root cause of the -79.4% + act-quant-into-MMQ failure, the inexpressible norm+quant+silu fusion, the + +21.4 us/tok convert/glue tax, and all six MoE-transplant regressions. vLLM's + persistent kernels + Triton fusions + expert-major pipeline never create these + intermediates. Unclosable one-boundary-at-a-time; must be a complete fused rewrite. +2. **The prefill grouped-GEMM tiling quality** (+56.5 us/tok). ggml grouped-MMQ shatters + into ragged small-M-per-expert tiles; vLLM's aggregated expert-major grouped GEMM + keeps tensor cores full at the *same* bf16-peak ceiling. Ceiling is hardware; the + tiling maturity gap to it is software. +3. **The ~17 pt serving graph-reuse overhead.** vLLM's padded/bucketed decode shapes + + piecewise CUDA graphs keep the GPU fed; ggml rebuilds/re-captures on batch-shape + churn. 
Largely closed by S1/D1; residual is S3-recoverable, bit-exact-safe. +4. **The ~8 pt vLLM server-number inflation** is pure measurement (chunked-prefill + overlap inflating vLLM's own server window), not a floor at all. + +**Goal of this program:** port vLLM's **execution architecture** (token-budget scheduler, +persistent-buffer full-graph execution, expert-major single-launch MoE, persistent-CTA +weight-reuse GEMM, chunked blocked-solve GDN, bf16-resident activation stream) into the +fork **additively** (new files, narrow additive hooks, default-off env gates), and let +the existing CUDA-only kernels slot in underneath. The failed ports failed not because +their kernels are GB10-hostile (mostly they are portable) but because each was dropped +one boundary at a time into an executor that materializes every intermediate to LPDDR5x, +so each partial port paid the temp-traffic cost without the persistent-kernel benefit. + +--- + +## 2. Why vLLM is faster on GB10 (ranked attribution + port forensics) + +All numbers are tagged. Source keys: **CDEF** = `dgx:~/bench/COMBINED_DEFINITIVE.txt` +(same-session both-engine, GIT_HEAD a7d439e). **LMAP** = +[`VLLM_PARITY_LEVER_MAP.md`](VLLM_PARITY_LEVER_MAP.md) profile-validated section +(both-engine nsys). **HNP** = graph-node-traced decode profile +(`--cuda-graph-trace=node`; `dgx:~/highN_prof2/`, `~/highN_vllm/`). **PGR** = +[`PREFILL_GEMM_RESULTS.md`](PREFILL_GEMM_RESULTS.md). **VPF** = +[`VLLM_PARITY_FINAL.md`](VLLM_PARITY_FINAL.md). **PH** = +[`PARITY_HANDOFF.md`](PARITY_HANDOFF.md). + +### 2a. Prefill (paged 395.9 vs vLLM 197.0 us/tok; gap 198.9; MoE 35B-A3B decision model) + +Prefill is NOT CUDA-graph-replayed, so these buckets are real per-token costs. 
+ +| Rank | Bucket | Delta us/tok | % gap | Mechanism (paged vs vLLM) | +|---|---|---:|---:|---| +| 1 | GDN prefill scan | +59.2 | 30% | hand f32 chunked scan `gdn_core` 95.7 vs vLLM FLA `chunk_gated_delta_rule` 36.5 = **2.62x**; O(C^2) intra-chunk solve + serial cross-chunk carry, C forced to 16 by 99 KB smem | +| 2 | GEMM pipeline | +56.5 | 28% | grouped-MMQ (FP4 wt x Q8_1 int8 act) 105 vs Marlin W4A16 (FP4->bf16 in-register + bf16 mma) 48.5 = **2.16x**; loses on ragged small-M-per-expert tiles under-utilizing TC, NOT a GEMV collapse | +| 3 | activation-dtype boundary tax | +21.4 glue + 15.2 act-quant = **+36.6** | 19% | `convert_dtype` 6.3% + `concat` 2.9% of wall are pure dtype/layout glue vLLM's bf16 stream never materializes; plus act-quant vLLM structurally does not pay (W4A16 = bf16 activations, zero act-quant) | +| 4 | projections + norms + gate | bf16-proj +13.7, gate +12.4, norms +11.1 = **+37.2** | 19% | paged runs these as separate memory-bound ggml ops; vLLM keeps FP8 projections and fuses norm/gate into Triton kernels | +| 5 | scheduler / MoE dispatch | +5.9 | 3% | explicit argsort+mm_ids+gather_mmq 8.6 vs 2.7; both cheap. vLLM runs its own count_and_sort/moe_align, does NOT fuse dispatch into the GEMM epilogue on GB10 | + +Sum of deltas = 195.4 ~ 198.9 (rounding): **the buckets close the measured gap.** +The executor-model tax is not a separate row; it is the *cause* of buckets 2, 3, 4. +Prefill S_PP ratios (CDEF, batched B=32): MoE **36.0% / 35.6%** of vLLM at PP=512/2048; +dense **42.2% / 42.8%**. + +**Note on the retired 232/68 claim.** `PREFILL_GEMM_SCOPE.md` flagged the "GEMM bucket +232 vs 68 us/tok" numbers as uncommitted early ground-truth needing re-confirmation. +The both-engine nsys re-confirmation revised them to **105 vs 48.5** (2.16x), and +reassigned the missing ~127 to the paged GDN scan (95.7 us/tok) and act-quant +(19 us/tok). 
**GDN scan, not GEMM, is the #1 prefill contributor.** Any reasoning that +still cites 232/68 or "GEMM is ~51% of the gap" is stale. + +### 2b. Serving / decode (the ~56% headline reconciled to ~86%) + +The old "paged decode 159 us/tok, GPU ~16% busy, host-bound" was a **measurement +artifact**: `nsys` without `--cuda-graph-trace=node` collapses each replayed decode +graph into one opaque launch. Re-profiled correctly (HNP), paged decode at npl=256 is +**99% GPU-busy (idle 1.4%), not host-bound**. + +Real decode decomposition (paged npl=256, HNP; GPU-steady 1082 us/tok = 924 t/s): + +| Bucket | us/tok | % decode | Note | +|---|---:|---:|---| +| GDN recurrent scan | 553 | 51% | LINEAR in batch; shared BW floor where **paged LEADS** (83% vs 79%) | +| NVFP4 expert GEMM | 254 | 23% | amortizes with batch; paged competitive | +| bf16 projections | 73 | 7% | vLLM uses FP8 here | +| elementwise | 57 | 5% | vLLM fuses into one Triton kernel | +| SSM conv | 31 | 3% | | +| GPU-idle | - | 1.4% | not host-bound | + +Reconciliation chain (must sum): + +| Measurement | t/s | % of vLLM-server | +|---|---:|---:| +| vLLM server (CDEF) | 1177 | 100% | +| vLLM **true GPU-steady** | 1078 | 92% (~8 pt = vLLM chunked-prefill-overlap window inflation) | +| llama **GPU-steady** | 924 | 78.5% (**= 86% of vLLM's true 1078**) | +| llama server (CDEF) | 718 | 60.7% (~17 pt = serving graph-reuse overhead, S3-recoverable) | + +Serving gap = **~8 pt measurement + ~17 pt scheduler/graph-reuse (recoverable) + +~14 pt GPU-steady kernel residual**. The 14 pt residual = MoE fused-Marlin +persistent-tiling (~+11 ms) + Triton elementwise fusion (~+10 ms). Decode CDEF ratios: +MoE perseq **70.0/65.2/59.4/55.6%** at N=8/32/128/256; **dense 116.7% at N=8** (paged +ahead) falling to 62.1% at N=256. + +### 2c. 
Single-stream tie vs batched 2.4-2.8x divergence: which property is load-bearing + +At single-stream / small-M both engines are weight-bandwidth-bound and the GEMM inner +loop is the same order of work, so they tie (corroborated in kind by the committed +"tie at static-wide-128", paged 782 vs vLLM ~819 t/s). When batched to B=32 x PP=512 +the workload becomes **compute-bound** and three M-invisible properties dominate: + +1. **Tensor-core utilization on aggregated large-M work.** vLLM's expert-major grouped + GEMM keeps TC full; grouped-MMQ shatters top-8-of-256 into ~4 tok/expert ragged + tiles (the +56.5 us/tok bucket, batched-only). +2. **The GDN chunked scan only exists at batched prefill** (decode uses the recurrent + path); its O(C^2) intra-chunk solve is the +59.2 us/tok #1 bucket, no single-stream + analogue. +3. **act-quant + convert/glue are M-proportional** (+36.6 combined), negligible at M=1. + +**Load-bearing property = tensor-core utilization on aggregated large-M work +(grouped-GEMM tiling quality + the GDN tensor-core solve), i.e. compute-kernel maturity, +not scheduling.** Dispatch is only +5.9 us/tok / 3% of the batched gap. This challenges +the older "dense AND MoE both converge to ~41% ⇒ scheduler-localized" interpretation: +the convergence reflects a **shared per-token compute structure** (dense and MoE share +the GDN + projection + norm stack; MoE just adds the expert GEMM), and the definitive +decomposition attributes ~97% of the batched-prefill gap to GPU compute kernels, ~3% to +dispatch. + +### 2d. Port forensics: kernel-intrinsic-on-GB10 vs ggml-integration-tax + +| Lever | Verdict | Why (integration tax vs kernel-intrinsic) | +|---|---|---| +| **0033** dequant-to-bf16 cuBLAS (dense large-M) | REJECTED -49/-42/-29% at M=512/1024/2048 (PGR) | BOTH: a separate global-memory dequant pass (~8x the FP4-MMQ read traffic, un-amortized), AND bf16 peak = ~half FP4 peak on sm_121 (real ceiling). GB10-hostile as a bf16-dequant approach. 
Bit-exact, KL-better; correctness never the issue | +| **0034** native FP4-MMA W4A4 | REJECTED in-backend despite winning PoC | PoC: 103 TFLOP/s = 57.7% FP4 peak, NMSE=0, beat cuBLAS-bf16 (kernel portable-in-principle, could *exceed* vLLM). Integration tax dominated: surrounded by act-quant + f32 converts + per-node launch. **Portable-with-prereqs** (fuse act-quant into GEMM prologue, remove f32 converts, live in the CUDA graph) | +| **0035** W4A16-Marlin grouped MoE | REJECTED -39% S_PP, correct + KL-better (KLD 0.131 < MMQ 0.137) | vLLM's *exact* sm_121 shape. Lost because the ggml drop-in still sat in ggml's materialize-every-node grouped-`mul_mat_id` harness at ragged small-M. **Portable-with-prereqs = the whole persistent expert-major executor, not the Marlin inner kernel.** Decode Marlin port lost -19.6% for the same reason | +| **Six one-boundary MoE transplants** (Phase113/114/122/123/125/127) | ALL REJECTED (flat or regress) | Phase124 profile: `mmq_nvfp4` 30.17% + `gdn_core` 29.25%, `act_quant` only 3.35%. Each transplant either attacked a boundary too small (122/123 flat) or added a sorted/padded temporary whose LPDDR5x traffic exceeded the boundary it removed (113/114/125/127 regress). **Portable-with-prereqs, and the prereq is all-or-nothing:** the win exists only as a complete fused persistent expert-major kernel | +| **bf16-C64 GDN** | REJECTED -18.75% | Kept our O(C^2) form-T solve and grew C to 64: makes the O(C^2) solve + serial recurrence worse; C=32 full-width needs 127 KB > 99 KB smem. Separately, Phase74 tested vLLM's blocked `solve_tril` standalone (C=64, tf32): explicit inverse-plus-apply ran at **0.59x** the direct solve (1.7x slower), smem at 98304/99 KB. Blocked-inverse validated **GB10-hostile** on this silicon. Shipped winner = M5 tf32 C=16 (+3.5% npp512, +17.7% npp2048) | + +--- + +## 3. The phased additive program + +Ordered by (expected recovery x confidence) / effort. 
Each phase names the ggml/fork +seam (Audit C), the files, the default-off env gate, the correctness gate (per-path md5 +if math-preserving, KL band if dtype-changing), a **falsifiable P0 kill-gate** with a +numeric go/no-go, the expected-recovery arithmetic grounded in section 2, effort, the +prior rejected lever it supersedes with the **missing prereq** that made the prior +rejection not apply, and upstream-clash / rebase-safety. + +The phases are **ordered and dependent**: P3 requires P1+P2 landed. That dependency is +precisely why the isolated 0034/0035 A/Bs failed - each was tested without its two +predecessors. + +Fork seams referenced below are against local `mudler/llama.cpp:localai-paged` +HEAD `1edddc8fe` (patch series 0001-0052; all file:line references below are against +that tree). The tree carries the MoE-region seam (patch 0052, `moe-ffn.cu` + the +whole-pattern matcher) and the grouped W4A16 Marlin prefill path (patch 0035). It does +**not** carry any P1/P3 scaffolding: the four experiment commits an earlier campaign +prototyped - `237ad9b96` bf16 GDN state cache, `afc2c7030` act-quant-route trace, +`ea0875d14` `LLAMA_BF16_CUBLAS_F32_OUT`, `7967ad47f` W4A16 direct-A stub - were +**trimmed** from the series by the immediately-preceding commit (`b529cc5420`, sync to +fork `1edddc8fe`) and no longer exist in the tree; they survive only as recorded +experiments in [`PARITY_HANDOFF.md`](PARITY_HANDOFF.md). P1's bf16-cuBLAS plank and P3's +direct-A stub therefore must be **re-introduced**, not "finished". The team has not +started P2/P4/P5/P6. + +### P1: bf16-native execution pass (kill the f32 convert / act-quant boundary tax) + +- **Goal:** delete the convert-in/convert-out on every op boundary and run + norm/add/rope/silu at half the memory traffic, so the residual/activation stream is + bf16-resident (as in vLLM) rather than f32-resident with bf16 only as an in-GEMM + transient. 
Targets prefill bucket 3 (+36.6) + part of bucket 4 (norms +11.1, glue), + and decode elementwise (57 us/tok, 5%). +- **Mechanism (Audit C Area 1, option A):** extend the existing fusion pass + `ggml_cuda_try_fuse` (`ggml-cuda.cu:4232`, called per node in the capture loop at + `:4908`) to recognize a residual-stream *segment* (norm -> proj-GEMM -> add -> norm) + and execute it through bf16 variants that keep the intermediate in a bf16 pool buffer, + converting to f32 only at the boundary a non-owned node reads. The GEMM already + computes through bf16 tensor cores; the win is deleting the per-op converts, not the + GEMM. Plank 1 is to re-introduce `LLAMA_BF16_CUBLAS_F32_OUT` (prototyped in the + trimmed `ea0875d14`, now absent from the tree - see section 3): GEMM writes f32 + directly from bf16 compute, skipping the round-trip pool alloc + convert. Reject option B + (bf16 tensor types at graph build in `llama-model.cpp`/`llama-graph.cpp`): it edits + the most rebase-sensitive shared files and forces a hard cut with no per-segment + opt-in; hold it for a datacenter-Blackwell reopen. +- **Files:** new `norm-bf16.cu` (rms_norm + the two 0042/0044 fused norms, templated on + IO dtype), bf16 case in `binbcast.cu` (residual add), bf16 instantiation in `rope.cu`, + bf16 `UNARY+MUL` SiLU-gate; the segment-detect rewrite as ONE additive clause in + `ggml_cuda_try_fuse`. GDN glue + attention io already bf16 (`gated_delta_net.cu`, + fattn). ~400-600 LOC. +- **Env gate:** `LLAMA_BF16_STREAM=1` (default off). +- **Correctness gate:** **KL band** (bf16 intermediates change accumulation; the + bit-exact md5 gate cannot hold and must not be forced). vLLM itself runs bf16 here so + the reference precision is the same. KL-benign category per + [`PAGED_BITEXACT_NOTE.md`](PAGED_BITEXACT_NOTE.md). +- **P0 kill-gate:** wire `LLAMA_BF16_STREAM` for ONE residual segment + (norm -> proj -> add) only; A/B the MoE-decision-model prefill wall at PP=512 with + `--cuda-graph-trace=node`. 
**GO** if the convert/glue share (`convert_dtype` 6.3% + + `concat` 2.9%) drops by >50% of its share AND KL vs the f32 reference stays in band + (same-top-p >= 84%, KLD delta < 0.01). **NO-GO** if net prefill regresses beyond + noise (> max(2%, 3 sigma) of control medians) - which would mean the segment-boundary + converts eat the win. +- **Expected recovery:** conservative ~30 of the +36.6 bucket-3 tax + ~15 of bucket-4 + (norms/glue) + the decode elementwise 57 us/tok fused. Prefill: ~45 us/tok. +- **Effort:** medium (templated re-instantiations + one rewrite clause). +- **Supersedes:** the -79.4% act-quant-into-MMQ fold and the +21.4 convert tax. + **Missing prereq now supplied:** those failed because the activation reached the GEMM + as f32 and every op boundary re-converted; a bf16-resident segment removes the + boundary entirely rather than folding the quant into an MMQ that has no TC for the + inline quant. +- **Upstream-clash / rebase-safety:** new `.cu` files are rebase-inert; the only shared + edit is one additive clause in `ggml-cuda.cu` (8 patches + upstream fusion churn - + the hottest surface, keep growth to the single clause). Do **not** add ggml tensor + types (avoids `ggml.h`, 5 patches). Rides upstream fusion machinery (`ggml_can_fuse`, + discussion #17621) by adding new clauses, not editing upstream's. + +#### P1 RESULT (landed 2026-07-02, `LLAMA_BF16_STREAM`, default-off) + +The bf16-resident residual-segment executor landed as three fork commits on +`mudler/llama.cpp:localai-paged` (new HEAD `653bb2f3d`, tree `6cf1523047`, base +`1edddc8fe`): `1271488fc` (segment executor + `norm-bf16.{cu,cuh}` + the +re-introduced `LLAMA_BF16_CUBLAS_F32_OUT` plank), `91373e1b9` (bf16 residual-add ++ rope op-variants), `653bb2f3d` (test sentinel). LocalAI series regenerated +additively as `0053-0055` (46 patches total); kill-gate at pin `0ed235ea`: all +patches apply and stage tree `6cf1523047` byte-for-byte == fork HEAD tree. 
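The throughput half of the P1 kill-gate stated above (NO-GO only if prefill regresses beyond `max(2%, 3 sigma)` of the control) can be written as a small check. This is a minimal sketch, assuming that "3 sigma" means three times the control's standard deviation expressed as a percentage of its median; the function names are hypothetical, and the `0.25` stdev in the example is a placeholder because the document does not report the P1 control spread. The two throughput values are the P0 S_PP medians reported in this section.

```python
def pct_delta(control, candidate):
    """Relative change of candidate vs control, in percent."""
    return (candidate - control) / control * 100.0


def noise_gate_pct(control_stdev_pct):
    """Noise gate: max(2%, 3 sigma), with sigma given in percent of the control median."""
    return max(2.0, 3.0 * control_stdev_pct)


def classify(delta_pct, gate_pct):
    """'regression' beyond the gate, 'win' beyond it the other way, else 'neutral'."""
    if delta_pct < -gate_pct:
        return "regression"
    if delta_pct > gate_pct:
        return "win"
    return "neutral"


# P0 kill-gate S_PP medians from the text: control 2310.94 t/s, bf16 arm 2323.24 t/s.
delta = pct_delta(2310.94, 2323.24)
gate = noise_gate_pct(0.25)  # placeholder stdev; not reported for P1 in the source
print(f"delta {delta:+.2f}% vs gate +/-{gate:.2f}% -> {classify(delta, gate)}")
```

Under these assumptions a sub-2% shift classifies as neutral rather than as a win, which matches how the results in this section treat deltas that sit at or inside the gate: not a NO-GO, but not a claimed speedup either.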
+ +- **Mechanism as-shipped (Option A, as scoped).** One additive clause in + `ggml_cuda_try_fuse` detects a residual-stream norm-producer (plain + `{RMS_NORM,MUL}` attn/GDN input norm, or the 0044 `{SILU,RMS_NORM,MUL,MUL}` + ssm_out gated-output norm) whose f32-output consumers are ALL large-M (M>=128) + cuBLAS-bf16 projections, runs the norm into a bf16 pool buffer via + `norm-bf16.cu` (bit-faithful to the f32 kernels up to the `__float2bfloat16` + store), executes the owned span inline through a bf16 view, then skips it. A + strict all-consumers-are-ours guard keeps the f32 norm un-materialised and + bails to the stock f32 path on small-M / decode / MMQ / native-FP4 / + multi-consumer. The `LLAMA_BF16_CUBLAS_F32_OUT` plank lets owned projections + write f32 directly from bf16 compute (F32_OUT else-branch byte-identical to the + original cuBLAS path). No upstream fuse clause edited; exactly 6 files, cmake + untouched (`.cu` globbed). +- **KEY REFRAME (why a first guard engaged 0).** q36 GDN/attention projections + (attn_qkv/gate, ssm_alpha/beta/out) are **BF16 weights, NOT NVFP4**; only the + MoE experts (`ffn_*_exps`) are NVFP4. The convert tax therefore lives at the + BF16 cuBLAS projection boundary (`op_mul_mat_cublas` src0==BF16 converts f32 + src1->bf16), not on the FP4-MMQ path (which pays act_quant, not convert). The + dense model quantizes its attn/GDN projections to NVFP4, so it **engages + nothing** and stays bit-identical. **bf16-stream is a MoE-model prefill lever.** +- **P0 kill-gate (`~/bench/p1_bf16_stream/killgate_20260702_135544`): GO.** One + segment (960 gate_norm->ssm_out engagements/prefill). `convert_unary` + fell 6840->5880 = exactly -960 (163.19->130.73 ms, -19.9%; share 2.27%->1.83%) + = 100% within-owned-segment drop (the kill-gate's stated criterion), no + boundary convert added. KL: control and bf16 arms **byte-identical** (KLD + 0.136563 both, same-top-p 83.725% both) => KLD delta 0.000 < 0.01. 
Prefill S_PP + +0.53% (2323.24 vs 2310.94 t/s), inside the 3-sigma noise gate. Default md5 + GREEN both models. (The total convert bucket only moved 4.83%->4.40% because + the minimal segment owns 1 of ~5 BF16 cuBLAS GEMMs per GDN layer; the >50% GO + is the within-segment 100%.) +- **P1 full build-out: 2240 segments/prefill** (2.33x P0's 960) = 960 + gate_norm->ssm_out (0044, single-consumer) + 1280 multi-consumer plain + rms_norm -> {attn q/k/v, GDN in_proj} BF16 projections. Prefill A/B (5 iters, + clean, captured before external contention): MoE @512 B=32 **+1.99%** + (2361.67 vs 2315.52 t/s; all 5 bf16 samples above all 5 ctrl; reproduced +1.89%), + @2048 B=8 +0.95%; dense @512 -0.09% / @2048 -0.10% (no-op). Recovered ~8.44 + us/tok @512 (wall 431.87->423.43), ~4.02 @2048. Both MoE deltas sit at the + max(2%, 3-sigma) floor => classified neutral, but consistent and reproducible + positive shifts; no prefill regression => not a NO-GO. Decode S_TG neutral + (M<128 bails). +- **KL gate GREEN (both models).** MoE bf16 KLD 0.136042 vs control 0.136563 => + delta **-0.00052** (bf16 slightly better: F32_OUT keeps the full f32 GEMM + result instead of the old bf16 round-trip), inside the +0.01 band; same-top-p + 84.461% vs 83.725% (>= 84% baseline). Dense: 0 engagements => bit-identical + (KLD delta 0, same-top-p 100%). +- **All correctness gates GREEN.** Default md5 canonical both models + (MoE `8cb0ce23`, dense `5951a5b4`); env-on md5 canonical both (small-M bails); + `test-backend-ops` MUL_MAT 1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET + 46/46, MOE_SWIGLU_DOWN 7/7, MUL_MAT_ID_RAGGED_MOE 6/6, BF16_STREAM_SEGMENT 4/4 + (default AND opt-in). Files: binbcast.cu +10, ggml-cuda.cu +297, norm-bf16.cu + +483, norm-bf16.cuh +37, rope.cu +31, test-backend-ops.cpp +79. 
+- **Honest magnitude / what remains.** The +1.9-2.0% @512 win is real, + reproducible, KL-benign (in fact KL-improving), and safe, but modest: + bf16-stream targets only prefill bucket 3 (the ~4.8%-of-wall convert/glue tax) + and owns the projection-boundary portion of it (~40% end-to-end), not the + GDN-scan (bucket 1) or GEMM-tiling (bucket 2) buckets. Read the "expected + recovery: ~45 us/tok" line above as an upper bound on the whole bucket-3+4 + region; this landing captures the bucket-3 projection boundary only. The next + P1 increment on the table = extend the multi-consumer executor to own the + bf16->f32 dst direction plus the remaining attn_norm-fed projection src1 + converts (~4 more converts/layer). Deferred (blocked only by an external + imatrix job contending the GPU, not a failed gate): the nsys graph-node bucket + table, decode S_TG @npl128, and the Phase130 serving A/B need a clean idle GB10 + re-run; the scope deems throughput-neutral serving acceptable on GB10. + +### P2: expert-major fused routed-FFN region executor (grow the merged MoE seam into the real thing) + +- **Goal:** drive both MoE GEMMs expert-major so the gate_up output never lands in + global memory, deleting the one intermediate still materialized today and the + redundant per-GEMM sort. Targets prefill bucket 2 (+56.5, the ragged-tile tax) and the + decode MoE fused-Marlin ~+11 ms residual. +- **Mechanism (Audit C Area 2):** the seam already exists. `moe-ffn.cu` + + `ggml_cuda_moe_whole_pattern_detect_early` (`:4157`) matches the + `gate_up (MUL_MAT_ID) -> VIEW -> SWIGLU -> down (MUL_MAT_ID)` chain and the hook + returns the node-skip count so the graph advances past the region. 
But it is a + *partial* executor: `ggml_cuda_moe_routed_ffn_poc` (`moe-ffn.cu:275`) still runs the + first GEMM as the stock node and **materializes its full `[2*n_ff, n_expert_used, + n_tokens]` intermediate**, only then fusing SwiGLU+quant (into the finalize epilogue + it also folds the weighted combine). A true region executor routes once, keeps the + token-sort/`ids_meta` resident, feeds each expert's gate+up tile straight into the + fused SwiGLU+quant into the down GEMM, and emits one unpermuted+combined result. +- **Files:** new ~400-600 LOC fused two-GEMM expert-major loop in `moe-ffn.cu` + (fork-owned), ~30 LOC hook change in `ggml-cuda.cu`. mmq.cu touched (5 patches). +- **Env gate:** new default-off env (e.g. `LLAMA_MOE_REGION_EXECUTOR=1`). +- **Correctness gate:** **KL band** (expert-major fusion changes FP accumulation order; + the finalize path is already recorded KL-benign, paged-MoE md5 `8cb0ce23`). +- **P0 kill-gate:** implement the expert-major region for ONE projection pair (remove + the materialized gate_up); A/B `MOE_SWIGLU_DOWN` + `MUL_MAT_ID_RAGGED_MOE` at + n=128 and n=257. **GO** if the n=257 (batched large-M) rows improve > 5% over the + grouped-MMQ control with the KL gate green. **NO-GO** if flat/regress like the six + prior transplants (that is the null hypothesis this phase must beat; a single removed + boundary is not enough, the whole region must be owned). +- **Expected recovery:** conservative ~40 of the +56.5 bucket-2 prefill tax (approaches + the bf16-peak ceiling with full TC utilization) + the ~11 ms decode MoE residual. +- **Effort:** high (single-kernel fused rewrite; the load-bearing lift of the program). +- **Supersedes:** all six one-boundary MoE transplants (113/114/122/123/125/127). 
+ **Missing prereq now supplied:** those paid the sorted/padded temp-traffic cost + without the persistent-kernel payoff because they ported one boundary into a + materialize-every-node cgraph; the win exists **only** as the complete fused region + that never materializes the intermediates. +- **Upstream-clash / rebase-safety:** the kernel is fork-owned in `moe-ffn.cu` + (rebase-inert); the hook is one narrow block in `ggml-cuda.cu`. Must keep the strict + view/consumer guard (region ownership is safe-by-construction but narrow: bail to + node-at-a-time if any other node reads `gate_up`/`glu`). **Open q for q36:** confirm + the dense shared-expert-per-layer does not alias the routed `gate_up` view before + widening ownership. CUDA-graph capture: all region kernels run inside the capture + loop; keep every pool alloc shape-stable across replays (keyed on n_tokens/n_experts, + never on data-dependent routing counts) or it forces re-capture. + +#### P2 RESULT (NO-GO, recorded 2026-07-02, `LLAMA_MOE_REGION_EXECUTOR`, default-off) + +The layout-only expert-major region executor was implemented, correctness-proven +on the synthetic sentinel, and A/B'd against the grouped-MMQ control at the P0 +kill-gate. **Verdict: NO-GO on two independent signals; nothing built beyond P0, +nothing landed.** The topic branch `p2-moe-region` is retained on the DGX fork for +forensics at `2d87564ddfa26f6c275dad0e1f0e3d8d5413e337` (base `localai-paged` +`653bb2f3d`, NOT pushed); the fork `localai-paged` HEAD is **untouched at +`653bb2f3d`** and the LocalAI series stays at 46 patches (`0001-0055`). This +records P2-at-this-granularity as a confirmed floor. + +- **(1) Primary GO metric FLAT (the kill-gate's stated criterion).** The kill-gate + required the n=257 (batched large-M) `MOE_SWIGLU_DOWN` rows to improve **> 5%** + over the grouped-MMQ control. 
Measured (region arm vs grouped-MMQ control, 5x + medians): control **1021.61 us**, region **1022.15 us** => **-0.05%** + (marginally slower). n=128: 804.87 vs 807.63 = -0.34%. `MUL_MAT_ID_RAGGED_MOE` + (lone MUL_MAT_ID, region never engages there): n=257 +0.48%, n=128 +0.28% (pure + noise, confirms no perturbation of the standalone grouped MMQ). All four deltas + sit inside the 5-sample spread => sentinel flat. **This reproduces the six prior + one-boundary MoE transplants (phases 113/114/122/123/125/127) - the null + hypothesis the scope said P2 had to beat.** A compact expert-major layout + a + single route-sort, with both GEMMs still ragged grouped-MMQ, does not move the + sentinel; the ragged-tile tiling (the actual +56.5 bucket-2 tax) is *unchanged* + by a layout swap. Closing bucket 2 needs P3's Marlin persistent-CTA aggregation, + not a P2 layout change. + - *Methodology caveat on the sentinel (reported as-is, it is the requested + metric):* `test-backend-ops` `eval_perf` duplicates only the down/out node + ~n_runs (~1000) times per timed iteration, so the single region invocation is + ~1/n_runs of the signal => the perf sentinel is structurally under-sensitive to + the region change. The flat verdict is corroborated by signal (2). (The n=257 + `MOE_SWIGLU_DOWN` case was added to both `make_test_cases_eval` and + `make_test_cases_perf`; the eval list already had n=128.) +- **(2) DECISIVE STRUCTURAL BLOCKER: the seam does not match q36's decision + graph.** `q36-35b-a3b-nvfp4.gguf` ships **separate** `ffn_gate_exps` + + `ffn_up_exps` (+ per-tensor `.scale`/`.input_scale`), **NOT** a merged + `ffn_gate_up_exps` (verified by GGUF tensor-name scan). `llama-graph.cpp` + `build_moe_ffn` therefore takes the separate-gate/up branch => + `ffn_moe_gate_scaled` + `ffn_moe_up_scaled` + `ggml_swiglu_split`. 
The + whole-pattern matcher `ggml_cuda_moe_whole_pattern_detect_early` requires the + merged `gate_up(MUL_MAT_ID) -> VIEW -> VIEW -> SWIGLU -> down` shape, which is + **absent** on q36. Result: `LLAMA_MOE_WHOLE_PATTERN_EARLY_TRACE` fires **0x** on + q36 (prefill AND decode); the region executor engages 0x; the pre-existing + POC/fused-quant (`LLAMA_MOE_ROUTED_FFN_POC=1 +FUSED_QUANT=1`) also engages 0x. + The region only engages on the synthetic merged-shape test sentinel (7 + engagements/pass, `MOE_SWIGLU_DOWN` 8/8 nmse-correct). **Even a positive sentinel + could not have translated to q36 without first extending the matcher + POC to the + separate/scaled/swiglu-split shape.** +- **KL gate: in-band but VACUOUS.** control KLD 0.136563 / same-top-p 83.725%; + region KLD 0.136563 / same-top-p 83.725% => delta **0.000000**, byte-identical. + In-band (delta < 0.01, top-p >= 84 baseline) but only because the region engages + 0x on q36 - it is not a KL-neutrality claim for the executor (that is the separate + 8/8 NVFP4 nmse sentinel). +- **S_PP @512 (npp512 ntg4 npl32, 5x):** control 2320.62 t/s (stdev 0.23%), region + 2316.70 t/s (stdev 0.24%) => -0.17% (flat; region == control at 0 engagement; + code-present, no regression). **Capture stability:** region S_PP stdev 0.24% + across 5 iters = no CUDA-graph re-capture thrash (pool allocs keyed on + n_tokens/n_experts held shape-stable). +- **All correctness gates GREEN, both arms** (default AND + `LLAMA_MOE_REGION_EXECUTOR=1`): `test-backend-ops` MUL_MAT 1146/1146, MUL_MAT_ID + 806/806, GATED_DELTA_NET 46/46, MOE_SWIGLU_DOWN 8/8, MUL_MAT_ID_RAGGED_MOE 6/6, + BF16_STREAM_SEGMENT 4/4. Default md5 canonical both models (MoE `8cb0ce23`, dense + `5951a5b4`); env-on also canonical (greedy prompt is small-M => region bails). + Region correctness where it *does* engage is proven by the 8/8 NVFP4 nmse match + incl. n=257 (ne_get_rows=2056). 
+- **Implementation (correct, committed on `p2-moe-region`, NOT pushed, ~407 LOC / 6 + files).** `moe-ffn.cu` `ggml_cuda_moe_region_executor`: one route-sort (ids_meta, + cur framing); gate_up grouped NVFP4 MMQ writes a **compact expert-major buffer** + via iota `ids_dst` (the token-order `[2*n_ff, n_used, n_tokens]` intermediate + never materialised); new `moe_swiglu_nvfp4_quant_compact_kernel` reads the compact + buffer by route-slot (no ids_src1 gather); down MMQ unpermutes to token order. + Strict all-consumers guard `ggml_cuda_moe_region_consumers_ok` bails if any node + outside the 5-node region reads gate_up/views/glu (covers shared-expert aliasing). + `LLAMA_MOE_REGION_TRACE`. +- **Honest delta vs expectation.** The scope's P2 line targeted ~40 of the +56.5 + bucket-2 prefill tax + the ~11 ms decode MoE residual. **Delivered: 0** (region + flat on its sentinel and 0-engagement on the decision model). The compact + expert-major layout is the wrong lever at this granularity: it swaps *where* the + intermediate lives without changing the ragged-tile GEMM tiling that owns the + cost. +- **Prerequisite handoff (gates P2 AND P3).** Before ANY MoE-region lever can + engage on q36, the seam - the whole-pattern matcher, the POC/fused-quant, AND the + region executor - must first be **rebuilt for q36's separate + `ffn_gate_exps`/`ffn_up_exps` + per-tensor `.scale` + `ggml_swiglu_split` FFN + shape**. The current seam only matches a merged shape q36 does not emit. The + correct next action is a re-scope of the seam to the separate/scaled shape as the + gating prerequisite, then re-evaluate whether a *fused two-GEMM* region (not a + layout swap) beats the sentinel - the scope's own null hypothesis holds that the + win exists only as the complete fused kernel that never materialises the + intermediates. 
+- **Artifacts (DGX `~/bench/p2_moe_region/`):** `focused_20260702_172644/` (perf + sentinels 5x, correctness OFF+ON, md5, S_PP@512 5x, KL) + `RESULTS.txt`; + `killgate_20260702_171826/` (engagement proof: `engage_moe.log`=0, + `engage_dense.log`=0); `build_20260702_145928/` (build logs). Environment: + `LLAMA_MAX_BATCH_TOKENS` unset, sm_121a, `nsys --cuda-graph-trace=node`, GPU lock + held. + +### P3: Marlin-class large-M GEMM retry, ON TOP of P1+P2 (the forensics-informed retry) + +- **Goal:** land the W4A16 Marlin-shape GEMM (FP4->bf16 in-register dequant + bf16 + mma.sync + cp.async double-buffer + dequant-once weight reuse across 16-64 M-rows) + that vLLM uses on sm_121, now that its two prereqs exist. Targets prefill bucket 2's + residual to the bf16-peak ceiling and the ragged-tile TC collapse. +- **Mechanism (Audit C Area 4):** add a `direct_a` W4A16 path. What exists in the tree + is the **grouped** W4A16 Marlin path (patch 0035: `w4a16-gemm.cu`/`w4a16-gemm.cuh`, + engaged by `ggml_cuda_w4a16_moe_grouped_should_engage` at the hook `ggml-cuda.cu:2797` + [`paged patch 0035`], gated by `LLAMA_W4A16_PREFILL_M>0`). What it lacks is a direct-A + variant that takes `src1` f32 directly with an `ids_to_sorted` map, fusing the + activation cast into the kernel and skipping both the host-side expert-sort and the + separate act-quant pass (the +15 us/tok the FP4-MMQ path pays). An earlier campaign + prototyped exactly this as the trimmed `7967ad47f` + (`ggml_cuda_mul_mat_id_w4a16_grouped_direct_a`, a `w4a16-policy.h` engage gate + `ggml_cuda_w4a16_direct_a_should_engage_params`: NVFP4 src0, f32 src1/dst, Blackwell, + `LLAMA_W4A16_PREFILL_M>0`, tokens > M, `k%64==0 && n%128==0`, unit-tested in + `test-cuda-w4a16-policy.cpp`), but that stub, its policy header, and its test were + **trimmed** (see section 3) and are **not** in the tree - they must be re-created on + top of the grouped path, with a new direct-A hook alongside the grouped one. 
Add a + one-time host-side weight repack cache into Marlin's interleaved layout (fork-owned + loader in `llama-model-loader.cpp`, off the per-step path). +- **Files:** the grouped Marlin kernel exists (`w4a16-gemm.cu`, fork-owned); the + direct-A variant (~300 LOC) + its policy header + unit test must be re-added, repack in + `llama-model-loader.cpp`, a new direct-A hook in `ggml-cuda.cu`. +- **Env gate:** `LLAMA_W4A16_DIRECT_A=1` + `LLAMA_W4A16_PREFILL_M>0` (default off). +- **Correctness gate:** **KL band** (bf16 dequant path; already characterized + KL-benign-and-better, KLD 0.131 < MMQ 0.137). +- **P0 kill-gate:** with P1 (convert-free bf16 activations) and P2 (persistent region + owning the tiling) landed, engage direct-A and A/B S_PP vs grouped-MMQ at + M=512/1024/2048. **GO** if S_PP >= grouped-MMQ + 5% at M >= 1024 AND KLD <= 0.137. + **NO-GO** if it reproduces the prior -39% / -19.6% - which would mean the prereqs are + still insufficient and the executor still materializes around the kernel. +- **Expected recovery:** the remainder of bucket 2 not captured by P2, up to the + bf16-peak ceiling. Combined P2+P3 target ~40-50 of the +56.5. +- **Effort:** medium (the grouped Marlin kernel exists as a starting point, but the + direct-A variant + policy + test were trimmed and must be re-created; the larger lift + is still the P1/P2 predecessors). +- **Supersedes:** 0035 (-39%) and 0034 in-backend fail. **Missing prereqs now + supplied:** P1 delivers bf16 activations to the GEMM without converts; P2 delivers the + persistent region that owns the tiling across both GEMMs so the bf16 activation is + read once (the prior loss was ggml MMQ re-quantizing the y-operand per weight-row-tile + x stream-k split). 
+- **Upstream-clash / rebase-safety:** `w4a16-gemm.cu`/`.cuh` fork-owned (the re-added + `w4a16-policy.h` will be too); can ride the in-tree multi-stream `concurrent_event` + machinery (`ggml-cuda.cu:4769`, `try_launch_concurrent_event` over + `stream_ctx.concurrent_events`) for the K-loop cp.async overlap instead of a private + mechanism. + +#### P3 RESULT (NO-GO, recorded 2026-07-02, `LLAMA_W4A16_DIRECT_A` + `LLAMA_W4A16_PREFILL_M>0`, default-off) - the GEMM-tiling bucket 2 is now a CONFIRMED FP4-MMQ-OPTIMAL FLOOR + +The direct-A W4A16 Marlin path was **re-created per the section-3 contract** (the trimmed +`7967ad47f` prototype rebuilt on top of the in-tree grouped 0035 kernel), engaged behind +`LLAMA_W4A16_DIRECT_A=1`, and A/B'd against the FP4-MMQ default at the P0 kill-gate. +**Verdict: NO-GO by a wide margin (-46.9/-48.0/-49.1% at M=512/1024/2048); nothing built +beyond P0, nothing landed.** The forensics retry that motivated the phase is now +**refuted**: the integration tax the scope named (section 2d) was **genuinely removed** +(act-quant 18.92 -> ~0 us/tok on the expert path, the host expert-sort + src1-gather + +separate cast pass eliminated) and direct-A **still lost**. This settles prefill bucket 2 +(GEMM tiling, +56.5 us/tok) as a **kernel-intrinsic, FP4-MMQ-optimal floor on GB10**, +joining bucket 1 (GDN scan, P5-confirmed). The topic branch `p3-w4a16-direct` is retained +on the DGX fork at `8eef7ba4335ffd2ed7babd5e5dae71fa1fe8f688` (base `localai-paged` +`653bb2f3d`, NOT pushed); the fork `localai-paged` HEAD is **untouched at `653bb2f3d`** and +the LocalAI series stays at 46 patches (`0001-0055`). + +- **PERF GO GATE FAILED DECISIVELY.** GO required `S_PP(direct-A) >= FP4-MMQ + 5%` at + `M >= 1024` AND `KLD <= 0.137`. 
Measured (MoE `q36-35b-a3b-nvfp4`, killgate 3-iter + medians, `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 -ngl99 -fa on -ntg4 -npl32 -c73728`): + **npp512 1176.8 vs 2215.3 = -46.88%; npp1024 1201.1 vs 2309.7 = -48.00%; npp2048 1222.0 + vs 2400.2 = -49.09%.** All NO-GO by a wide margin (direct-A stdev 0.07-0.56%, clears + `max(2%, 3sigma)` with no 3-sigma question). +- **CALIBRATED NULL-HYPOTHESIS BASELINE (the -39% / -19.6% priors reproduced).** A separate + calibration run measured the in-tree **grouped** W4A16 (0035) vs FP4-MMQ at + **-43.96/-43.58/-44.72% @512/1024/2048** - reproducing and *exceeding* the historical + 0035 -39% and the -19.6% prior. **direct-A is even slower than grouped**: the fused + in-kernel f32 A-gather pessimizes the kernel further. So the harness/settings are the + same as the prior campaign (the null baseline lands where it always did), and the win vs + the old number is not a measurement-setup artifact. +- **ROOT CAUSE, fully decomposed by nsys `--cuda-graph-trace=node` (npp2048 graph-node + buckets).** The mature bf16 grouped-W4A16 expert GEMM = **323.90 us/tok = 1.97x** the + FP4-MMQ int8 expert GEMM (**~164.6 us/tok**) = **exactly the bf16 = half int8/FP4 + tensor-core peak ratio on sm_121**. Consumer Blackwell GB10 has **no bf16-peak headroom** + over FP4/int8, so a W4A16 (FP4->bf16 in-register + bf16 mma) path cannot beat the native + FP4-MMQ int8 path - the ceiling is silicon. **Novel sub-finding:** fusing the A-gather + *in-kernel* (direct-A) is a **NET PESSIMIZATION** vs a cheap separate bf16 pre-cast: it + drove the kernel **323.90 -> 451.86 us/tok (+127.96)** while removing only ~63 us/tok of + tax - a **GB10-specific inversion of P5's no-round-trips heuristic**, because an in-kernel + f32 gather doubles A-operand traffic and halves occupancy, whereas a full-occupancy bf16 + pre-cast is cheaper on this low-bandwidth memory. 
A residual **+30 us/tok dst-unsort + `get_rows`** the host-loop path keeps (and FP4-MMQ fuses on-device) is real but + ~1/10 of the ~2x kernel gap - even zeroed it cannot close bucket 2. +- **KL BAND GREEN / in-band (and better than the control).** direct-A **KLD 0.130260, + same-top-p 85.172%** (16-chunk canonical) vs FP4-MMQ control **0.136563 / 83.725%** => + in-band (<0.137, top-p >= 84 baseline) and slightly *better* than FP4-MMQ. Correctness was + never the issue; the bf16-dequant W4A16 path is KL-benign-and-better, exactly as the scope + predicted. It is simply slower. +- **DENSE NULL CONTROL +0.05%** (`dense_spp1024_delta_pct = 0.05`): direct-A is a MoE-only + `mul_mat_id` hook; the dense model's projections are plain `mul_mat` and are untouched. +- **All correctness gates GREEN, both arms** (default AND `LLAMA_W4A16_DIRECT_A=1` + + `LLAMA_W4A16_PREFILL_M>0`): default md5 canonical both models (MoE `8cb0ce23`, dense + `5951a5b4`), env-on also canonical both (small-M/greedy bails to the byte-identical + default); `test-backend-ops` default MUL_MAT 1146/1146, MUL_MAT_ID 806/806, + GATED_DELTA_NET 46/46, MOE_SWIGLU_DOWN 7/7, MUL_MAT_ID_RAGGED_MOE 6/6, plus DIRECT_A-on + MUL_MAT_ID 806/806. **Engagement PROVEN:** 7680 direct-A engagements env-on (the K=2048 + N=512 gate/up expert GEMM), **0** in default (default-silent). +- **Honest delta vs the ~40-50 of +56.5 expectation.** Combined P2+P3 targeted ~40-50 of the + bucket-2 +56.5 tax. **Delivered: 0** (P2 flat + layout-only, P3 -48/-49% and slower than + grouped). Bucket 2 is now confirmed FP4-MMQ-optimal on GB10 - the binding ceiling is the + bf16 = half-FP4/int8 tensor-core peak on sm_121, which lifts only on datacenter Blackwell + (tcgen05 / CUTLASS grouped-FP4). 
Corroborated by `VLLM_PARITY_LEVER_MAP.md:1100` + (offline-repack + verbatim vLLM Marlin already rejected -39% at the same bf16-peak ceiling) + - which is **also why the one-time host-side repack cache was deliberately NOT built**: a + repack changes the weight layout, not the mma dtype, so it cannot move a 1.97-2.74x + bf16-peak floor. Documented decision, not an omission. +- **Implementation (correct, committed on `p3-w4a16-direct` @ `8eef7ba43`, NOT pushed, per + the re-creation contract).** `w4a16-policy.h` (pure host-testable engage predicate: NVFP4 + src0 + f32 src1/dst + Blackwell + `LLAMA_W4A16_DIRECT_A=1` + `LLAMA_W4A16_PREFILL_M>0` + + tokens>M + `k%64==0 && n%128==0` + src1 row-contiguous) + `tests/test-cuda-w4a16-policy.cpp` + (14/14 host unit test); `w4a16-gemm.{cu,cuh}` direct-A kernel (reads src1 f32 directly via + `ids_to_sorted`, fuses f32->bf16 in the A-load, no `get_rows`/cast/intermediate, + dequant-once weight reuse) + host launcher; `ggml-cuda.cu` `mul_mat_id` hook (guards the + src1 `get_rows` + adds the direct-A dispatch). Two A-fusion variants A/B'd: **v1** + cp.async f32-staging + smem-convert (57 KB smem, npp1024 ~1201 t/s, committed as best) and + **v2** synchronous low-smem gather+convert (17 KB, ~975 t/s, worse); both < grouped < + FP4-MMQ. +- **Artifacts (DGX `~/bench/p3_w4a16_direct/`):** `calib_20260702_232353/` (grouped-W4A16 vs + FP4-MMQ calibration baseline), `killgate_20260702_235119/` (S_PP A/B 3 shapes x 3-arm x + 3-iter + dense null control + engagement + md5 + test-backend-ops; RESULTS.txt), + `nsyskl_20260703_001212/` (`nsys --cuda-graph-trace=node` `prof_{default,da,gr}.nsys-rep` + + `kern_*.csv` 3-arm buckets + 16-chunk KL `kl_{ctrl,da,gr}.log`; RESULTS.txt), + `build_v1r_*.log`. Environment: GPU lock held throughout + released; `LLAMA_MAX_BATCH_TOKENS` + unset; sm_121a; nsys `--cuda-graph-trace=node`; 3+ iter medians + sigma. 
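
The engage conditions listed in the Implementation bullet above are described as a pure, host-testable predicate. A minimal sketch of what such a predicate could look like follows. The struct `DirectAParams`, its field names, and the functions `direct_a_should_engage` / `make_example_params` are hypothetical illustrations, not the contents of the real `w4a16-policy.h` on `p3-w4a16-direct`. The sketch only encodes the conditions the bullet enumerates, and it assumes the caller has already resolved the device check and the two env values.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical parameter bundle: every field name here is illustrative only.
struct DirectAParams {
    bool    src0_is_nvfp4;        // weights are NVFP4
    bool    src1_is_f32;          // activations are f32
    bool    dst_is_f32;           // output is f32
    bool    is_blackwell;         // device check, resolved by the caller
    bool    env_direct_a;         // LLAMA_W4A16_DIRECT_A=1
    int64_t env_prefill_m;        // LLAMA_W4A16_PREFILL_M (0 means off)
    int64_t n_tokens;             // tokens in the batch
    int64_t k;                    // GEMM K
    int64_t n;                    // GEMM N
    bool    src1_row_contiguous;  // src1 rows are contiguous in memory
};

// Pure predicate: no CUDA calls, so it can be unit-tested on the host.
inline bool direct_a_should_engage(const DirectAParams & p) {
    if (!p.src0_is_nvfp4 || !p.src1_is_f32 || !p.dst_is_f32) return false;
    if (!p.is_blackwell)                                     return false;
    if (!p.env_direct_a || p.env_prefill_m <= 0)             return false;
    if (p.n_tokens <= p.env_prefill_m)                       return false; // needs tokens > M
    if (p.k % 64 != 0 || p.n % 128 != 0)                     return false;
    return p.src1_row_contiguous;
}

// Example shape taken from the text: the K=2048, N=512 gate/up expert GEMM.
// The token count and M threshold are made-up illustration values.
inline DirectAParams make_example_params() {
    DirectAParams p{};
    p.src0_is_nvfp4 = true;  p.src1_is_f32 = true;  p.dst_is_f32 = true;
    p.is_blackwell  = true;  p.env_direct_a = true; p.env_prefill_m = 512;
    p.n_tokens = 1024;       p.k = 2048;            p.n = 512;
    p.src1_row_contiguous = true;
    return p;
}

inline void direct_a_policy_example() {
    DirectAParams p = make_example_params();
    assert(direct_a_should_engage(p));   // large-M prefill engages
    p.n_tokens = 8;                      // small-M / greedy decode
    assert(!direct_a_should_engage(p));  // bails to the default path
}
```

Keeping the predicate free of device state is what makes the small-M bail testable without a GPU: the second assertion mirrors the "small-M/greedy bails to the byte-identical default" behaviour reported in the correctness gates.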
+ +### P4: token-granular continuous-batching scheduler (server-side only) + +- **Goal:** one per-step token budget mixing chunked prefill + all ready decodes, with + per-seq chunked-prefill cursors, cheap recoverable preemption, and adaptive bucketed + decode emission. On GB10 this is a **TTFT + architecture-enabler** lever, **not** a + throughput lever (the prior host-loop-dead measurement is real and must be respected); + its throughput payoff is on non-GB10 silicon where decode goes host-bound again. +- **Mechanism (Audit C Area 3, Audit B section 1):** extend the shipped continuous-batch + P1 (patch 0016, `server-context.cpp:3083-3135`, the dynamic decode-first prefill + budget: `LLAMA_MAX_BATCH_TOKENS` read at `:3105`, `prefill_budget_step = + max(n_ubatch, T - n_decode_in_batch)` at `:3113`) into: (1) chunked prefill as a + first-class per-sequence cursor (each waiting prompt contributes + `min(remaining_prompt, per_slot_cap)` tokens per step and resumes next step); + (2) a `SLOT_STATE_PREEMPTED` state + release-KV-keep-prompt-tokens-re-admit transition + (the paged KV manager already supports on-demand block alloc + burst-reclaim, patch + 0024; defrag in `paged-alloc.cpp`); (3) adaptive bucketed decode widths matched to + live load (never fixed pad-to-parallel: `DECODE_SERVING_SCOPE.md` proved padding + net-negative on GB10 since decode is GPU-compute-bound). Zero ggml; llama-server owns + batch formation. +- **Files:** `server-context.cpp` (5 patches), `paged-alloc.cpp` + `paged-kv-manager.cpp` + (3 each), new pure helpers in a `server-admission-policy.h`-style unit-tested header. + ~600-1000 LOC. +- **Env gate:** new default-off env (e.g. `LLAMA_CONTINUOUS_BATCH_V2=1`). +- **Correctness gate:** **md5 bit-exact** (per-seq logits depend only on that seq's + tokens + its own paged KV; the S3 note already establishes this). This is the one + phase that stays on the sacred md5 gate rather than KL.
+- **P0 kill-gate:** implement the per-seq chunked-prefill cursor + adaptive bucketing; + A/B TTFT and serving-aggregate at concurrency 8/32/128 server-side. **GO** if TTFT + under load drops > 20% with the md5 gate green AND serving-aggregate not regressed. + Throughput-neutral on GB10 is acceptable (the gate is TTFT, per prior evidence). + **NO-GO** if TTFT is flat or md5 breaks. +- **Expected recovery:** part of the ~17 pt serving graph-reuse overhead on GB10 + (conservative ~10 pt combined with S3), plus the TTFT axis (the `2377 -> 13533 ms` + TTFT scaling is scheduler-shaped; vLLM's ~3.4x better TTFT is the target). It is also + the **enabling substrate** for P2/P3 (a persistent per-seq scheduling context is the + prereq the Marlin retry's persistent tiling wants). +- **Effort:** high (largest new server-side piece, but mechanical and bit-exact-safe). +- **Supersedes:** nothing was rejected here; but it explicitly does **not** re-litigate + the S3 fixed-padding result (net-negative on GB10). **Value framing:** TTFT + fairness + + non-GB10 throughput + enabler; the GB10 throughput claim is deferred by design. +- **Upstream-clash / rebase-safety:** safest area. `tools/server/server-context.cpp` is + a fork-owned tool, not ggml core; upstream churns it less and conflicts are mechanical. 
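
The two formulas quoted in the P4 Mechanism bullet (`prefill_budget_step = max(n_ubatch, T - n_decode_in_batch)` and a per-prompt contribution of `min(remaining_prompt, per_slot_cap)`) can be sketched as pure host-side helpers. This is a minimal illustration under stated assumptions, not the code in `server-context.cpp`: the `PrefillCursor` struct and `plan_prefill_step` are hypothetical names, and clamping each chunk to the remaining step budget is an assumption added here so the step never exceeds its budget.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Decode-first prefill budget, as quoted in the Mechanism bullet:
// decodes are admitted first, prefill gets what is left, never less than n_ubatch.
inline int32_t prefill_budget_step(int32_t n_ubatch, int32_t T, int32_t n_decode_in_batch) {
    return std::max(n_ubatch, T - n_decode_in_batch);
}

// Hypothetical per-sequence chunked-prefill cursor.
struct PrefillCursor {
    int32_t n_prompt;  // total prompt tokens for this sequence
    int32_t n_done;    // prompt tokens already prefilled in earlier steps
};

// Plans one step: each waiting prompt contributes min(remaining_prompt, per_slot_cap),
// additionally clamped to the remaining budget (ASSUMPTION, not from the source).
// Returns the tokens taken per cursor and advances the cursors in place.
inline std::vector<int32_t> plan_prefill_step(std::vector<PrefillCursor> & cursors,
                                              int32_t budget, int32_t per_slot_cap) {
    std::vector<int32_t> taken(cursors.size(), 0);
    for (std::size_t i = 0; i < cursors.size() && budget > 0; ++i) {
        const int32_t remaining = cursors[i].n_prompt - cursors[i].n_done;
        const int32_t chunk     = std::min({remaining, per_slot_cap, budget});
        taken[i]           = chunk;
        cursors[i].n_done += chunk;  // the prompt resumes from here next step
        budget            -= chunk;
    }
    return taken;
}

inline void prefill_plan_example() {
    // T=2048 with 32 decodes in the batch leaves 2016 prefill tokens.
    assert(prefill_budget_step(512, 2048, 32) == 2016);
    std::vector<PrefillCursor> cursors = {{523, 0}, {100, 0}};
    const std::vector<int32_t> taken = plan_prefill_step(cursors, 300, 144);
    assert(taken[0] == 144 && taken[1] == 100);  // long prompt is chunked, short one finishes
}
```

Because the helpers touch no ggml state, they fit the "pure helpers in a unit-tested header" shape that the Files bullet calls for.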
+ +#### P4 RESULT (NO-GO at the P0 perf kill-gate, recorded 2026-07-02, `LLAMA_CONTINUOUS_BATCH_V2`, default-off) + +The CBv2 P0 kill-gate subset (per-seq chunked-prefill cursors + adaptive decode +bucketing) was **implemented and correctness-proven green**, but the P0 kill-gate's +stated GO criterion - a **> 20% TTFT-under-load drop** with md5 green and +serving-aggregate not regressed - was **NOT demonstrated**, so per the phased +contract `go=false` was the kill-gate default, **nothing was built beyond P0** +(no `SLOT_STATE_PREEMPTED`, no aging/starvation-freedom), and **nothing landed.** +The topic branch `p4-cbv2` is retained on the DGX fork at +`ebb649335fe7686524a3630ee2fdffce44be6d52` (base `localai-paged` `653bb2f3d`, NOT +pushed); the fork `localai-paged` HEAD is **untouched at `653bb2f3d`** and the +LocalAI series stays at 46 patches (`0001-0055`). **This is the scope-anticipated +outcome:** the P4 section frames CBv2 on GB10 as a TTFT + fairness + architecture- +enabler lever, **not** a throughput lever (decode is GPU-compute-bound; the +host-loop-dead measurement is real), so a NO-GO on the TTFT perf gate is the +expected result and any throughput payoff lives on non-GB10 silicon (out of scope). + +- **FINAL MEASURED VERDICT (the A/B completed autonomously after the forced report; + full 60/60 raws, 5 reps per arm per shape; + `dgx:~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md`): NO-GO CONFIRMED BY + MEASUREMENT, and stronger than flat: CBv2-at-this-granularity REGRESSES.** + TTFT-GO shapes: NONE. Measured deltas (candidate vs control medians; "clears" = + beyond max(2%, 3 sigma)): + - staggered N=32: TTFT p50 **+33.6% WORSE** (4559.3 -> 6091.3 ms, clears), mean + +31.4% worse (clears), p95 +14.3% worse (clears); agg/decode -3.3/-3.4% + (inside a very noisy ~21% gate). + - staggered N=128: TTFT p50 +15.5% / mean +17.9% / p95 +12.1% worse (all clear); + **aggregate -6.9% and decode-agg -6.9% REGRESSED beyond noise** (0.4% sd). 
+ - burst N=128: TTFT p50 +13.5% / mean +10.5% worse (clear); agg -3.9% (clears). + - staggered N=8 and burst N=8: neutral. burst N=32: decode-agg +36.3% (barely + clears a 35.2% noise gate; high-variance shape; the one positive signal: + fair-share keeps decodes flowing through a prefill wave). +- **WHY (analysis, recorded so it is not re-litigated):** fair-share chunked + prefill is processor-sharing; for a near-uniform prompt population it delays + every prompt's prefill completion versus run-to-completion admission + (round-robin maximizes mean completion time for identical jobs), so TTFT rises + by construction, and at N=128 the extra interleave overhead also costs + throughput. The premise that the TTFT scaling curve was "scheduler-shaped" is + hereby PARTIALLY REFUTED for GB10: the shipped decode-first budget (patch 0016) + already captures the schedulable win, and vLLM's TTFT advantage on this hardware + is dominated by its 2.6-2.8x prefill compute (buckets 1-2), not batch formation. + TTFT parity therefore routes through P3/P5 (prefill compute), not the scheduler. + Chunked-prefill fair-share may still pay on mixed long/short-prompt workloads + and on non-GB10 (host-bound) silicon; both are out of scope here. +- **CORRECTNESS GATES ALL GREEN (DGX GB10, arch sm_121a), the substantive P0 + result.** Behind `LLAMA_CONTINUOUS_BATCH_V2=1` (default OFF, byte-identical off): + - **(a) canonical md5 GREEN both models, default-off AND cbv2-on:** paged-MoE + `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`. + - **(c) `test-backend-ops` GREEN (zero-ggml side-effect proof):** MUL_MAT + 1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46. + - **(c) CURSOR INTERLEAVE PROVEN** (`LLAMA_CBV2_TRACE`, staggered N=20): steps + carry decode AND prefill tokens in the SAME batch with per-slot cursors + advancing across steps, not slot-exclusive. 
Verbatim step=6: `n_decode_toks=5 + n_prefill_toks=1535 n_seqs=20` with 15 partial cursors; slot s112 advances + 144/523 -> 281 -> 418 -> 519 over steps 6-9 while decode runs; adaptive + fair-share cap tracks live load (410@5waiting, 171@12, 137@15, 291@7, 508@4); + `dbucket==n_decode` confirms **no fixed pad-to-parallel** (per + `DECODE_SERVING_SCOPE.md` net-negative-on-GB10). + - **(b) SERVER DETERMINISM = CBv2 is NEUTRAL / correctness-preserving.** The + literal exact-reproducibility gate is unsatisfiable by ANY scheduler here: the + paged CONCURRENT greedy path is inherently non-deterministic run-to-run in the + BASELINE too (the control default scheduler diverges from itself), a pre-existing + benign near-tied-argmax / co-batch FP-reduction-order property + (`PAGED_BITEXACT_NOTE`), on both dense and MoE. The discriminating test - does + CBv2 diverge from control MORE than control diverges from itself - **PASSES**: + across 8 configs {dense,moe} x {degenerate,natural} x {gen8,gen64}, per-request + cross-arm divergence tracks the within-arm run-to-run baseline to +/-1-3 of 32 + (small-count noise; e.g. MoE-natural gen64 base 31/32 worst-cross 31/32; + dense-degenerate base 14 cross 12-17). Single-sequence greedy is fully + deterministic (the md5 gate above). +- **Implementation (kill-gate subset only; correct, committed on `p4-cbv2`, NOT + pushed; server-side only, ZERO `ggml/` files, ~68 LOC in `server-context.cpp` + + a new unit-tested header).** (1) Per-seq chunked-prefill cursors with a + **load-adaptive fair-share cap** = `ceil(prefill_leftover / n_waiting)` floored at + `LLAMA_CBV2_CHUNK_MIN` (default 128, deliberately NOT `n_ubatch` so a 512-token + prompt actually chunks under load); CBv2 activates the shipped 0016 decode-first + budget by default (`T=n_batch`, no `LLAMA_MAX_BATCH_TOKENS` needed) and replaces + 0016's fixed cap with this fair-share cap; cursor = `slot.prompt.n_tokens()` + advancing across steps. 
(2) Adaptive decode bucket policy (`LLAMA_CBV2_DECODE_PAD` + default 0 => `bucket==n_decode`, no padding; policy computed+traced only, never + fed to batch formation, so bit-exact-safe; row-emission for host-bound silicon is + the deferred [Build phase]). Pure math lives in the NEW unit-tested header + `tools/server/server-admission-policy.h` (namespace `cbv2`) + + `server-admission-policy-test.cpp` (host-side unit tests ALL PASS local + DGX); + `server-context.cpp` is the thin integration; step trace under `LLAMA_CBV2_TRACE=1`. +- **Honest delta vs expectation.** Kill-gate GO required TTFT-under-load to drop + `> 20%`; **delivered: not demonstrated at the forced report** (perf A/B force-terminated control-only), and the completed A/B in the final measured verdict above then showed TTFT regressing rather than dropping. + The correctness substrate (bit-exact md5, proven decode+prefill co-batching with + per-seq cursors, determinism-neutrality) is real and is the enabler the scope + values, but the perf axis that gates the phase was never measured to GO. +- **WHAT WOULD CHANGE THE VERDICT (re-score path, written at the forced report before the CANDIDATE arm completed).** Read the finalized DGX + `~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md` once the CANDIDATE arm completes + (the perf driver `p4_agg.py` auto-writes medians+stdev deltas with the + `> 20%`-TTFT-drop GO logic baked in). **IF** it shows a genuine `> 20%` + staggered-TTFT drop clearing `max(2%, 3*stdev)` with md5 green and aggregate not + regressed, re-score `go=true` and trigger the **full P4 build-out**: + `SLOT_STATE_PREEMPTED` + release-KV-keep-prompt-tokens re-admit (reusing the paged + burst-reclaim patch 0024 + `paged-alloc.cpp` defrag), aging/starvation-freedom with + a constructed starvation test, preemption-transition + aging unit tests, and a + forced-preemption byte-identical-resume determinism gate. **ELSE** (the + scope-expected case, and what the completed A/B measured) this NO-GO stands and P4 is deferred as a GB10 TTFT/fairness/ + enabler lever whose throughput payoff is non-GB10.
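
The load-adaptive fair-share cap described in the Implementation bullet (`ceil(prefill_leftover / n_waiting)` floored at `LLAMA_CBV2_CHUNK_MIN`, default 128) is small enough to sketch directly. The function name `fair_share_cap` and the handling of `n_waiting <= 0` are assumptions made for illustration. The real helper lives in `tools/server/server-admission-policy.h` on `p4-cbv2` and may differ. The input values below are illustrative and are not taken from the recorded trace.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Per-slot prefill cap: split the leftover prefill budget evenly across the
// waiting prompts (rounding up), but never go below chunk_min tokens.
inline int32_t fair_share_cap(int32_t prefill_leftover, int32_t n_waiting,
                              int32_t chunk_min = 128) {
    if (n_waiting <= 0) {
        return 0;  // ASSUMPTION: nothing is waiting, so there is nothing to cap
    }
    const int32_t share = (prefill_leftover + n_waiting - 1) / n_waiting;  // integer ceil
    return std::max(share, chunk_min);
}

inline void fair_share_cap_example() {
    assert(fair_share_cap(2048, 5)  == 410);  // light load: large chunks
    assert(fair_share_cap(2048, 12) == 171);  // heavier load: smaller chunks
    assert(fair_share_cap(1000, 20) == 128);  // ceil gives 50, floored at chunk_min
}
```

The floor is what lets a 512-token prompt still chunk under load, as the bullet notes: with a floor of `n_ubatch` instead of 128, a prompt shorter than the micro-batch would never be split.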
+- **Series-numbering flag (for whoever lands a future GO).** The P0 code comments + label `[paged 0056]` per the pinned fork's next slot (46 patches), but the LocalAI + worktree README is already ahead at `0056-0061` (the MoE MMQ trace series) - + reconcile the actual series number on landing (likely `0062`). +- **Artifacts (DGX `~/bench/p4_cbv2/`):** `build_20260702_192141/` (build.log); + `gates_20260702_192632/` (SUMMARY.txt: md5 x4, test-backend-ops, cbv2_trace.txt, + determinism tsvs); `det2_20260702_193123/` + `det3_20260702_193649/` + + `det4_20260702_194040/` (determinism diff-matrix: degenerate / natural / gen8); + `perf_20260702_194359/` (raw_*.json + auto-written RESULTS.md). Environment: + `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`, `LLAMA_MAX_BATCH_TOKENS` unset, + sm_121a, GPU lock held. Code on `p4-cbv2` `ebb649335`: + `tools/server/server-admission-policy.h`, `server-admission-policy-test.cpp`, + `server-context.cpp` (+68). + +### P5: FLA-faithful GDN prefill scan (blocked solve_tril port; the algorithm never actually tested in-backend) + +- **Goal:** replace the hand f32 chunked scan (`gdn_core`, 95.7 us/tok, 2.62x vLLM) with + vLLM's FLA six-kernel chunk-64 pipeline whose triangular solve is **blocked into + tensor-core matmuls**. Targets prefill bucket 1 (+59.2, 30% of the gap) - the largest + single bucket. 
+- **Mechanism (Audit B section 6):** port the FLA `chunk_gated_delta_rule_fwd` pipeline: + (1) `chunk_local_cumsum`, (2) `chunk_scaled_dot_kkt` (fp32 A), (3) **`solve_tril` + blocked inverse** (`merge_16x16_to_64x64_inverse`: invert 16x16 diagonal blocks with a + ~14-iteration register-resident loop, fill off-diagonal blocks with block-inverse + identity via `tl.dot` tensor-core matmuls, dropping the serial dependency length from + ~64 to ~14), (4) `recompute_w_u` (tl.dot), (5) `chunk_gated_delta_rule_fwd_h` + inter-chunk recurrence (register-resident fp32 state, chunk loop *inside* the kernel, + heads/dim-blocks parallel across the grid), (6) `chunk_fwd_o`. fp32 accumulate, + bf16 streamed operands. +- **Files:** new `gdn-blocked-solve.cu` / additions to `gated_delta_net.cu` (6 patches). +- **Env gate:** new default-off env (e.g. `LLAMA_GDN_FLA_CHUNK=1`). +- **Correctness gate:** **KL band** (fp32-accumulate but different algorithm order). +- **P0 kill-gate (gated hardest):** port the six-kernel pipeline and A/B `gdn_core` + prefill at npp512 and npp2048. **GO ONLY IF** the in-pipeline blocked solve_tril beats + the current f32 chunked scan by > 10% at npp2048 AND fits under the 99 KB smem cap AND + the KL band holds. **NO-GO** if it reproduces Phase74's standalone 0.59x (explicit + inverse slower than direct solve) - which is the **expected null** given the prior + standalone evidence, so this phase must clear the highest bar. +- **Expected recovery:** speculative. This bucket is partly a **shared-hardware floor** + (99 KB smem forces C=16; Phase74 found the blocked inverse GB10-hostile). Conservative + expected recovery is **small (~0-10 of the +59.2)**: the difference from Phase74 is + that P5 tests the *whole FLA pipeline in-backend* (register-resident state, chunk loop + in-kernel), which was never actually run in-backend - the prior bf16-C64 lever kept + our O(C^2) form-T solve, and the blocked solve was only ever benched standalone. 
If + the in-pipeline register-resident form behaves differently from the standalone bench, + upside is up to 59 us/tok (the single largest lever); if not, P5 is confirmed a + shared-hardware floor and recorded as such. +- **Effort:** high, high-risk. +- **Supersedes:** bf16-C64 (-18.75%) and the Phase74 standalone blocked-solve (0.59x). + **Missing prereq / difference:** neither prior test ran the full FLA chunk pipeline + in-backend with the register-resident inter-chunk scan; P5 does. This is the one lever + with a prior standalone negative, so it is ranked after the high-confidence phases and + its kill-gate is the strictest. +- **Upstream-clash / rebase-safety:** `gated_delta_net.cu` is a high-churn fork file + (6 patches) and upstream may add its own GDN paths; keep the new pipeline in a + separate `.cu` and gate the dispatch narrowly. + +#### P5 RESULT (NO-GO at the P0 perf kill-gate, recorded 2026-07-02, `LLAMA_GDN_FLA_CHUNK`, default-off) - the GDN prefill bucket is now a CONFIRMED SHARED-HARDWARE FLOOR + +The full six-kernel vLLM-FLA `chunk_gated_delta_rule_fwd` pipeline was **ported to +CUDA tf32 mma, per-kernel validated against a host fp64 reference, integrated behind +`LLAMA_GDN_FLA_CHUNK=1` (default-off), and A/B'd in-backend** against the shipped M5 +f32 chunked scan. It **lost decisively** and by the wrong sign, so `go=false` was the +kill-gate default, **nothing was built beyond P0, and nothing landed.** This is the +**scope-anticipated "expected null"** (the P5 section framed this as the program's +strictest kill-gate given Phase74's standalone blocked-inverse 0.59x), but the phase +delivered the one thing the prior evidence lacked: **the whole FLA pipeline run +in-backend with the register/smem-resident inter-chunk state and the chunk loop +in-kernel** - the exact form that "was never actually tested in-backend." 
It was tested +here, and the result **settles the GDN prefill bucket (bucket 1, +59.2, the single +largest prefill lever) as a shared-hardware / memory-bandwidth floor on GB10.** + +- **PERF GO GATE FAILED DECISIVELY (the decisive result).** GO required the in-pipeline + blocked `solve_tril` to beat the M5 f32 chunked scan by **> 10% at npp2048**. + Measured (nsys `--cuda-graph-trace=node`, MoE `q36-35b-a3b-nvfp4`, per distinct token + over the 30 GDN layers): **npp2048 M5 56.31 vs FLA 119.46 us/tok = FLA 2.12x SLOWER** + (`gdn_delta_pct_2048 = -112.1`); **npp512 M5 51.23 vs FLA 117.35 = 2.29x slower**. + End-to-end **S_PP regressed MoE -13.33% @npp2048 / -13.12% @npp512** (3-rep medians; + clears `max(2%, 3 sigma)` by a wide margin, and it is the wrong sign, so there is no + 3-sigma question). The shipped M5 remains `gdn_core` at **56.31 us/tok = 64.82% of + vLLM's FLA chunk-64 36.5 us/tok on this GB10**; the rejected FLA port was only **30.55% + of vLLM** (36.5/119.46) - a regression, not a recovery. This reproduces Phase74's + standalone blocked-inverse 0.59x and extends bf16-C64 (-18.75%), now **confirmed + in-backend** with the register-resident state + in-kernel chunk loop. +- **WHERE THE TIME WENT (the novel, valuable decomposition - the reason this NO-GO + matters beyond a rejection).** Per-kernel nsys share of the FLA bucket: the **blocked + `solve_tril` is only ~2.8% (55.6 ms)** - the algorithm the whole phase was about is + *cheap*. The bucket is dominated by **`chunk_gated_delta_rule_fwd_h` 46.2% (903 ms) + + `chunk_fwd_o` 31.5% (617 ms)**: the inter-chunk state-recurrence GEMMs plus the + **per-chunk h-state materialization to global LPDDR5x** that FLA's split-kernel + structure forces (`fwd_h` writes `h_pre` per chunk, `fwd_o` re-reads it). 
The fused M5 + single kernel keeps the 128x128 state **resident in smem and never materializes + per-chunk h**, so it is **2.1x faster on GB10's low-bandwidth memory.** So the novel + finding vs all prior evidence: **the blocked solve itself is not the floor - the floor + is the state-GEMM + h-materialization region, which the FLA structure makes WORSE than + M5, not better.** This is exactly the "materialize-everything tax" the scope warns of. + The binding silicon property is **memory bandwidth** (per-chunk h round-trips to + LPDDR5x), compounded by the **99 KB smem cap** that forces the FLA split (`fwd_h` and + `fwd_o` cannot co-reside), not the mma shapes or wave count. +- **SMEM GATE PASSES (all six kernels under the 99 KB opt-in cap at C=64; + `cudaOccupancyMaxActiveBlocksPerMultiprocessor`):** `k_kkt` 48 KB / 2 blk, `k_solve` + 38 KB / 2 blk, `k_wu` 48 KB / 2 blk, `k_fwdh` 80 KB / 1 blk, `k_fwdo` 96 KB / 1 blk - + **max 96 KB < 99 KB.** The kernels fit; they are simply bandwidth-floored above M5. +- **KL BAND GREEN / IN-BAND (model numerics sound):** FLA `KLD 0.137028` vs control + `0.136563` = **delta +0.000465 < 0.01**; same-top-p **84.61% vs 83.73%** control + (>= 84% baseline; FLA marginally better). Per-kernel bring-up validation vs host fp64 + on synthetic shapes: **o NMSE 2.2e-7, final-state 1.2e-7** (done BEFORE integration, + per the "do not debug six kernels blind" rule). +- **DEFAULT PATH UNTOUCHED (canonical md5 GREEN with the code present):** paged-MoE + `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`, both + **default-off AND `LLAMA_GDN_FLA_CHUNK`-on** (the small-M greedy path bails to M5). + `test-backend-ops GATED_DELTA_NET` **DEFAULT 46/46 OK.** Decode untouched + (`GDN_CHUNK_MIN` untouched; decode stays on the sequential recurrence). 
+- **`test-backend-ops` env-on = 43-44/46 (`gdn_op_tests_env_on_green=false`; explicit + tolerance judgment).** The FLA-engaged `head_size=128, n_seq_tokens>=64` cases + marginally exceed the test's `1e-7` threshold (**ERR 1.03-1.06e-7**, fluctuating + across the boundary run-to-run) because this port uses **plain tf32** where the shipped + M5 uses **3xtf32 (CUTLASS fp32-emulation)** for the decay-coupled compounding state + products; M5-chunked (`LLAMA_KV_PAGED=1`, no FLA) passes the SAME cases at `< 1e-7`. + Judgment: a marginal tf32-vs-3xtf32 accuracy gap, **benign at the model level (KL + green)**; tightening the port to 3xtf32 would only add mma count and **deepen** the + perf NO-GO, so it was not pursued. +- **Engagement PROVEN:** `LLAMA_GDN_FLA_TRACE` fired `[gdn-fla] engage H=32 n_seqs=N + n_tokens=128 NT=2` in `batched-bench`; nsys shows all six `gdn_fla::` kernels executing + under `LLAMA_GDN_FLA_CHUNK=1` and none under default. Protocols honored: GPU lock held + throughout and released; `LLAMA_MAX_BATCH_TOKENS` unset; sm_121a; nsys + `--cuda-graph-trace=node`; 3+ iter S_PP medians; no external contention. +- **Provenance.** WIP on the DGX fork topic branch `p5-fla-gdn` at + `2d64c37f08ad323038a44a89ab32189527c6ba29` (base `localai-paged` `653bb2f3d`, **NOT + pushed, NOT landed**): new `ggml/src/ggml-cuda/gdn-blocked-solve.cu` + narrow dispatch + in `gated_delta_net.cu` / `gated_delta_net.cuh`. Fork `localai-paged` HEAD **untouched + at `653bb2f3d`**; the LocalAI series **stays at 46 patches (`0001-0055`)**; topic + branches `p1-bf16-stream` / `p2-moe-region` / `p4-cbv2` left intact. Artifacts on the + DGX `~/bench/p5_fla_gdn/`: `killgate_20260702_204225/` (RESULTS.md, spp_control.txt, + spp_fla.txt, `nsys_{ctrl,fla}{2048,512}.{nsys-rep,kern.csv}`, GATES.txt, + `kl_moe_{ctrl,fla}.log`, occupancy.txt, gdn-blocked-solve.cu, p5_fla_test.cu) and + `standalone_20260702_203434/` (RESULTS.txt + p5_fla_test.cu, p5_m5_time.cu, + m5_kernel_body.cuh). 
+- **Honest delta vs the +59.2 expectation.** The scope's conservative expected recovery + was **~0-10 of the +59.2, "likely a shared-hardware floor."** Delivered: **0 recovery, + a -63 us/tok regression on the FLA arm**; the floor is **confirmed**. The shipped M5 + fused smem-resident chunked scan (56.31 us/tok) is the winner and is **at or near the + GB10 memory-bandwidth floor for this op.** This closes the last speculative prefill + lever in the program. What binds is silicon (LPDDR5x bandwidth on the per-chunk h + round-trip + the 99 KB smem cap forcing the split), not the algorithm; it lifts only on + datacenter Blackwell (HBM + larger smem + TMEM), consistent with section 4's framing. + +### P6: FP8 KV cache + smaller dtype/bandwidth items + +- **Goal:** halve decode-time KV cache traffic (K/V stored fp8-e4m3 with a scale) and + pick up remaining small dtype/bandwidth wins (FP8 projections where accuracy allows, + matching vLLM's bf16-proj +13.7 bucket). +- **Mechanism (Audit B section 3):** fp8-e4m3 KV with per-tensor (or per-head) scales, + loaded/calibrated (not dynamic-per-step); optional FP8 projections at the linear + boundary keeping the residual stream bf16. +- **Files:** KV cache dtype path in `llama-kv-cache.cpp` (7 patches) + `paged-attn.cpp` + (5 patches); FP8 proj in the fork GEMM files. +- **Env gate:** new default-off env (e.g. `LLAMA_KV_FP8=1`). +- **Correctness gate:** **KL band** (fp8 KV changes attention numerics; nearly free in + accuracy per vLLM). Precision is **per-path**: validate paged vs non-paged separately. +- **P0 kill-gate:** enable fp8 KV; A/B decode t/s + KLD at N >= 128. **GO** if decode + t/s + >3% with KLD in band. **NO-GO** if KLD out of band or throughput flat. +- **Expected recovery:** decode bandwidth on the KV read; part of bucket-4 bf16-proj + (+13.7 prefill) via FP8 projections. +- **Effort:** medium. +- **Supersedes:** nothing rejected; additive bandwidth item. 
+- **Upstream-clash / rebase-safety:** `llama-kv-cache.cpp` is high-churn (7 patches); + keep the fp8 path additive and gate the dtype selection narrowly. + +#### P6 RESULT (NO-GO at the measured Stage-0b proxy, recorded 2026-07-02, `LLAMA_KV_FP8` never built) - fp8/quant KV is a decode-THROUGHPUT NO-GO on GB10 hybrid-GDN; the measured decode ceiling + the Q8_0 A/B proxy are the load-bearing artifacts + +Sixth and final phase of the additive program, and the **retry that unblocked** the prior +BLOCKED-ON-INFRA attempt. The DGX/GB10 (`ssh dgx.casa`, host `promaxgb10-4ad8`) was +reachable for the whole window, so **Stage 0a** (the measured nsys +`--cuda-graph-trace=node` decode ceiling) **ran**, and the decisive **Stage 0b** question +was answered by a **zero-code Q8_0-KV A/B proxy** (existing `-ctk/-ctv q8_0`) instead of +building the e4m3 kernel. **Verdict: NO-GO for the throughput lever; nothing was built +beyond the unmodified measurement worktree.** Per the methodology rule (measure the +cheapest disproof first), Q8_0 KV is the *favorable* quant path - it wins on the integer +DP4A fattn-vec dot that e4m3 cannot use - so a flat/negative Q8_0 A/B at the highest-ceiling +shape is a definitive disproof for e4m3 too, and the e4m3 build was correctly not funded. +`go=false` at the Stage-0b perf gate; `stopped_at_ceiling=false` because the *measured* +ceiling does NOT kill the lever (it survives at long context) - **the null does.** The fork +`localai-paged` HEAD is **untouched at `653bb2f3d`**; the topic branch `p6-fp8-kv` (base +`653bb2f3d`, the byte-identical measurement worktree) is **retained on the DGX, NOT pushed**; +the LocalAI series stays at **46 patches (`0001-0055`)**. This is a scope-anticipated +outcome: lever-map B2 flagged fp8-KV as "gain medium-high for long-context/high-concurrency, +watch long-context recall," and the measurement confirms the *ceiling* is real at long ctx +but is **not realizable** on the fa/paged-attn path. 
+ +- **STAGE 0a: THE MEASURED DECODE CEILING (durable artifact; supersedes the prior + analytical estimates).** Method: the v1 difference-of-run-totals estimator was + noise-dominated (each run is dominated by a ~29 s prefill whose run-to-run variance swamps + the 48-step decode delta -> `NEG-DIFF`/`INDETERMINATE`). The v2 estimator + (`p6_ceiling_v2.py`) isolates decode **per-kernel**: for every kernel it compares instance + **count** and total time between the `ntg16` and `ntg64` runs and keeps only kernels whose + count **grows** with `ntg` (decode kernels); fixed-count prefill kernels are excluded + entirely, so their variance never enters. Cross-check: the reconstructed GPU-steady decode + step matches the batched-bench wall `t_tg`/iter to within 0.3% (e.g. dense ctx8192: + 116 297 us GPU-step vs 115 969 us wall), validating the isolation. fp8-e4m3 halves the KV + bytes, so the **theoretical-MAX decode saving = 0.5 x fa_KV-read_share** (perfect BW + halving, zero dequant cost). Both models, paged (`LLAMA_KV_PAGED=1`), sm_121a: + + | shape (per-seq ctx x npl) | GPU decode step (us/iter) | flash-attn (us) | fa% of step | fp8-KV ceiling, fa-only | fp8-KV ceiling, fa+gather | + |---|---:|---:|---:|---:|---:| + | moe std ctx512 x128 | 168 397 | 7 108 | 4.2% | **+2.16%** | +3.27% | + | dense std ctx512 x128 | 354 892 | 23 628 | 6.7% | **+3.44%** | +4.11% | + | moe ctx4096 x8 | 39 945 | 2 999 | 7.5% | **+3.90%** | +5.74% | + | dense ctx4096 x8 | 106 672 | 9 767 | 9.2% | **+4.80%** | +5.66% | + | moe ctx8192 x8 | 43 354 | 5 786 | 13.3% | **+7.15%** | +10.28% | + | dense ctx8192 x8 | 116 297 | 18 836 | 16.2% | **+8.81%** | +10.48% | + + The fa-only column is the honest ceiling (the paged block-table gather is index math, not + KV bytes fp8 halves); fa+gather is a looser upper bound. **Best ceiling +8.81%** (dense, + ctx8192). 
Long context is the only regime where the ceiling clears the +3% GO bar on both models; the + standard `npl128` serving shapes still reach +2.2% (MoE, below the bar) / +3.4% (dense, marginally above it) fa-only, because 128 concurrent + sequences aggregate ~74 k KV tokens even at 512 per-seq ctx. + +- **THE ANALYTICAL PRIOR IS PARTIALLY REFUTED BY MEASUREMENT (why we measured).** The + pre-run estimate (from `VLLM_PARITY_FINAL.md` 2b, a single-stream ctx256 decomposition) + put standard shapes at a 0.65% hard-NO and ctx8192 at +17.34%. The measurement disagrees + in **both** directions: standard *serving* (npl128) is higher than 0.65% (fa share is 4-7%, + not 1.3%, once concurrency aggregates KV), and long-ctx npl8 is *lower* than the estimate + (ctx8192 fa-only +8.81%, not +17.34%) because at npl8 the non-fa decode work per token is + larger (GEMM is un-amortized), diluting fa's share. This is exactly why rule #5 + (measure-don't-assume) is in force: the analytical ceiling was wrong by ~2x at both ends. + +- **STAGE 0b: THE MEASURED Q8_0-KV A/B PROXY (the decisive kill).** At the two + highest-ceiling shapes (ctx8192 x npl8, both models), 5 reps/arm, paged decode `t_tg`, + gate = clear `max(2%, 3 sigma)` (`sigma` 0.08-0.22% same-binary): + + | shape | f16-KV decode `t_tg` (median) | Q8_0-KV decode `t_tg` (median) | decode-throughput delta | vs the +7-8.8% ceiling | + |---|---:|---:|---:|---| + | dense ctx8192 x8 | 7.305 s | 7.280 s | **+0.37%** (marginal, ~flat) | captures ~4% of the +8.81% ceiling | + | moe ctx8192 x8 | 2.740 s | 2.814 s | **-2.63% REGRESSION** | the null repeats | + + So even Q8_0 - the quant path with the **favorable** integer DP4A dot - realizes + essentially **none** of the measured +7-8.8% ceiling on dense (flat +0.37%) and + **regresses -2.63%** on MoE. The dequant-in-attention cost eats the KV-read BW saving, + exactly as the historical **Q8_0 = +7.8% decode regression** (2026-06-23, dense-32B + all-attention era) predicted, now re-confirmed on hybrid-GDN at the most favorable shape.
+ +- **WHY e4m3 IS STRICTLY WORSE THAN Q8_0 (the structural kill; no e4m3 build needed).** + Reading the ggml `fattn-vec` kernels: the fast quant-KV path (`vec_dot_fattn_vec_KQ_q8_0`) + wins via an **integer DP4A dot** (int8 x int8). An e4m3 KQ path **cannot** use DP4A - it + must dequant e4m3 -> float then do a float dot, which is strictly **more** expensive than + Q8_0's integer dot. e4m3's only theoretical edge (cheaper hw-convert dequant on the value + read) does not touch the KQ product, which is where Q8_0 already lands flat/negative. + Therefore e4m3 KV is architecturally disadvantaged relative to the already-null Q8_0, and + the measured Q8_0 A/B is a **definitive** disproof for e4m3 on this path. Building the + e4m3 kernel to re-confirm a stronger negative was correctly not funded. + +- **HYBRID-GDN STRUCTURAL CAP (why the ceiling is bounded at all).** q36 is hybrid GDN: + **only 10 of 40 layers are full attention with a KV cache**; the other **30 are GDN** with + a fixed-size recurrent state and **no KV** (state does not grow with context). fp8 can only + touch the 10/40 KV slice - it cannot move the 30 GDN layers at all - which is why flash-attn + is a small decode fraction even at ctx8192 and the ceiling tops out at +8.81%. + +- **CAPACITY-PLAY FRAMING (this remains OPEN).** As a **throughput** lever fp8/quant KV is a + measured NO-GO. As a **memory/capacity** feature it is a different, un-run gate: storing the + 10/40 attention layers' KV as e4m3 (8-bit) instead of f16 (16-bit) halves those layers' KV + footprint - a real long-context / high-concurrency **capacity** win (more sequences or + longer contexts per fixed VRAM) independent of any t/s delta. That gate is **footprint, not + throughput**, and was not P6's kill-gate. Note the Q8_0 proxy already demonstrates the + footprint path is *functional* on the paged binary today (`-ctk/-ctv q8_0` runs correctly, + n_kv fills as expected) at a small/zero decode cost on dense. 
**fp8-KV as a capacity feature + stays open for a future capacity-motivated effort even though it is throughput-flat.** + +- **DEFAULT PATH: MEASURED GREEN (not merely provable-by-zero-diff).** The P6 worktree is + byte-identical to `653bb2f3d` (0 dirty files), and the canonical greedy-md5 gate was + **re-run this session** on that binary and passed both models, paged: MoE + `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`. No P6 code + exists, so there is provably zero overlap with P3's `w4a16*`/`mmq*` files. + +- **Provenance.** Fork `localai-paged` HEAD **untouched at `653bb2f3d`** (verified: `git + rev-parse localai-paged` = `653bb2f3d`); topic branch `p6-fp8-kv` retained on the DGX at + `653bb2f3d` (base = the unmodified measurement worktree), **NOT pushed**; LocalAI series + stays at **46 patches (`0001-0055`)**; P3's `p3-w4a16-direct` (`8eef7ba43`, WIP NO-GO on + its own branch, not landed to `localai-paged`) is **untouched**. Artifacts on the DGX + under `~/bench/p6_fp8_kv/`: `ceiling_20260702_215535/` (Stage 0a nsys `.nsys-rep`/`.sqlite` + + `kern.csv` for 6 shapes, verified KV occupancy), `q8proxy_20260702_223414/` (the 20-rep + Q8_0 A/B raws + `ab.log`), `md5gate/` (the re-run canonical md5 outputs), and the runners + `p6_ceiling_v2.py` (the per-kernel decode-isolation estimator) + `p6_q8proxy_ab.sh`. The + build worktree is `~/llama-paged-p6` (branch `p6-fp8-kv`, sm_121a, 0 dirty). + +- **HANDOFF (only if the capacity feature is later funded).** The throughput lever is a + measured NO-GO - do not re-run it on GB10. 
If a future effort wants the **capacity** win: + (1) the storage path already works (`-ctk/-ctv q8_0`/e4m3 on the paged binary); wire + `LLAMA_KV_FP8=1` to select e4m3 `type_k/type_v` at `llama_init_from_model`, gated per-path; + (2) gate on **footprint** (bytes/seq at fixed VRAM) and **KL** (per-path, paged AND + non-paged, both models, KLD delta < 0.01 + same-top-p >= 84%), NOT on t/s; (3) expect + throughput-flat-to-slightly-negative on the decode path per this record. The datacenter- + Blackwell pivot (HBM, native tcgen05) is where the KV-BW lever inverts, per the program + conclusion. + +--- + +## 4. Program-level arithmetic (if all phases land) + +> **SUPERSEDED (2026-07-02).** This subsection is the *pre-execution projection* ("if all +> phases land"). The program has now run end-to-end and only **P1 landed** (P2/P3/P4/P5 +> rejected, P6 a measured throughput NO-GO). The measured reality is in **section 4a +> (PROGRAM CONCLUSION)** below; read it for the real numbers. This projection is kept for +> provenance - to show what was expected and by how much reality diverged. + +Conservative, showing the math. Baselines from section 2. + +**Prefill (MoE decision model, paged 395.9 us/tok, vLLM 197.0, gap 198.9):** + +| Bucket | delta | phase | conservative recovery | MEASURED | +|---|---:|---|---:|---| +| 3 dtype boundary tax | +36.6 | P1 | ~30 | **~8.4 us/tok @512** (P1 LANDED, projection-boundary portion only) | +| 4 norms/glue (part) | +37.2 | P1 (norms) + P6 (FP8 proj) | ~18 | **norms in P1's segment; P6 FP8-proj never reached (gated behind the fp8-KV NO-GO)** | +| 2 GEMM tiling | +56.5 | P2 + P3 (NO-GO, CONFIRMED FLOOR) | ~~40~~ **0** | **0** - P2 flat (layout-only), P3 -48/-49% (bf16=half-FP4 peak); FP4-MMQ optimal | +| 1 GDN scan | +59.2 | P5 (NO-GO, CONFIRMED FLOOR) | 0 | **0** - M5 fused smem-resident scan is the GB10 BW floor; FLA 2.12x slower | +| 5 dispatch | +5.9 | P2/P4 (both NO-GO) | ~~3~~ **0** | **0** - both levers rejected | + +Recovered ~91-101 us/tok of 198.9.
New paged wall ~295-305 us/tok. **Prefill S_PP goes +from 36% to ~55-65% of vLLM** (throughput ratio 197/300 ~= 66% best case, ~55% +conservative). Roughly a doubling. **What remains unreachable:** the GDN-scan 2.62x +residual (bucket 1: shared-hardware floor of 99 KB smem forcing C=16 + the GB10-hostile +blocked inverse) and the bf16-vs-FP4 peak ratio ceiling on the GEMM (FP4-MMQ already +optimal). Full 100% prefill parity requires datacenter Blackwell (tcgen05 + HBM + TMEM). + +**Serving aggregate (llama server 718 t/s = 60.7% of vLLM server 1177; vLLM true +GPU-steady 1078):** + +- ~8 pt is vLLM measurement inflation (not ours to recover; it means the honest target + is 1078, not 1177). +- ~17 pt scheduler/graph-reuse: P4 + S3 recover ~10 pt on GB10 (host-loop is + GB10-compute-bound, so P4's throughput payoff here is bounded; the rest is TTFT). +- ~14 pt GPU-steady kernel residual: P2+P3 (MoE fused-Marlin ~11 ms) + P1 (Triton + elementwise ~10 ms) recover ~10-12 pt. + +llama server goes ~60.7% -> **~80-83% of vLLM server** (~87-90% of vLLM's true +GPU-steady). Decode GPU-steady is already 86% of true; P1+P2+P3 close most of the 14 pt +residual to **~95%+ of vLLM's true GPU-steady**, with low-N dense already leading +(116.7% at N=8). + +**TTFT:** P4 (continuous batching + chunked prefill co-batching decode) plus the prefill +gains (P1/P2/P3) target the 3.4x TTFT gap. Conservative: TTFT gap closes from ~3.4x to +~1.5-2x under load. It is bounded below by prefill throughput, which the program roughly +doubles. + +**What stays unreachable and why:** (1) the GDN recurrent-scan bandwidth plateau (shared +hardware, and paged already leads); (2) the C=16-forcing 99 KB smem cap on the GDN solve +(joint algorithm+hardware); (3) the bf16 = half-FP4 tensor-core peak on sm_121. These are +the genuine floors; they lift only on datacenter Blackwell, not on GB10. 
The program's +honest ceiling on GB10 is roughly **prefill ~55-65%, serving-agg ~80%, decode-GPU-steady +~95%, TTFT within ~2x** of vLLM - a large closure of the current 2-3x, not 100% parity. + +### 4a. PROGRAM CONCLUSION (measured, 2026-07-02) - the projection above is corrected to reality + +The additive program has run end-to-end. **Six phases were gated; exactly one landed.** +This subsection records what actually happened and corrects the section-4 projection to the +measured reality, so the doc ends truthful. + +**Phase outcomes (all RESULTS above):** + +| Phase | Lever | Verdict | Net recovery | +|---|---|---|---:| +| P1 | bf16-native residual-segment executor (`LLAMA_BF16_STREAM`) | **LANDED** (default-off), 3 fork commits -> `653bb2f3d`, series `0053-0055` | **+2% MoE prefill @512** (~8.4 us/tok; bucket-3 projection boundary) | +| P2 | expert-major fused MoE region (`LLAMA_MOE_REGION_EXECUTOR`) | NO-GO (flat + 0-engagement on q36's separate-gate/up shape) | 0 | +| P3 | W4A16 direct-A Marlin GEMM (`LLAMA_W4A16_DIRECT_A`) | NO-GO (-48/-49%; slower than grouped) | 0 | +| P4 | continuous-batching scheduler (`LLAMA_CONTINUOUS_BATCH_V2`) | NO-GO (TTFT regresses; not a GB10 throughput lever) | 0 | +| P5 | FLA-faithful GDN prefill scan (`LLAMA_GDN_FLA_CHUNK`) | NO-GO (FLA 2.12x slower than M5) | 0 | +| P6 | fp8-e4m3 KV cache (`LLAMA_KV_FP8`) | NO-GO (measured: Q8_0-KV proxy flat/regresses at the highest-ceiling shape; throughput-only) | 0 | + +**The completed prefill story - which buckets are confirmed floors, and by what evidence.** +Of the five prefill buckets (gap 198.9 us/tok, MoE decision model): + +- **Bucket 1 (GDN scan, +59.2) = CONFIRMED SHARED-HARDWARE FLOOR (P5).** The whole FLA + pipeline in-backend (register/smem-resident inter-chunk state, chunk loop in-kernel) ran + **2.12x slower** than the shipped M5 fused scan (119.46 vs 56.31 us/tok @npp2048, S_PP + -13.3%). 
Per-kernel nsys: the blocked `solve_tril` is only ~2.8% of the bucket; the floor + is the state-GEMM + per-chunk h-materialization to LPDDR5x that FLA's split-kernel + structure forces (+ the 99 KB smem cap forcing that split). M5 is at/near the GB10 + memory-bandwidth floor. +- **Bucket 2 (GEMM tiling, +56.5) = CONFIRMED FP4-MMQ-OPTIMAL FLOOR (P2 + P3).** P2 (compact + expert-major layout) was flat on its sentinel and engaged 0x on q36. P3 (W4A16 direct-A, + the forensics-informed retry) removed the integration tax the retry hypothesis blamed + (act-quant 18.92 -> ~0, host expert-sort + src1-gather + separate cast eliminated) and + **still lost -48/-49%**. nsys graph-node decomposition: the mature bf16 grouped-W4A16 GEMM + = 323.90 us/tok = **1.97x** the FP4-MMQ int8 GEMM (164.6) = exactly the **bf16 = half + int8/FP4 tensor-core peak ratio on sm_121**. FP4-MMQ is optimal; the ceiling is silicon. +- **Buckets 3+4 (dtype boundary + norms/glue, +73.8) = PARTIALLY RECOVERED (P1) / NO-GO + (P6).** P1 landed the bf16-native residual-segment executor and recovered the + **projection-boundary portion of bucket 3** (~8.4 us/tok @512, ~+2% on the MoE model; + dense is a no-op because its projections are NVFP4, not BF16, so nothing engages). The + norms live inside P1's owned segment; the remaining glue and the FP8-projection portion of + bucket 4 were P6's target, which measured NO-GO (the KV-dtype half of P6 is a measured + throughput dead end; the FP8-projection half was gated behind it and never reached). +- **Bucket 5 (dispatch, +5.9) = 0** (P2/P4 both rejected). + +**What the program actually recovered.** **P1's ~8.4 us/tok @512 on the MoE model (+2%), +~4.0 @2048** - the bucket-3 projection boundary, KL-benign (in fact KL-improving), safe, +default-off. Nothing else moved. 
+ +**Corrected closure numbers (replacing the projection above):** + +- **Prefill: ~50-51% of vLLM, NOT ~55-65%.** The projection assumed all phases land and + recover ~91-101 of the 198.9 us/tok gap (new wall ~295-305, "roughly a doubling"). + Measured: only P1 landed, recovering **~8.4 us/tok** of the gap (new MoE wall ~387.5), + so prefill throughput moves from ~49.8% (197.0/395.9) to **~50.8%** (197.0/387.5) of + vLLM - a **+2% relative MoE improvement, not a doubling**. The projected doubling was + falsified because the two largest buckets (1 + 2 = +115.7 of the 198.9 gap) are now + **confirmed silicon/bandwidth floors on GB10**, not recoverable levers. +- **Serving-aggregate: stays ~60.7% of vLLM server, NOT ~80-83%.** The ~10 pt scheduler + recovery was P4, now REJECTED (CBv2 regresses TTFT on GB10; the host-loop-dead measurement + is real). The MoE-GEMM (P2+P3) and its ~10-12 pt decode-residual recovery were REJECTED. + So the in-backend serving-agg recovery on GB10 is ~0; the ~80% figure was contingent on + levers that did not land. +- **Decode-GPU-steady: stays ~86% of vLLM's true GPU-steady, NOT ~95%.** The 14 pt residual + was to be closed by P1+P2+P3 kernel wins; P2/P3 rejected and P1 is a prefill lever + (decode M<128 bails). Low-N dense already leads (116.7% at N=8); that standing result is + unchanged. The ~95% target required the rejected GEMM levers. +- **TTFT: stays ~3.4x, NOT ~1.5-2x.** P4 was the TTFT lever and it *regressed* TTFT + (fair-share chunked prefill is processor-sharing; patch 0016's decode-first budget already + captures the schedulable win). TTFT parity routes through prefill compute, which is now + floored. It does not close in-backend on GB10. 
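The prefill correction in the list above is plain ratio arithmetic on per-token wall times. A minimal Python sketch, assuming only the figures quoted in this section (vLLM 197.0 us/tok, paged 395.9 us/tok, P1 recovery ~8.4 us/tok); the helper name `prefill_ratio` is hypothetical and not part of any tool in the repo:

```python
def prefill_ratio(vllm_us_per_tok: float, paged_us_per_tok: float) -> float:
    """Paged prefill throughput as a fraction of vLLM's.

    Inputs are wall time per token, so the faster side has the smaller
    number and the throughput ratio is vllm / paged.
    """
    return vllm_us_per_tok / paged_us_per_tok


VLLM_US = 197.0       # vLLM prefill, us/tok (MoE decision model)
PAGED_US = 395.9      # paged prefill before P1, us/tok
P1_RECOVERY_US = 8.4  # measured P1 recovery @512, us/tok

before = prefill_ratio(VLLM_US, PAGED_US)
after = prefill_ratio(VLLM_US, PAGED_US - P1_RECOVERY_US)
relative_gain = after / before - 1.0

print(f"before P1: {before:.1%} of vLLM")
print(f"after  P1: {after:.1%} of vLLM")
print(f"relative prefill gain: {relative_gain:+.1%}")
```

Swapping `P1_RECOVERY_US` for the projected 91-101 us/tok shows how far the superseded projection sat from the measured result.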
+ +**What remains (small / non-GB10):** + +- **P6 FP8-KV (small, MEASURED NO-GO for throughput).** The retry ran the kill-gate: the + measured decode ceiling (v2 per-kernel isolation) tops at **+8.81% fa-only at ctx8192 x8** + and clears +3% only at long ctx, but the **zero-code Q8_0-KV A/B proxy** at that exact + highest-ceiling shape is **flat on dense (+0.37%) and regresses on MoE (-2.63%)** - the + dequant-in-attention cost eats the KV-read BW saving. Since e4m3's KQ path is strictly + worse than Q8_0's integer DP4A dot, e4m3 is a definitive throughput NO-GO and was not + built. (This also refutes the earlier *analytical* 0.65% standard-shape estimate in both + directions - see the P6 RESULT.) The **capacity-play framing stays open** (halving stored + KV bytes for the 10/40 attention layers is a real long-ctx / high-concurrency capacity win, + independent of throughput) for a future capacity-motivated effort. +- **Non-GB10 portability of the P4/P5 artifacts.** P4's CBv2 scheduler has a genuine + throughput payoff on **host-bound (non-GB10) silicon** where decode goes host-loop-limited + again; it is TTFT/fairness/enabler-only on GB10. The datacenter-Blackwell pivot + (tcgen05 + HBM + TMEM) is where buckets 1+2 lift: native CUTLASS grouped-FP4 removes the + bf16-peak ceiling (bucket 2) and larger smem + HBM removes the GDN split + per-chunk h + round-trip (bucket 1). Also carried: P1's `LLAMA_BF16_CUBLAS_F32_OUT` plank and the 0034 + FP4-MMA kernel are portable-with-prereqs. 
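The +8.81% fa-only ceiling and the Q8_0 verdict quoted in the P6 bullet above can be re-derived from the Stage 0a/0b numbers. The sketch below is an illustration under stated assumptions, not the `p6_ceiling_v2.py` estimator: it assumes the table's fa-only column is the throughput gain from halving flash-attn time (`step / (step - 0.5 * fa) - 1`), which is consistent with the recorded rows, and it applies the `max(2%, 3 sigma)` promotion rule with the worst-case same-binary sigma of 0.22%. Function names are hypothetical.

```python
def fp8_kv_ceiling(step_us: float, fa_us: float) -> float:
    """Best-case decode throughput gain if fp8 KV halves flash-attn time.

    Assumes perfect bandwidth halving and zero dequant cost, so half of
    the flash-attn time disappears from each decode step.
    """
    saved_us = 0.5 * fa_us
    return step_us / (step_us - saved_us) - 1.0


def passes_gate(delta_pct: float, sigma_pct: float) -> bool:
    """Noise-floor promotion rule: the median gain must exceed max(2%, 3 sigma)."""
    return delta_pct > max(2.0, 3.0 * sigma_pct)


# (GPU decode step us/iter, flash-attn us) from the Stage 0a table.
SHAPES = {
    "moe std ctx512 x128": (168_397, 7_108),
    "dense std ctx512 x128": (354_892, 23_628),
    "moe ctx4096 x8": (39_945, 2_999),
    "dense ctx4096 x8": (106_672, 9_767),
    "moe ctx8192 x8": (43_354, 5_786),
    "dense ctx8192 x8": (116_297, 18_836),
}

ceilings = {name: fp8_kv_ceiling(step, fa) * 100 for name, (step, fa) in SHAPES.items()}
for name, pct in ceilings.items():
    print(f"{name:24s} fa-only ceiling {pct:+.2f}%")

# Stage 0b measured Q8_0-KV deltas (percent) at ctx8192 x8.
for name, delta in {"dense": 0.37, "moe": -2.63}.items():
    verdict = "GO" if passes_gate(delta, 0.22) else "NO-GO"
    print(f"{name}: {delta:+.2f}% -> {verdict}")
```

Both measured deltas sit below the 2% floor, so the proxy fails the gate even though the ceiling itself clears +3% at long context.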
+ +**Reconciliation with the standing program conclusion.** This end-to-end result **confirms +and strengthens** the standing conclusion (`VLLM_PARITY_FINAL.md`, `PARITY_HANDOFF.md`) that +**GB10 throughput-parity is unreachable by exhaustion.** The prefill story is now complete: +its two largest buckets are confirmed floors by direct in-backend experiment (not +assumption), the recoverable software tax was the ~5% bucket-3 boundary (P1 captured the +~2% MoE projection-portion of it), and the binding ceilings - **LPDDR5x bandwidth on the GDN +per-chunk h round-trip, the 99 KB smem cap forcing the GDN split, and bf16 = half-FP4/int8 +tensor-core peak on sm_121** - are **silicon, lifting only on datacenter Blackwell**. The +honest measured closure on GB10 is: **prefill ~50-51%, serving-agg ~60.7%, decode-GPU-steady +~86% (low-N dense leading), TTFT ~3.4x** of vLLM - i.e. the paged fork's **precision** parity +and memory advantage stand (see `VLLM_PARITY_FINAL.md`), while **throughput** parity is +GB10-hardware-bound. Default path untouched throughout; canonical md5s green +(MoE `8cb0ce23`, dense `5951a5b4`); series 46 patches; fork `localai-paged` HEAD `653bb2f3d`. + +--- + +## 5. Execution rules (non-negotiable) + +1. **Fork-first, always.** `mudler/llama.cpp:localai-paged` is canonical. Commit+push the + fork branch FIRST, THEN regenerate the LocalAI patch series via `git format-patch` + (1:1 tree-hash mirror). Never edit the series directly or add a patch with no fork + commit (drift caused the build-broken 0044/0045). See + [`PATCH_MAINTENANCE.md`](PATCH_MAINTENANCE.md). +2. **Per-path correctness gate.** Math-preserving change -> **per-path greedy md5** + (canonical MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense + `5951a5b4d624ce891e22ab5fca9bc439`; paged md5 != non-paged md5 by design). 
+ Dtype/algorithm-changing change -> **KL band** (same-top-p >= the recorded baseline, + KLD not worse than the current path; see [`PAGED_BITEXACT_NOTE.md`](PAGED_BITEXACT_NOTE.md)). + Never force the md5 gate on a bf16/fp8 path. +3. **Noise-floor promotion rule.** Keep a lever only if its **median** improvement + exceeds **max(2%, 3 sigma)** over the control medians. Flat-within-noise is a reject. +4. **Decode profiling MUST use `--cuda-graph-trace=node`.** Without it, nsys collapses + each replayed decode graph into one opaque launch and reports a false "host-bound + ~16% GPU busy" artifact (this is the mislabel that produced the retired ~56% headline; + the true number is ~86%). +5. **One lever per A/B.** A standalone PoC win is **not** a result; gate on a + separately-built in-backend A/B with only that lever changed. 0034 won as a PoC + (57.7% FP4 peak, NMSE=0) and lost in-backend; that is the rule's origin. +6. **Record every rejected lever** in [`PARITY_HANDOFF.md`](PARITY_HANDOFF.md) with the + DGX artifact path, the numeric result, and the mechanism verdict (integration tax vs + kernel-intrinsic vs shared-hardware floor). The rejected-lever log is load-bearing: + it is what prevents re-litigating a floor. + +--- + +## 6. Risks and open questions + +- **P5 is a shared-hardware floor - RESOLVED / CONFIRMED (2026-07-02, see the P5 RESULT + above).** Phase74's standalone blocked-inverse ran at 0.59x the direct solve. The open + question was whether the full FLA pipeline *in-backend* (register-resident inter-chunk + state, chunk loop in-kernel) behaves differently from the standalone bench. **Answer: + no - it is 2.12x SLOWER than M5 at npp2048 (119.46 vs 56.31 us/tok), S_PP -13.3%.** The + per-kernel decomposition showed the blocked solve is only 2.8% of the bucket; the floor + is the state-GEMM + per-chunk h-materialization to LPDDR5x that FLA's split-kernel + structure forces (and the 99 KB smem cap that forces that split). 
P5 recovers 0 and is a + **confirmed shared-hardware / memory-bandwidth floor.** +- **P1 segment-boundary converts.** Option A keeps f32 at segment edges; if the q36 + residual stream has many short segments, the boundary converts could eat the win. + Open: how many bf16 segments survive across a q36 layer, and does the shared-expert + path fork the stream? +- **P2/P3 all-or-nothing + aliasing - RESOLVED / CONFIRMED FLOOR (2026-07-02, see the P2 + and P3 RESULTs above).** Both levers ran and both are NO-GO: P2 (compact expert-major + layout) is flat on its sentinel and engages 0x on q36's separate `ffn_gate_exps`/ + `ffn_up_exps` + `ggml_swiglu_split` shape (the merged whole-pattern matcher never fires); + P3 (W4A16 direct-A) removed the integration tax the retry blamed and **still lost + -48/-49%** because the mature bf16 W4A16 GEMM is 1.97x the FP4-MMQ int8 GEMM (bf16 = half + int8/FP4 tensor-core peak on sm_121). **Bucket 2 (GEMM tiling, +56.5) is a confirmed + FP4-MMQ-optimal floor on GB10**, joining bucket 1. The aliasing caution stands for any + future re-scope of the seam to q36's separate/scaled shape (the prerequisite handoff in + the P2 RESULT), but it is no longer an open program risk - the lever is closed. +- **CUDA-graph capture safety.** Region-executor pool allocs must be shape-stable across + replays (keyed on n_tokens/n_experts, never on data-dependent routing counts) or they + force re-capture and negate the graph-reuse win. Dovetails with S1 (patch 0040). +- **Rebase risk concentration.** `ggml-cuda.cu` (8 patches), `mmq.cu` (5), `ggml.c`/`.h` + (5 each), `llama-kv-cache.cpp` (7), `gated_delta_net.cu` (6) are exactly the files + upstream churns for fusion/MoE. Mitigation is the series discipline: new `.cu` files, + narrow additive `ggml_can_fuse` clauses, no new ggml tensor types, re-baseline md5 on + every pin bump (weekly canary). 
+- **P4 is throughput-neutral on GB10.** Its measured value there is TTFT + fairness + + enabling P2/P3; the throughput payoff is on non-GB10 silicon. Risk: over-investing in + P4 as a GB10 throughput lever. Scope it as the enabler it is. +- **Datacenter-Blackwell dependency.** The program targets ~55-80% closure on GB10, not + 100%. The residual floors (GDN scan BW, C=16 smem cap, bf16=half-FP4 peak) lift only on + tcgen05 + HBM + TMEM silicon. Do not promise GB10 parity. +- **Upstream may solve pieces for us.** PR #11867 (overlap graph build with processing) + serves P4 on non-GB10; `GGML_CUDA_GRAPH_OPT` streams serve P3; PR #16016 (deterministic + MoE mul_mat_id) could shift our recorded md5s (keep the per-path gate, re-baseline on + pin bump). Align, do not duplicate. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md new file mode 100644 index 000000000000..928c07a9440b --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md @@ -0,0 +1,3874 @@ +# GB10 Parity Phase 0 Results + +Status: in progress. 
+ +## Preflight + +- DGX host: `promaxgb10-4ad8` +- Docker containers: `none` +- GPU compute apps: `none` +- GPU lock owner: `FREE released-by-claude-fp4norm-profile 1782828229` +- LocalAI worktree SHA: `d288a0300f36f7c126d62d997809bb03f297a3ac` +- Local llama.cpp fork SHA: `51168c5eee2e35348d9006f0b2fab3dc6e7c01cc` +- DGX artifact directory: `~/bench/reopen_phase0` + +## Baseline Runs + +Clean prefill baseline artifacts: + +- MoE: `~/bench/reopen_phase0/paged_moe_prefill.txt` +- Dense: `~/bench/reopen_phase0/paged_dense_prefill.txt` + +MoE paged prefill: + +| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s | +|----|----|---|------|--------|----------|--------|----------|-----|-------| +| 512 | 4 | 32 | 16512 | 7.181 | 2281.66 | 0.355 | 360.57 | 7.536 | 2191.16 | +| 2048 | 4 | 32 | 65664 | 27.131 | 2415.53 | 0.328 | 390.84 | 27.459 | 2391.38 | + +Dense paged prefill: + +| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s | +|----|----|---|------|--------|----------|--------|----------|-----|-------| +| 512 | 4 | 32 | 16512 | 16.749 | 978.18 | 0.842 | 152.03 | 17.591 | 938.64 | +| 2048 | 4 | 32 | 65664 | 63.791 | 1027.35 | 0.687 | 186.29 | 64.479 | 1018.38 | + +## Decode Difference-Method Reproduction + +Paged llama.cpp artifacts: + +- `~/bench/reopen_phase0/paged_decode_nsys/paged_moe_n256_ntg16.nsys-rep` +- `~/bench/reopen_phase0/paged_decode_nsys/paged_moe_n256_ntg16.bench.log` +- `~/bench/reopen_phase0/paged_decode_nsys/paged_moe_n256_ntg64.nsys-rep` +- `~/bench/reopen_phase0/paged_decode_nsys/paged_moe_n256_ntg64.bench.log` + +Paged llama.cpp rows: + +| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s | +|----|----|---|------|--------|----------|--------|----------|-----|-------| +| 128 | 16 | 256 | 36864 | 14.933 | 2194.39 | 4.502 | 909.80 | 19.435 | 1896.81 | +| 128 | 64 | 256 | 49152 | 14.949 | 2191.96 | 17.924 | 914.09 | 32.873 | 1495.21 | + +Paged difference-method decode: + +- Token 
delta: `256 * (64 - 16) = 12288` +- Wall delta: `17.924 - 4.502 = 13.422 s` +- Decode throughput: `915.51 t/s` + +vLLM artifacts: + +- `~/bench/reopen_phase0/vllm_decode_nsys/vllm_version.txt` +- `~/bench/reopen_phase0/vllm_decode_nsys/dec_npl256_ntg16.nsys-rep` +- `~/bench/reopen_phase0/vllm_decode_nsys/dec_npl256_ntg16.run.log` +- `~/bench/reopen_phase0/vllm_decode_nsys/dec_npl256_ntg16.kern.csv` +- `~/bench/reopen_phase0/vllm_decode_nsys/dec_npl256_ntg16.gpu_trace.csv` +- `~/bench/reopen_phase0/vllm_decode_nsys/dec_npl256_ntg64.nsys-rep` +- `~/bench/reopen_phase0/vllm_decode_nsys/dec_npl256_ntg64.run.log` +- `~/bench/reopen_phase0/vllm_decode_nsys/dec_npl256_ntg64.kern.csv` +- `~/bench/reopen_phase0/vllm_decode_nsys/dec_npl256_ntg64.gpu_trace.csv` + +vLLM version: `0.23.0` + +vLLM profiled rows: + +| NSEQ | GEN | Generated tokens | Wall s | Logged tok/s | +|------|-----|------------------|--------|--------------| +| 256 | 16 | 4096 | 6.195 | 661.2 | +| 256 | 64 | 16384 | 17.607 | 930.5 | + +vLLM difference-method decode: + +- Token delta: `16384 - 4096 = 12288` +- Wall delta: `17.607 - 6.195 = 11.412 s` +- Decode throughput: `1076.76 t/s` + +Clean reproduced paged/vLLM decode ratio: `85.0%`. 
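The difference method subtracts the short-generation run from the long one, so prefill and fixed startup cost cancel and only steady-state decode remains. A minimal Python sketch of that arithmetic, using the rows reported above; the function and argument names are illustrative and not part of any benchmark tooling:

```python
def difference_method_tps(tokens_short, tokens_long, wall_short_s, wall_long_s):
    """Decode throughput from two runs that differ only in generation length."""
    token_delta = tokens_long - tokens_short
    wall_delta_s = wall_long_s - wall_short_s
    if token_delta <= 0 or wall_delta_s <= 0:
        raise ValueError("the long run must generate more tokens and take longer")
    return token_delta / wall_delta_s


# Paged llama.cpp: 256 sequences, T_TG seconds for TG=16 and TG=64.
paged_tps = difference_method_tps(256 * 16, 256 * 64, 4.502, 17.924)
# vLLM: generated-token totals and wall seconds from the profiled rows.
vllm_tps = difference_method_tps(4096, 16384, 6.195, 17.607)
ratio_pct = 100.0 * paged_tps / vllm_tps

print(f"paged {paged_tps:.2f} t/s, vLLM {vllm_tps:.2f} t/s, ratio {ratio_pct:.1f}%")
```

Note that the paged rows use the `T_TG` column while the vLLM rows use total wall time; the subtraction is what makes the two comparable.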
+ +## W4A16 Kill-Gate Baseline + +Artifacts: + +- Default FP4-MMQ: `~/bench/reopen_phase0/w4a16_off.txt` +- Forced W4A16 with debug: `~/bench/reopen_phase0/w4a16_on_thr64.txt` +- Forced W4A16 without debug: + `~/bench/reopen_phase0/w4a16_on_thr64_nodebug.txt` + +Default FP4-MMQ: + +| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s | +|----|----|---|------|--------|----------|--------|----------|-----|-------| +| 512 | 4 | 32 | 16512 | 7.105 | 2306.06 | 0.321 | 399.00 | 7.426 | 2223.68 | +| 2048 | 4 | 32 | 65664 | 27.047 | 2423.00 | 0.329 | 388.89 | 27.377 | 2398.55 | + +Forced W4A16, `LLAMA_W4A16_PREFILL_M=64`, debug off: + +| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s | +|----|----|---|------|--------|----------|--------|----------|-----|-------| +| 512 | 4 | 32 | 16512 | 12.517 | 1308.92 | 0.321 | 398.82 | 12.838 | 1286.17 | +| 2048 | 4 | 32 | 65664 | 49.165 | 1332.98 | 0.330 | 387.57 | 49.495 | 1326.67 | + +Delta: + +- `npp=512`: `-43.2%` S_PP versus default FP4-MMQ. +- `npp=2048`: `-45.0%` S_PP versus default FP4-MMQ. + +Debug evidence: + +- Forced W4A16 debug run emitted `19200` engagement lines. +- Observed `n_tiles` range: `139..282`. +- Observed `multi_tile_experts` range: `7..21`. + +First implementation target: + +- Option B: device-side or cached tile metadata. +- Rationale: `w4a16-gemm.cu` currently builds `h_tile_expert`, + `h_tile_row0`, and `h_tile_rows` on the host, pool-allocates three device + tile-map buffers, and issues three H2D `cudaMemcpyAsync` calls per grouped + W4A16 launch. The debug run shows this path is repeatedly exercised across + many small ragged tile maps. The first fork-first experiment should remove or + amortize that host-built tile-map path before retuning MMA tile shapes. + +## W4A16 Metadata Phase 1 + +Fork commit: `4b0cc1163cc42dc1c17892fd41ce5ab384ba3e17` +(`feat(paged): pack W4A16 grouped tile metadata`). 
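The Phase 0 rationale describes three host-built arrays (`h_tile_expert`, `h_tile_row0`, `h_tile_rows`) and three H2D copies per grouped launch. The sketch below is a hypothetical host-side model of packing that metadata into one contiguous buffer so a single copy would suffice. It is not the code in `w4a16-gemm.cu` or patch `0048`: the tile height of 64, the meaning of `row0` as an offset inside the expert's row block, and all function names are assumptions made for illustration.

```python
import struct


def build_tile_map(rows_per_expert, bm=64):
    """Split each expert's row block into tiles of at most `bm` rows.

    Returns (expert, row0, rows) triples. `row0` is the offset inside that
    expert's block, which is an assumption of this sketch.
    """
    tiles = []
    for expert, n_rows in enumerate(rows_per_expert):
        for row0 in range(0, n_rows, bm):
            tiles.append((expert, row0, min(bm, n_rows - row0)))
    return tiles


def pack_tile_map(tiles):
    """Pack all triples into one contiguous little-endian int32 buffer."""
    return b"".join(struct.pack("<3i", *tile) for tile in tiles)


rows_per_expert = [0, 70, 5, 130]  # ragged routing counts, made up for the example
tiles = build_tile_map(rows_per_expert)
packed = pack_tile_map(tiles)
multi_tile_experts = sum(1 for n in rows_per_expert if n > 64)

print(len(tiles), multi_tile_experts, len(packed))
```

Experts with zero routed rows contribute no tiles, which is why the tile maps in the debug run are ragged and vary from launch to launch.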
+ +LocalAI patch mirror: `0048-feat-paged-pack-W4A16-grouped-tile-metadata.patch`. + +Mirror invariant: applying the full LocalAI `patches/paged/*.patch` series to +base pin `0ed235ea2c17a19fc8238668653946721ed136fd` tree-matches fork HEAD +`4b0cc1163cc42dc1c17892fd41ce5ab384ba3e17`. + +Artifacts: + +- Diff: `~/bench/w4a16_phase1/packed_desc.diff` +- Build mtimes: `~/bench/w4a16_phase1/build_binary_mtimes.txt` +- MoE gate: `~/bench/w4a16_phase1/gate_moe.md5` +- Dense gate: `~/bench/w4a16_phase1/gate_dense.md5` +- Default FP4-MMQ: `~/bench/w4a16_phase1/w4a16_off.txt` +- Packed W4A16: `~/bench/w4a16_phase1/w4a16_on_thr64.txt` + +Canonical gates: + +- MoE greedy md5: `8cb0ce23777bf55f92f63d0292c756b0` (matched expected) +- Dense greedy md5: `5951a5b4d624ce891e22ab5fca9bc439` (matched expected) + +Packed descriptor A/B: + +| Path | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s | +|------|----|----|---|------|--------|----------|--------|----------|-----|-------| +| FP4-MMQ | 512 | 4 | 32 | 16512 | 7.114 | 2303.07 | 0.323 | 396.55 | 7.437 | 2220.32 | +| FP4-MMQ | 2048 | 4 | 32 | 65664 | 27.045 | 2423.23 | 0.331 | 387.14 | 27.376 | 2398.64 | +| W4A16 packed | 512 | 4 | 32 | 16512 | 12.468 | 1314.08 | 0.322 | 397.97 | 12.790 | 1291.04 | +| W4A16 packed | 2048 | 4 | 32 | 65664 | 48.930 | 1339.39 | 0.330 | 387.44 | 49.260 | 1333.00 | + +Result: + +- Packed descriptors improved forced W4A16 by `+0.39%` at `npp=512` and + `+0.48%` at `npp=2048` versus the Phase 0 no-debug W4A16 baseline. +- W4A16 remains `-42.9%` at `npp=512` and `-44.7%` at `npp=2048` versus + same-run default FP4-MMQ. +- Decision: keep patch `0048` as a small simplification, but pivot the next + W4A16 iteration to the activation cast or MMA/dequant tile body. 
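The percentages in the result above follow directly from the `S_PP t/s` columns. A small sketch of that calculation, with the helper name chosen here for illustration:

```python
def pct_delta(candidate_tps, baseline_tps):
    """Relative throughput change of candidate versus baseline, in percent."""
    return 100.0 * (candidate_tps / baseline_tps - 1.0)


# S_PP t/s keyed by npp, copied from the tables in this document.
phase0_w4a16 = {512: 1308.92, 2048: 1332.98}  # Phase 0 forced W4A16, debug off
packed_w4a16 = {512: 1314.08, 2048: 1339.39}  # Phase 1 packed descriptors
same_run_mmq = {512: 2303.07, 2048: 2423.23}  # Phase 1 default FP4-MMQ

for npp in (512, 2048):
    vs_phase0 = pct_delta(packed_w4a16[npp], phase0_w4a16[npp])
    vs_mmq = pct_delta(packed_w4a16[npp], same_run_mmq[npp])
    print(f"npp={npp}: {vs_phase0:+.2f}% vs Phase 0 W4A16, {vs_mmq:+.1f}% vs FP4-MMQ")
```

Comparing against the same-run FP4-MMQ rows rather than the Phase 0 ones keeps run-to-run drift out of the W4A16 deficit figure.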
+ +## W4A16 Kernel Shape Phase 2 + +Profile-guided target: + +- Phase 1 forced W4A16 profile at `npp=512`: `w4a16_grouped_kernel` dominated + at `5231.667 ms` (`47.8%`) while `w4a16_cast_act_f32_bf16` was `517.195 ms` + (`4.7%`). +- Phase 2 therefore targeted grouped-kernel tile shape/body before activation + cast fusion. + +Shape sweep artifacts: + +- Build: `~/llama-w4a16-phase2` +- Benchmarks: `~/bench/w4a16_phase2/shape_*.txt` +- Winning profile: `~/bench/w4a16_phase2/profile/w4a16_bm32_npp512.*` + +Shape A/B: + +| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision | +|-------|--------------|---------------|----------| +| `base` / `64x128` | 1308.02 | 1339.46 | old baseline | +| `bn256` | 1286.99 | 1311.56 | rejected | +| `bm32` / `32x128` | 1442.99 | 1475.65 | selected | +| `bn64` | 1334.80 | 1362.55 | diagnostic only | +| `stages3` | 1271.01 | 1295.96 | rejected | +| `bn256x16` | 1084.66 | 1100.95 | rejected | + +Only `bm32` and the old `base` selector are shipped in patch `0049`. The other +candidate shapes were benchmarked in the Phase 2 build and then deliberately +left out to keep the upstream conflict surface small. + +Default-verification after selecting `bm32`: + +| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s | +|----|----|---|------|--------|----------|--------|----------|-----|-------| +| 512 | 4 | 32 | 16512 | 11.360 | 1442.28 | 0.321 | 397.00 | 11.682 | 1413.43 | +| 2048 | 4 | 32 | 65664 | 44.529 | 1471.77 | 0.331 | 386.06 | 44.860 | 1463.75 | + +Result: + +- `bm32` improves forced W4A16 by about `+10.3%` at `npp=512` and `+10.2%` + at `npp=2048` versus the old `64x128` shape in the same sweep. +- The profiled `bm32` grouped kernel dropped to `4107.355 ms` (`41.7%`) at + `npp=512`, from Phase 1's `5231.667 ms` (`47.8%`). +- Canonical post-change gates matched: MoE + `8cb0ce23777bf55f92f63d0292c756b0`, dense + `5951a5b4d624ce891e22ab5fca9bc439`. 
+- Forced W4A16 shape gates matched each other: `LLAMA_W4A16_PREFILL_M=1` + default `bm32` and `LLAMA_W4A16_SHAPE=base` both produced + `07db32c2bcb78d17a43ed18bc22705cd` on the canonical gate prompt. +- Forced W4A16 `MUL_MAT_ID` op checks passed for both shapes: + `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1` reported `806/806` + for default `bm32` and `806/806` for `base`. +- Decision: make `bm32` the W4A16 default shape while keeping + `LLAMA_W4A16_SHAPE=base` for old-shape A/B and leaving other candidates as + diagnostics. + +Mirror invariant after patch `0049`: + +- Applying all 40 LocalAI `patches/paged/*.patch` files to base pin + `0ed235ea2c17a19fc8238668653946721ed136fd` tree-matches fork HEAD + `7dfa0e17548c5f04f83d2cc2a057b0a9941b599a`. +- Tree hash after patch application: `dabe225efbf20ec047b8309d1e1f19b34fc7c5c9`. + +## W4A16 Scale Broadcast Phase 3 + +Goal: reduce duplicate FP4 scale conversion inside `w4a16_grouped_kernel` by +having one lane per 4-lane group convert the `ue4m3` scale and broadcast it with +`__shfl_sync`. + +Artifacts: + +- Build: `~/llama-w4a16-phase3` +- Logs: `~/bench/w4a16_phase3` + +Gates: + +- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`. +- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`. +- Forced W4A16 `bm32` and old `base` shape md5s matched each other: + `07db32c2bcb78d17a43ed18bc22705cd`. +- Forced W4A16 `MUL_MAT_ID`: `806/806` on CUDA0. + +Performance: + +| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision | +|-------|--------------|---------------|----------| +| Phase 2 `bm32` | 1442.28 | 1471.77 | baseline | +| Phase 3 scale-broadcast `bm32` | 1392.46 | 1422.74 | rejected | +| Phase 2 `base` | 1310.13 | 1336.02 | baseline | +| Phase 3 scale-broadcast `base` | 1201.69 | 1221.25 | rejected | + +Result: + +- Rejected. No fork commit and no LocalAI patch `0050`. +- The local fork experiment was reverted. 
+- Do not retry this exact scale-broadcast approach; on GB10 the shuffle and/or + scheduling cost exceeds the saved duplicate scale conversion. + +## W4A16 Shared-Memory Padding Phase 4 + +Goal: reduce bank pressure in `w4a16_grouped_kernel` by padding the A operand +shared-memory row stride while preserving math order and launch shape. + +Fork commit: `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3` +(`feat(paged): pad W4A16 A shared tile stride`). + +LocalAI patch mirror: `0050-feat-paged-pad-W4A16-A-shared-tile-stride.patch`. + +Artifacts: + +- Build: `~/llama-w4a16-phase4` +- Logs: `~/bench/w4a16_phase4` + +Gates: + +- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`. +- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`. +- Forced W4A16 `bm32` and old `base` shape md5s matched each other: + `07db32c2bcb78d17a43ed18bc22705cd`. +- Forced W4A16 `MUL_MAT_ID`: `806/806` on CUDA0. + +Performance: + +| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision | +|-------|--------------|---------------|----------| +| Phase 2 `bm32` | 1442.28 | 1471.77 | baseline | +| Phase 4 A-pad `bm32` | 1466.62 | 1495.93 | selected | +| Phase 2 `base` | 1310.13 | 1336.02 | baseline | +| Phase 4 A-pad `base` | 1337.88 | 1364.98 | positive diagnostic | + +Result: + +- Kept. Default W4A16 `bm32` improves another `+1.7%` at `npp=512` and + `+1.6%` at `npp=2048` versus Phase 2. +- Applying all 41 LocalAI `patches/paged/*.patch` files to base pin + `0ed235ea2c17a19fc8238668653946721ed136fd` tree-matches fork HEAD + `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`. +- Tree hash after patch application: `8fcb151e0620fd0fc82b80c04318e5c34320b087`. + +## W4A16 Wq Padding Phase 5 + +Goal: test whether padding the quantized-weight shared-memory row stride gives +another low-conflict W4A16 grouped-kernel body win after `0050`. + +Artifacts: + +- Build: `~/llama-w4a16-phase5` +- Logs: `~/bench/w4a16_phase5` + +Gates: + +- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`. 
+- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`. +- Forced W4A16 `bm32` and old `base` shape md5s matched each other: + `07db32c2bcb78d17a43ed18bc22705cd`. +- Forced W4A16 `MUL_MAT_ID`: `806/806` on CUDA0. + +Performance: + +| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision | +|-------|--------------|---------------|----------| +| Phase 4 A-pad `bm32` | 1466.62 | 1495.93 | baseline | +| Phase 5 Wq-pad `bm32` | 1472.36 | 1504.82 | rejected: below 1% gate | +| Phase 4 A-pad `base` | 1337.88 | 1364.98 | baseline | +| Phase 5 Wq-pad `base` | 1337.70 | 1368.48 | diagnostic | + +Result: + +- Rejected. No fork commit and no LocalAI patch was created for that experiment. +- The local fork experiment was reverted. +- Do not ship Wq padding alone; the measured `+0.4%` / `+0.6%` default-shape + gain is below the maintenance threshold. + +## Clean Build + +First clean build attempt: + +- PID: `625392` +- Source checkout: `~/llama-paged-reopen-clean` +- Result: failed during CMake configure. +- Root cause: `nvcc` was not discoverable on PATH. CUDA headers were found under + `/usr/local/cuda/targets/sbsa-linux/include`, and the compiler exists at + `/usr/local/cuda-13.0/bin/nvcc`. +- Retry plan: rebuild the clean checkout with + `CUDACXX=/usr/local/cuda-13.0/bin/nvcc`. + +Second clean build attempt: + +- PID: `631100` +- Source checkout: `~/llama-paged-reopen-clean` +- Source status: `## HEAD (no branch)` +- Build HEAD: `51168c5eee2e35348d9006f0b2fab3dc6e7c01cc` +- CUDA compiler: `/usr/local/cuda-13.0/bin/nvcc` +- Result: succeeded. 
+- Binary mtimes: + - `build-cuda/bin/llama-server 2026-06-30 22:14:34.091312112 +0200` + - `build-cuda/bin/llama-batched-bench 2026-06-30 22:14:35.156287566 +0200` + - `build-cuda/bin/llama-completion 2026-06-30 22:14:37.095750242 +0200` + - `build-cuda/bin/test-backend-ops 2026-06-30 22:14:47.360078186 +0200` + +## Canonical Gates + +- MoE greedy md5: `8cb0ce23777bf55f92f63d0292c756b0` (matched expected) +- Dense greedy md5: `5951a5b4d624ce891e22ab5fca9bc439` (matched expected) +- Artifacts: + - `~/bench/reopen_phase0/gate_moe.txt` + - `~/bench/reopen_phase0/gate_moe.md5` + - `~/bench/reopen_phase0/gate_dense.txt` + - `~/bench/reopen_phase0/gate_dense.md5` + +## Source Provenance + +- Local llama.cpp fork: `/home/mudler/_git/llama.cpp` +- Branch: `localai-paged` +- Working tree: clean after fork commit `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3` +- Phase 0 HEAD: `51168c5eee2e35348d9006f0b2fab3dc6e7c01cc` +- Current HEAD: `cd56cf037379b084d6bb0ed47db8b785c828be86` +- Base pin: `0ed235ea2c17a19fc8238668653946721ed136fd` +- Merge-base with base pin: `0ed235ea2c17a19fc8238668653946721ed136fd` +- LocalAI patch count: `38` at Phase 0; current mirror count is `42` after + patch `0051`. +- LocalAI patch mirror: applies cleanly to the base pin and tree-matches fork + HEAD. +- Tree hash after patch application: `623b7cb008a929455ca3d9deae35494c02622fef` + +## Existing Artifact Gap Review + +Read-only DGX artifact inspection was performed after confirming the machine was +idle: `docker ps` returned no running containers, +`nvidia-smi --query-compute-apps` returned no compute-app rows, and +`~/gpu_bench_lock/owner` read +`FREE released-by-claude-fp4norm-profile 1782828229`. + +Existing paged llama.cpp decode and prefill numbers are supported by +`/home/mudler/bench/COMBINED_DEFINITIVE.txt`: MoE paged prefill lines 13-18, +MoE paged serving decode lines 23-26, dense paged prefill lines 43-48, and +dense paged serving decode lines 53-56. 
Supporting comparison artifacts are +`/home/mudler/bench/STOCK3WAY.txt`, `/home/mudler/bench/PREFILL_KNOB.txt`, +`/home/mudler/bench/DEFINITIVE_S3ab.txt`, and the adjacent raw logs. + +No self-contained vLLM `1078 t/s` GPU-steady `ntg16`/`ntg64` +difference-method artifact was found. The available vLLM evidence is +serving-run output in `/home/mudler/bench/COMBINED_DEFINITIVE.txt` plus +nsys/run artifacts under `/home/mudler/bench/profgap/` and +`/home/mudler/bench/postssm_decomp/`; these do not form a packaged +`ntg16`/`ntg64` difference-method report. + +W4A16/Marlin evidence exists in `/home/mudler/bench/vllm_prefix.log`, +`/home/mudler/bench/profgap/vllm_moe_decode.run.log`, and +`/home/mudler/bench/marlin_gate/kl_marlin.log`. +`/home/mudler/llama-paged-dev/LEVER3_ACTQUANT_FUSION_RESULTS.md` records the +parity conclusion: W4A16/Marlin is a precision-change lever, not a bit-exact +llama.cpp parity lever. + +GDN M5/M8 evidence exists in `/home/mudler/bench/COMBINED_DEFINITIVE.txt` +(`GDN CONFIG C (M8)` and production defaults noting GDN M5), +`/home/mudler/llama-paged-dev/LEVER1_GATHER_RESULTS.md`, and +`/home/mudler/llama-paged-dev/CONV_STATE_FUSION_RESULTS.md`. + +S3 evidence exists in `/home/mudler/bench/DEFINITIVE_S3ab.txt`; that A/B shows +S3-on was worse unless paired with `LLAMA_PAGED_PREFILL_PERIOD=1`, matching +`/home/mudler/bench/COMBINED_DEFINITIVE.txt` where S3 is recorded as off by +default. No separate self-contained adaptive-scheduling proof artifact was +found beyond the S3 and prefill-knob artifacts. + +## Open Items + +## Phase 6 Serving nsys Classifier + +Exact fork head `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3` was mirrored to +`/home/mudler/llama-phase6-source` on DGX and rebuilt with CUDA Release, +`CMAKE_CUDA_COMPILER=/usr/local/cuda-13.0/bin/nvcc`, and +`CMAKE_CUDA_ARCHITECTURES=121`. + +Pre-profile gates passed: + +- MoE greedy md5: `8cb0ce23777bf55f92f63d0292c756b0`. +- Dense greedy md5: `5951a5b4d624ce891e22ab5fca9bc439`. 
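The md5 gates recur before every profile and after every candidate patch. A minimal sketch of such a check, assuming the gate hashes the raw bytes of a greedy transcript; which bytes the real gate scripts hash, and the helper names, are assumptions here, while the two expected digests are the ones recorded in this document:

```python
import hashlib

# Canonical greedy md5 values recorded in this document.
EXPECTED_MD5 = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}


def transcript_md5(transcript: bytes) -> str:
    """Hex md5 digest of the raw transcript bytes."""
    return hashlib.md5(transcript).hexdigest()


def md5_gate(transcript: bytes, expected_md5: str) -> bool:
    """True when the transcript is byte-identical to the recorded baseline."""
    return transcript_md5(transcript) == expected_md5.lower()


# Self-check against a well-known digest; a real run would read the gate
# transcript from disk and compare it with EXPECTED_MD5["moe"] or ["dense"].
print(md5_gate(b"hello", "5d41402abc4b2a76b9719d911017c592"))
```

Because the comparison is on bytes, any change in whitespace or token boundaries counts as drift, which is the intent of a bit-exact gate.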
+ +Serving nsys artifacts: + +- llama.cpp: + `/home/mudler/bench/phase6_serving_nsys/llama_server_n128/`. +- vLLM: + `/home/mudler/bench/phase6_serving_nsys/vllm_server_n128/`. + +Same h2h shape (`n=128`, `ptok=128`, `gen=128`) under nsys: + +| Engine | decode tok/s/seq | decode agg tok/s | prefill tok/s | +|--------|------------------|------------------|---------------| +| llama.cpp | 4.05 | 591.0 | 1567.4 | +| vLLM | 6.95 | 961.1 | 5073.6 | + +llama.cpp bucket highlights: + +- `gated_delta_net_cuda`: 33.7% GPU kernel time, 10.21s. +- NVFP4 `mul_mat_q`: 24.3% + 5.5% for the largest grouped variants, 9.04s + combined. +- `quantize_mmq_nvfp4`: 2.7%, 0.81s. +- `flash_attn_tile`: 1.3%, 0.38s. +- CUDA API: `cudaStreamSynchronize` 76.5% API time, 23.66s over 106585 calls; + 8028 synchronizes followed `cudaMemcpyAsync` and summed 21.41s. + +vLLM bucket highlights: + +- `fused_recurrent_gated_delta_rule_packed_decode_kernel`: 16.6%, 8.95s. +- `marlin_moe_wna16::Marlin`: 11.9% plus smaller Marlin-MoE variants. +- `flash_fwd_splitkv_kernel`: visible split-K FA decode rows at 0.6% + 0.1%. +- The vLLM delayed profile still contains startup/module-load API noise; prefer + h2h and GPU kernel buckets over API percentages for vLLM. + +Rejected Phase 6 sampler experiment: + +- Patch idea: in backend distribution sampling, skip the random uniform upload + when prior backend filters already collapsed candidates to one token + (`temperature=0` path). +- Gates passed: + - MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense md5 `5951a5b4d624ce891e22ab5fca9bc439`. + - `MUL_MAT_ID`: `806/806` on CUDA0. +- Serving A/B did not clear the performance gate: no-nsys reps were `4.19` and + `3.55` tok/s/seq. The fork patch was reverted; no commit and no LocalAI patch + were created. 
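The sampler experiment was motivated by the CUDA API rows in the llama.cpp profile. The sketch below works through that arithmetic; the derived shares and means are computed here from the reported rows and were not separately measured.

```python
sync_total_s = 23.66        # cudaStreamSynchronize, all calls
sync_calls = 106585
post_memcpy_sync_s = 21.41  # synchronizes that followed cudaMemcpyAsync
post_memcpy_calls = 8028

# Share of synchronize time spent in the post-memcpy subset.
share_pct = 100.0 * post_memcpy_sync_s / sync_total_s
# Mean blocking time per call, in milliseconds, for each subset.
mean_post_ms = 1e3 * post_memcpy_sync_s / post_memcpy_calls
mean_other_ms = 1e3 * (sync_total_s - post_memcpy_sync_s) / (sync_calls - post_memcpy_calls)

print(f"{share_pct:.1f}% of sync time, {mean_post_ms:.2f} ms vs {mean_other_ms:.3f} ms per call")
```

A caution on reading these numbers: `cudaStreamSynchronize` blocks until all earlier work on the stream has finished, so the time attributed to a post-memcpy synchronize can include waiting for kernels queued before the copy. That is consistent with the flat serving A/B: removing one small upload does not remove the wait for the kernels behind it.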
+ +Next measured target: + +- H3 is elevated above another W4A16/kernel-shape pass: llama.cpp spends 33.7% + of GPU time in GDN decode versus vLLM's 16.6%, and vLLM remains 1.63x faster + on aggregate decode for the same serving shape. Use existing `GDN_NW` and + `GDN_CPW` controls to grid-search live-width-adaptive GDN launch parameters + before changing source. + +## Phase 6 GDN Narrow-Serving Env Grid + +Artifact: `/home/mudler/bench/phase6_serving_nsys/gdn_grid/`. + +Clean binaries were rebuilt after reverting the rejected sampler experiment. +Grid shape was `n=128`, `ptok=128`, `gen=64` to keep each isolated server run +bounded. + +| Setting | decode tok/s/seq | decode agg tok/s | Decision | +|---------|------------------|------------------|----------| +| default | 3.91 | 647.9 | baseline | +| `GDN_NW=4 GDN_CPW=1` | 3.80 | 628.9 | reject | +| `GDN_NW=8 GDN_CPW=2` | 3.94 | 624.5 | reject | +| `GDN_NW=8 GDN_CPW=4` | 3.91 | 647.6 | reject | +| `GDN_NW=8 GDN_CPW=8` | 4.00 | 636.9 | no material win | +| `GDN_NW=16 GDN_CPW=4` | 3.85 | 637.5 | reject | +| `GDN_NW=16 GDN_CPW=8` | 3.96 | 652.0 | no material win | + +Result: + +- Rejected as an env-only lever. Existing GDN geometry variants are too close in + this serving gate to justify a source change. +- Next focus moves back to the largest differentiating kernel bucket: + llama.cpp's NVFP4 grouped `mul_mat_q` bucket (~30% GPU time) versus vLLM's + Marlin-MoE bucket. + +## Phase 6 MoE MMQ Tile Env Grid + +Artifact: `/home/mudler/bench/phase6_serving_nsys/mmq_grid/`. + +Shape: `n=128`, `ptok=128`, `gen=64`. 
+ +| Setting | decode tok/s/seq | decode agg tok/s | Decision | +|---------|------------------|------------------|----------| +| default | 3.90 | 645.3 | baseline | +| `LLAMA_MOE_AUTO_TILE=0` | 3.90 | 655.3 | tied/no material win | +| `LLAMA_MOE_DECODE_TILE=32` | 3.82 | 635.9 | reject | +| `LLAMA_MOE_DECODE_TILE=48` | 3.81 | 637.3 | reject | +| `LLAMA_MOE_DECODE_TILE=96` | 3.84 | 642.8 | reject | +| `LLAMA_MOE_DECODE_TILE=128` | 3.84 | 640.6 | reject | +| `LLAMA_MOE_MMQ_X=32` | 3.76 | 642.0 | reject; prefill worsened | + +Result: + +- Rejected as an env-only lever. Existing grouped-MMQ tile and auto-selector + knobs do not materially close the serving gap. +- A source patch that only retunes the current tile selector is not justified. + The next useful MoE lever would need a structural change closer to vLLM's + Marlin-MoE/fused-MoE shape, or the work should move to the synchronous + serving input/sampler path with a measurable non-greedy workload. + +## Open Items + +- No current env-only lever clears the serving performance gate. Scope the next + source candidate against either structural MoE decode fusion or async serving + input/sampler uploads, with a workload that proves the target bucket matters. +- Phase 7 must keep the canonical MoE and dense md5 gates as the first + inference-safety check before any performance result is accepted. + +## Phase 7 Source-Candidate Test Gate + +Fork commit `cd56cf037379b084d6bb0ed47db8b785c828be86` added patch +`0051-test-paged-cover-MoE-swiglu-down-chain.patch`. This is a test-only patch; +it does not change the production inference path. + +Fresh DGX gates from `/home/mudler/bench/phase7_source_scope/`: + +- MoE greedy md5: `8cb0ce23777bf55f92f63d0292c756b0`. +- Dense greedy md5: `5951a5b4d624ce891e22ab5fca9bc439`. +- Baseline `MUL_MAT_ID`: `806/806`. +- New `MOE_SWIGLU_DOWN`: `7/7`. 
+ +The new gate covers the merged MoE gate_up -> SWIGLU -> down-projection graph +shape needed before attempting a batched NVFP4 down-input quantization fusion. + +## Phase 7 SWIGLU-Down Fusion Candidate Rejected + +Attempted candidate: fuse `GGML_OP_GLU(SWIGLU)` into the NVFP4 activation +quantization feeding the MoE down-projection `MUL_MAT_ID`, while keeping the +existing grouped-MMQ kernel. The patch was kept behind +`GGML_CUDA_FUSE_SWIGLU_DOWN_MMQ=1` during validation. + +DGX artifacts: + +- `/home/mudler/bench/phase7_source_scope/test_backend_ops_moe_swiglu_down_optin.txt` +- `/home/mudler/bench/phase7_source_scope/test_backend_ops_mul_mat_id_after_optin.txt` +- `/home/mudler/bench/phase7_source_scope/default_gates_after_optin/` +- `/home/mudler/bench/phase7_source_scope/optin_gates/` +- `/home/mudler/bench/phase7_source_scope/serving_ab/` + +Correctness and inference gates: + +- Forced fusion `MOE_SWIGLU_DOWN`: `7/7`. +- Broad default `MUL_MAT_ID`: `806/806`. +- Default md5 after opt-in gating stayed canonical: + - MoE `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense `5951a5b4d624ce891e22ab5fca9bc439`. +- Opt-in fusion md5: + - MoE `07db32c2bcb78d17a43ed18bc22705cd`. + - Dense `5951a5b4d624ce891e22ab5fca9bc439`. + +Serving A/B (`n=128`, `ptok=128`, `gen=64`, `/v1/completions`, `--no-cache`): + +| path | decode tok/s/seq | decode agg tok/s | prefill tok/s | verdict | +|------|------------------|------------------|---------------|---------| +| default | 3.92 | 657.1 | 1456.0 | baseline | +| `GGML_CUDA_FUSE_SWIGLU_DOWN_MMQ=1` | 3.88 | 667.4 | 1462.9 | reject; md5 drift and flat A/B | + +Result: + +- Rejected as a production patch. The opt-in path changes the paged-MoE md5 + into the non-paged namespace and does not materially improve serving. +- Root-cause note for future attempts: the first fused-op gate failed because + the fused quantizer used compact GLU-output strides to read split `gate`/`up` + views. 
Split views stride over the merged gate/up tensor; using source-view + strides fixed the op gate but not the end-to-end md5 drift. + +## Phase 7 Weighted-Combine Test Gate + +Fork commit `3ef7eb9e4d` added patch +`0052-test-paged-cover-MoE-weighted-combine-chain.patch`. This is a test-only +patch; it does not change the production inference path. + +The new `MOE_WEIGHTED_COMBINE` whole-graph gate covers: + +`down MUL_MAT_ID -> router-weight ggml_mul -> rank-ordered expert views/adds`. + +DGX artifact: + +- `/home/mudler/bench/phase7_source_scope/test_backend_ops_moe_weighted_combine_green.txt` + +DGX result: + +- `test-backend-ops test -b CUDA0 -o MOE_WEIGHTED_COMBINE -j 1`: `7/7`. + +This gate is the correctness target for the next candidate: a deterministic +post-down MoE weighted-combine fusion that preserves current f32 product and +rank-order add semantics while avoiding the rejected SWIGLU/FP4-quantization +shortcut. + +## Phase 7 Weighted-Combine Fusion Candidate Rejected + +Attempted candidate: fuse the post-down MoE router-weight multiply and +rank-ordered add fan-in: + +`ffn_moe_down -> ggml_mul(experts, weights) -> VIEW ranks -> ADD fan-in`. + +The candidate was fork-first, default-on during validation, and had a rollback +env switch: `LLAMA_MOE_NO_WEIGHTED_COMBINE_FUSION=1`. + +DGX artifacts: + +- `/home/mudler/bench/phase7_source_scope/test_backend_ops_moe_weighted_combine_orderfix.txt` +- `/home/mudler/bench/phase7_source_scope/test_backend_ops_mul_mat_id_weighted_combine_orderfix.txt` +- `/home/mudler/bench/phase7_source_scope/weighted_combine_orderfix_gates_chat/` +- `/home/mudler/bench/phase7_source_scope/weighted_combine_orderfix_nsys_completion/` +- `/home/mudler/bench/phase7_source_scope/weighted_combine_orderfix_serving_ab/` +- Rejected diff: + `/home/mudler/bench/phase7_source_scope/rejected-phase7-moe-weighted-combine-fusion.diff` + +Correctness and inference gates: + +- `MOE_WEIGHTED_COMBINE`: `7/7`. +- Broad `MUL_MAT_ID`: `806/806`. 
+- Canonical transcript md5: + - MoE `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense `5951a5b4d624ce891e22ab5fca9bc439`. + +Nsight proof: + +- Disabled run: no `k_moe_weighted_combine` kernels. +- Fused run: `110` `k_moe_weighted_combine` launches. + +Serving A/B (`n=128`, `ptok=128`, `gen=64`, `/v1/completions`): + +| path | decode tok/s/seq | decode agg tok/s | prefill tok/s | verdict | +|------|------------------|------------------|---------------|---------| +| `LLAMA_MOE_NO_WEIGHTED_COMBINE_FUSION=1` | 2.63 | 417.5 | 1345.2 | baseline | +| fused default | 2.63 | 417.0 | 1346.9 | reject; kernel fires but A/B is flat | + +Result: + +- Rejected as a production patch. The patch is md5-safe and the kernel fires, + but it does not improve the bounded serving workload. Keep patch `0052` as a + useful regression gate; do not retry this exact fan-in-only fusion unless a + fresh profile shows the weighted/add fan-in as a material bucket. + +## Phase 8 Ragged MoE Dispatch Scope + +Plan: `docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md`. + +The next candidate is profile-gated before source work: + +- Target a fused routed-expert `MUL_MAT_ID` dispatch path for ragged serving + decode, not another post-down fan-in fusion. +- First decompose live llama.cpp and vLLM MoE serving at `n=128`, `ptok=128`, + `gen=64` with Nsight and `/home/mudler/bench/bucket.py`. +- Promote only if `mm_ids_helper`, activation quant/gather, grouped MMQ, or + related MoE dispatch rows are material and not hidden by GDN or FA. +- Keep the backend-sampling/logit-bias upload cache as a non-default follow-up; + it requires `--backend-sampling` and request `backend_sampling: true` with + non-empty `logit_bias` or `ignore_eos`. + +Required promotion gates remain: + +- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`. +- Dense md5 `5951a5b4d624ce891e22ab5fca9bc439`. +- `MUL_MAT_ID`: `806/806` on CUDA0. +- Any fused dispatch prototype must start default-off behind + `LLAMA_MOE_FUSED_DISPATCH=1`. 
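The promotion rule applied throughout Phase 7 can be written down as a small decision function. This is a sketch under stated assumptions: the class and function names are hypothetical, the `+5%` threshold is the serving A/B gate this document cites for Phase 8, and applying it to aggregate decode tok/s (rather than per-sequence tok/s) is an assumption. The two example candidates use the values from the Phase 7 tables.

```python
from dataclasses import dataclass

CANONICAL_MD5 = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}


@dataclass
class Candidate:
    name: str
    moe_md5: str
    dense_md5: str
    mul_mat_id: str           # pass count as reported, e.g. "806/806"
    baseline_agg_tps: float   # decode agg tok/s, baseline arm
    candidate_agg_tps: float  # decode agg tok/s, candidate arm


def all_passed(pass_count: str) -> bool:
    passed, total = (int(part) for part in pass_count.split("/"))
    return total > 0 and passed == total


def promotion_verdict(c: Candidate, min_gain: float = 0.05):
    """Return (promote, reasons); any failed gate blocks promotion."""
    reasons = []
    if c.moe_md5 != CANONICAL_MD5["moe"]:
        reasons.append("MoE md5 drift")
    if c.dense_md5 != CANONICAL_MD5["dense"]:
        reasons.append("dense md5 drift")
    if not all_passed(c.mul_mat_id):
        reasons.append("MUL_MAT_ID failures")
    gain = c.candidate_agg_tps / c.baseline_agg_tps - 1.0
    if gain < min_gain:
        reasons.append(f"serving gain {gain:+.1%} below {min_gain:.0%}")
    return (not reasons, reasons)


swiglu_down = Candidate("swiglu-down opt-in", "07db32c2bcb78d17a43ed18bc22705cd",
                        CANONICAL_MD5["dense"], "806/806", 657.1, 667.4)
weighted_combine = Candidate("weighted-combine", CANONICAL_MD5["moe"],
                             CANONICAL_MD5["dense"], "806/806", 417.5, 417.0)

for cand in (swiglu_down, weighted_combine):
    print(cand.name, promotion_verdict(cand))
```

Checking the md5 gates before the throughput gate mirrors the order the document requires: a candidate that drifts is rejected regardless of speed.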
+ +Profile-gate result: + +- Clean llama.cpp artifact: + `/home/mudler/bench/phase8_ragged_moe_dispatch/llama_n128_clean/`. +- vLLM artifact: + `/home/mudler/bench/phase8_ragged_moe_dispatch/vllm_n128/`. +- A stale first llama profile under `llama_n128/` is intentionally ignored + because the binary still contained the rejected weighted-combine kernel before + the clean-source rebuild. + +Throughput: + +| Engine | decode tok/s/seq | decode agg tok/s | prefill tok/s | +|--------|------------------|------------------|---------------| +| llama.cpp | 2.70 | 412.1 | 1368.3 | +| vLLM | 7.02 | 1036.6 | 5277.7 | + +llama.cpp bucket highlights from the clean profile: + +- GDN: `4680.27 ms`, `38.12%`. +- `mmq_nvfp4`: `2745.11 ms`, `22.36%`. +- `act_quant`: `441.42 ms`, `3.60%`. +- MoE dispatch: `183.67 ms`, `1.50%`. +- `ew_add` fan-in: `280.15 ms`, `2.28%`. + +Decision: + +- Promote to a test-only ragged `MUL_MAT_ID` gate before production source. +- Do not implement fused dispatch yet. Standalone `mm_ids`/`gather_mmq` helper + time is small; a source patch must reduce the larger grouped-MMQ/activation + movement bucket and still beat the `+5%` serving A/B gate. + +## Phase 8 Ragged MoE Dispatch Test Gate + +Fork commit `e21732fc4` added patch +`0053-test-paged-cover-ragged-MoE-dispatch.patch`. This is a test-only patch; +it does not change the production inference path. + +The new `MUL_MAT_ID_RAGGED_MOE` gate covers: + +- one small F32 wiring case, +- NVFP4 with `n_mats=256`, `n_used=8`, `m=768`, `k=2048`, + `n in {1, 8, 33, 128, 257}`, +- deterministic unique top-k ids skewed toward hot experts, including expert + `255`, leaving many experts empty. + +DGX artifact: + +- `/home/mudler/bench/phase8_ragged_moe_dispatch/test_backend_ops_mul_mat_id_ragged_moe_fixed.txt` + +DGX result: + +- `test-backend-ops test -b CUDA0 -o MUL_MAT_ID_RAGGED_MOE -j 1`: `6/6`. 
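The routing shape the gate describes (deterministic, unique ids per token, load skewed toward a few hot experts including expert `255`, most experts empty) can be illustrated with a short generator. This is a hypothetical sketch, not the id generator in patch `0053`: the choice of hot experts, the rotation scheme, and the function name are all assumptions.

```python
from collections import Counter


def ragged_topk_ids(n_tokens, n_experts=256, n_used=8):
    """Deterministic per-token expert ids: unique within a token, skewed across tokens."""
    always_hot = [0, n_experts - 1]                   # chosen by every token
    warm_pool = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89]   # rotated across tokens
    n_rot = n_used - len(always_hot)
    if not (0 <= n_rot <= len(warm_pool)) or n_experts - 1 <= max(warm_pool):
        raise ValueError("shape not supported by this sketch")
    rows = []
    for t in range(n_tokens):
        # Consecutive indices modulo the pool size never repeat within one row.
        rotated = [warm_pool[(t + j) % len(warm_pool)] for j in range(n_rot)]
        rows.append(always_hot + rotated)
    return rows


for n in (1, 8, 33, 128, 257):
    ids = ragged_topk_ids(n)
    load = Counter(e for row in ids for e in row)
    assert all(len(set(row)) == len(row) == 8 for row in ids)
    print(f"n={n}: {len(load)} of 256 experts used, expert 255 load={load[255]}")
```

Keeping the two always-hot experts disjoint from the rotated pool is what guarantees uniqueness within a token while still producing cross-token load skew.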
+ +Debug note: + +- The first version of the gate failed because the deterministic IDs produced + duplicate expert IDs within token 0. That is not a valid top-k routing shape + and caused a CPU/CUDA mismatch followed by a CUDA fault. The committed gate + preserves unique expert IDs per token while keeping cross-token load skew. + +Production-source decision: + +- Do not start a Phase 8 production CUDA patch yet. +- Code inspection found that the existing native-FP4 MoE path already de-dups + broadcast activation quantization when `ne11 == 1`, then gathers FP4 blocks + before grouped MMQ. +- The measured helper rows are small (`mm_ids=0.66%`, `gather_mmq=0.42%`). + A metadata-only fused-dispatch hook would not plausibly clear the `+5%` + serving A/B gate. +- A future source candidate must reduce `mmq_nvfp4` (`22.36%`) or `act_quant` + (`3.60%`) directly, without D2H id readback, new stream synchronizations, or + md5 drift. + +## Phase 9 MTP Draft Smoke Gate + +Phase 9 challenged the older "MTP absent" assumption. The current fork has +Qwen3.5/3.6 `draft-mtp` support and the DGX MoE GGUF contains MTP metadata and +tensors: + +- `qwen35moe.nextn_predict_layers` +- `blk.40.nextn.eh_proj.weight` +- `blk.40.nextn.shared_head_norm.weight` +- `blk.40.nextn.enorm.weight` +- `blk.40.nextn.hnorm.weight` + +Smoke artifacts: + +- Failing default pre-patch: + `/home/mudler/bench/phase9_mtp_smoke/mtp_smoke.err`. +- Passing explicit CPU-sampled draft: + `/home/mudler/bench/phase9_mtp_smoke/mtp_smoke_no_backend_sampling.err`. +- Passing default after patch: + `/home/mudler/bench/phase9_mtp_smoke/mtp_smoke_default_after_patch.err`. + +Finding: + +- `draft-mtp` runs with the current model when backend draft sampling is off. +- The default path previously emitted: + `backend sampling requires at most one output token per sequence (seq_id 0 had 2)`. 
+- Patch `0054-fix-speculative-disable-backend-sampling-for-MTP-drafts.patch` + disables backend draft sampling inside the MTP implementation until the + backend sampler supports multi-output verification batches. + +DGX smoke after patch: + +- `rc=0`. +- Warning emitted: + `backend draft sampling is disabled for MTP`. +- `n_drafted=5`, `n_accept=4`, acceptance `80.000%`. +- Output tail: + `The capital of France is Paris, a city renowned for its rich history`. + +Normal inference gates after patch: + +- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`. +- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`. + +Decision: + +- Keep Phase 9 as an opt-in speculative smoke/fix only. +- Do not enable MTP by default in LocalAI or llama-server. +- Do not benchmark MTP as a parity win until a serving/API phase adds rollback + gates for hybrid SSM/KV state and measures target verification throughput. + +## Phase 14 MTP Rollback and Inference-Safety Gate + +Phase 14 tested the missing safety question from Phase 9: whether MTP +speculative rejection can run against the actual Qwen3.6 MoE GGUF without +corrupting paged KV or recurrent GDN state. + +Artifacts: + +- `/home/mudler/bench/phase14_mtp_rollback/recurrent_rollback.err` +- `/home/mudler/bench/phase14_mtp_rollback/mtp_greedy_equiv.err` +- `/home/mudler/bench/phase14_mtp_rollback/completion_nocnv_n{8,16,24,32,48}.out` +- `/home/mudler/bench/phase14_mtp_rollback/mtp_n{8,16,24,48}.out` +- `/home/mudler/bench/paged_inference_gates/20260701_041117` + +Safety evidence: + +- `test-recurrent-state-rollback` on + `/home/mudler/bench/q36-35b-a3b-nvfp4.gguf` exited `0` and logged + `recurrent rollback checkpoint restored successfully`. +- MTP stderr logged bounded recurrent rollback support: + `the context supports bounded partial sequence removal`. +- MTP partial rejection occurred at `temp=0`: + `n_drafted=39`, `n_accept=20`, `accept=51.282%`. 
+- The backend sampler multi-output error stayed absent; the expected + `backend draft sampling is disabled for MTP` warning was present. +- Raw greedy text was prefix-equivalent after normalization for + `n=8,16,24,32,48`; no first differing token was found. Exact transcript md5 + is not used for this cross-frontend gate because `llama-speculative-simple` + emits accepted token groups and can overrun `llama-completion -no-cnv` for + the same `-n`. + +Normal inference gates after Phase 14: + +- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`. +- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`. +- `MUL_MAT_ID`: `806/806`, `Backend CUDA0: OK`. + +Decision: + +- MTP rollback safety is green enough to scope a Phase 15 serving/API + throughput gate. +- Do not enable MTP by default. +- Do not count MTP as a GB10 speed-parity win until serving results show useful + target-verification throughput under the canonical inference gates. + +## Phase 15 MTP Serving Throughput Gate + +Phase 15 measured the direct `llama-server` serving path after Phase 14 proved +rollback safety. The test compared two same-shape arms: + +- baseline: no speculative decoding, +- MTP: `--spec-type draft-mtp --spec-draft-n-max 3 + --no-spec-draft-backend-sampling`. 
+ +Artifact: + +- `/home/mudler/bench/phase15_mtp_serving/20260701_042005` + +Harness: + +- `backend/cpp/llama-cpp-localai-paged/paged-mtp-serving-bench.sh` +- `NPL="8 32 128" PTOK=128 GEN=128 CTX=131072 PARALLEL=128` +- client: `/home/mudler/bench/h2h_cli3.py` against `/v1/completions` + +Result: + +| arm | n | agg t/s | decode agg t/s | decode per-seq t/s | TTFT mean ms | wall s | +|---|---:|---:|---:|---:|---:|---:| +| baseline | 8 | 192.5 | 247.8 | 30.70 | 1181.1 | 5.318 | +| MTP | 8 | 92.9 | 109.8 | 14.26 | 1691.5 | 11.017 | +| baseline | 32 | 305.4 | 406.0 | 12.02 | 2762.2 | 13.412 | +| MTP | 32 | 95.8 | 111.7 | 3.61 | 4545.6 | 42.727 | +| baseline | 128 | 429.5 | 662.4 | 4.31 | 7747.2 | 38.144 | +| MTP | 128 | 100.3 | 138.5 | 0.97 | 20385.7 | 163.289 | + +MTP did actually run: + +- server initialized `draft-mtp` with bounded partial sequence removal, +- response/server timings included draft counters, +- server log tail included `#gen tokens = 17293`, `#acc tokens = 15493`. + +Normal inference gates before and after the A/B: + +- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`. +- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`. +- `MUL_MAT_ID`: `806/806`, `Backend CUDA0: OK`. + +Decision: + +- Reject current `llama-server` MTP as a GB10 serving parity lever. +- Do not enable MTP by default in LocalAI or llama-server. +- Do not tune `spec-draft-n-max` blindly. The regression is large enough that + the next MTP phase, if any, must start with graph/batch-shape profiling. + +Likely root cause: + +- Baseline serving preserved heavy graph reuse (`graphs reused = 361` in the + `n=128` tail). +- MTP serving showed `graphs reused = 1` and high per-slot eval time at high + concurrency. +- The working hypothesis is that MTP verification/draft batch shape churn + defeats the paged decode graph-reuse wins, so extra verification dominates + despite high draft acceptance. 
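To put numbers on the regression, the MTP-to-baseline ratios can be derived directly from the Phase 15 result table. The snippet below is a small sketch that copies the decode aggregate values and the server log counters from this section; the helper names are hypothetical, and reading `#acc tokens / #gen tokens` as an acceptance ratio is an assumption.

```python
# Decode aggregate tok/s copied from the Phase 15 result table: n -> (baseline, MTP).
DECODE_AGG = {8: (247.8, 109.8), 32: (406.0, 111.7), 128: (662.4, 138.5)}

# Counters from the server log tail quoted in this section.
GEN_TOKENS = 17293
ACC_TOKENS = 15493


def ratio(numerator, denominator):
    """Plain ratio helper with a guard against a zero denominator."""
    if denominator == 0:
        raise ZeroDivisionError("denominator must be non-zero")
    return numerator / denominator


def mtp_vs_baseline(table):
    """Return {n: MTP / baseline} for a {n: (baseline, mtp)} table."""
    return {n: ratio(mtp, base) for n, (base, mtp) in table.items()}


if __name__ == "__main__":
    for n, r in sorted(mtp_vs_baseline(DECODE_AGG).items()):
        print(f"n={n:>3}  MTP/baseline decode agg = {r:.1%}")
    # Assumption: '#acc tokens / #gen tokens' is read as an acceptance ratio.
    print(f"acc/gen = {ratio(ACC_TOKENS, GEN_TOKENS):.1%}")
```

The ratio falls as concurrency rises even though most drafted tokens are accepted, which is consistent with the graph-reuse hypothesis rather than with a draft-quality problem.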
+ +## Phase 16 MTP Graph-Reuse Profile + +Phase 16 profiled the Phase 15 hypothesis with +`nsys --cuda-graph-trace=node` on a smaller direct serving shape: + +- server: `-c 32768 -b 2048 -ub 512 --parallel 32`, +- client: `h2h_cli3.py -n 8 --ptok 64 --gen 64`, +- arms: baseline vs `--spec-type draft-mtp --spec-draft-n-max 3`. + +Artifact: + +- `/home/mudler/bench/phase16_mtp_graph_profile/20260701_043016` + +Result: + +| arm | decode agg t/s | decode per-seq t/s | wall s | graph reuse | +|---|---:|---:|---:|---:| +| baseline | 230.5 | 28.07 | 3.523 | `graphs reused = 62` | +| MTP | 97.7 | 12.83 | 7.049 | `graphs reused = 1` | + +MTP drafted and accepted tokens: + +- `draft acceptance = 0.81481 (44 accepted / 54 generated)`, +- `#gen tokens = 460`, `#acc tokens = 346`. + +Nsight kernel summaries also show materially more GPU work in the MTP run: +roughly `5.89 s` top-level GPU kernel time versus `2.59 s` for the baseline +small profile. + +Decision: + +- Phase 16 supports the Phase 15 root-cause hypothesis: current MTP serving + defeats the paged decode graph-reuse advantage and increases GPU work. +- A future source phase must start at speculative verification batch shapes and + graph-reuse keys, not at MTP draft-length tuning. + +## Phase 10 GDN C32 Slab Baseline and Source Check + +Phase 10 starts a separate GDN prefill path; it does not reopen the rejected +decode `GDN_NW/GDN_CPW` grid. 
+ +Current M5 baseline artifacts: + +- `/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/paged_moe_prefill.txt` +- `/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/paged_dense_prefill.txt` +- `/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/summary_rows.txt` +- `/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/provenance.txt` + +Current M5 baseline: + +| Model | PP | TG | B | S_PP t/s | S_TG t/s | S t/s | +|-------|----|----|---|----------|----------|-------| +| MoE | 512 | 4 | 32 | 2314.18 | 359.16 | 2220.48 | +| MoE | 2048 | 4 | 32 | 2439.95 | 389.43 | 2415.16 | +| Dense | 512 | 4 | 32 | 978.97 | 143.56 | 936.71 | +| Dense | 2048 | 4 | 32 | 1023.61 | 184.09 | 1014.59 | + +Source check: + +- A C32 M5 candidate cannot be implemented as a launcher-only shortcut. +- The current M5 form-T apply path stores one 16-row tile of `U=T*RHS` in + registers, syncs, then overwrites `Ud`. That is safe for `C=16`. +- For `C=32`, a naive two-row-tile loop would overwrite RHS rows before all + output rows are computed, and the current apply call only covers rowbase `0`. +- A correct C32 slab candidate must add a separate staging strategy for all + `C*DV_TILE` U values, then run focused `GATED_DELTA_NET` op gates before any + S_PP comparison. + +Decision: + +- A default-off C32 slab candidate was implemented and rejected by the + performance gate. +- The candidate was correctness-clean only after fixing a tail-chunk staging + bug: rows `t >= Cc` in the staged `U=T*RHS` copy-back must be zeroed before + state/output math. Before that fix, the dense gate produced a degenerate + transcript even though the focused op gate passed. +- After the tail fix, both default and forced-C32 modes matched the canonical + md5 gates exactly: + - MoE: `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense: `5951a5b4d624ce891e22ab5fca9bc439`. +- KL was not needed because md5 stayed stable after the tail fix. 
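The tail-chunk bug described above can be reproduced in miniature. The sketch below is plain Python, not the CUDA kernel: `zero_tail_rows`, `column_sums`, and the toy tile are hypothetical names and data chosen for illustration. It assumes a fixed `C`-row staging tile in which only the first `Cc` rows of a short final chunk are valid, and shows how stale rows `t >= Cc` leak into a reduction unless they are zeroed first.

```python
def zero_tail_rows(staged_u, cc):
    """Return a copy of the staged tile with every row t >= cc set to zero."""
    if not 0 <= cc <= len(staged_u):
        raise ValueError("cc must be within the tile height")
    return [list(row) if t < cc else [0.0] * len(row)
            for t, row in enumerate(staged_u)]


def column_sums(tile):
    """Stand-in for the state/output math: reduce over all rows of the tile."""
    return [sum(col) for col in zip(*tile)]


if __name__ == "__main__":
    C, DV, Cc = 4, 2, 3          # toy sizes; the real candidate used C=32
    tile = [[1.0, 2.0] for _ in range(C)]
    tile[3] = [99.0, 99.0]       # stale values left over from a previous chunk

    print("without fix:", column_sums(tile))
    print("with fix:   ", column_sums(zero_tail_rows(tile, Cc)))
```

A focused op test that only uses full chunks never reads the tail rows, which may explain why the op gate passed while the dense transcript degenerated; that reading is an inference from the text, not a recorded finding.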
+ +Correctness artifacts: + +- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gated_delta_net_default_after_tailfix.txt` +- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gated_delta_net_c32_slab_after_tailfix.txt` +- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_moe_default_after_tailfix.md5` +- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_dense_default_after_tailfix.md5` +- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_moe_c32_after_tailfix.md5` +- `/home/mudler/bench/phase10_gdn_c32_slab/gates/gate_dense_c32_after_tailfix.md5` + +Performance A/B artifacts: + +- `/home/mudler/bench/phase10_gdn_c32_slab/ab/moe_base.txt` +- `/home/mudler/bench/phase10_gdn_c32_slab/ab/moe_c32.txt` +- `/home/mudler/bench/phase10_gdn_c32_slab/ab/dense_base.txt` +- `/home/mudler/bench/phase10_gdn_c32_slab/ab/dense_c32.txt` + +Performance A/B: + +| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s | +|-------|------|----|----|---|----------|----------|-------| +| MoE | M5 base | 512 | 4 | 32 | 2323.48 | 397.57 | 2239.39 | +| MoE | C32 slab | 512 | 4 | 32 | 2069.12 | 357.43 | 1995.06 | +| MoE | M5 base | 2048 | 4 | 32 | 2430.32 | 388.29 | 2405.66 | +| MoE | C32 slab | 2048 | 4 | 32 | 2054.86 | 388.01 | 2037.79 | +| Dense | M5 base | 512 | 4 | 32 | 975.10 | 140.53 | 932.19 | +| Dense | C32 slab | 512 | 4 | 32 | 866.29 | 144.03 | 833.87 | +| Dense | M5 base | 2048 | 4 | 32 | 1019.25 | 183.25 | 1010.26 | +| Dense | C32 slab | 2048 | 4 | 32 | 903.73 | 183.47 | 896.86 | + +Rejected diff: + +- `/home/mudler/bench/phase10_gdn_c32_slab/rejected/c32_slab_tailfix_rejected.diff` + +Conclusion: + +- Do not ship Phase 10 C32 slab as implemented. +- C32 slab is not a maintainable shortcut toward parity because duplicated + A/T recomputation per value slab outweighs the intended state-traffic + reduction. 
+- A future GDN prefill attempt should either share the `A/T` work across value + slabs or switch to a different FLA-style chunk design; it should not repeat + this env-gated two-slab M5 variant. + +## Phase 11 GDN M5 QS-Early Rejection + +Phase 11 tested a smaller C=16 M5 scheduling shortcut instead of reopening C32: +move the `QS = Qc * S0` state-boundary tensor-core pass earlier and keep it +default-off behind `GDN_M5_QS_EARLY=1`. + +Correctness artifacts: + +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gated_delta_net_default.txt` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gated_delta_net_qs_early.txt` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gate_moe_default.md5` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gate_dense_default.md5` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gate_moe_qs_early.md5` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gate_dense_qs_early.md5` + +Correctness result: + +- Default and QS-early paths matched canonical md5 exactly: + - MoE `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense `5951a5b4d624ce891e22ab5fca9bc439`. +- KL was not needed. 
+ +Performance artifacts: + +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/moe_base.txt` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/moe_qs_early.txt` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/dense_base.txt` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/dense_qs_early.txt` + +Performance A/B: + +| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s | +|-------|------|----|----|---|----------|----------|-------| +| MoE | M5 base | 512 | 4 | 32 | 2325.67 | 355.60 | 2229.90 | +| MoE | QS-early | 512 | 4 | 32 | 2315.77 | 353.27 | 2220.16 | +| MoE | M5 base | 2048 | 4 | 32 | 2441.54 | 390.53 | 2416.80 | +| MoE | QS-early | 2048 | 4 | 32 | 2420.26 | 389.89 | 2395.94 | +| Dense | M5 base | 512 | 4 | 32 | 975.15 | 142.71 | 932.97 | +| Dense | QS-early | 512 | 4 | 32 | 968.23 | 144.24 | 927.17 | +| Dense | M5 base | 2048 | 4 | 32 | 1021.06 | 183.34 | 1012.04 | +| Dense | QS-early | 2048 | 4 | 32 | 1015.77 | 183.73 | 1006.88 | + +Rejected diff: + +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/rejected/qs_early_rejected.diff` + +Conclusion: + +- Do not ship Phase 11 QS-early as implemented. +- Merely moving the QS state-boundary product earlier is not enough; it remains + an extra MMA pass and does not reduce the M5 critical path. +- The next GDN attempt should skip local scheduling-only changes and scope a + true shared-A/Ai blocked-solve or global-scratch design, with an explicit + scratch/synchronization cost model before coding. + +## Phase 12 GDN Shared-A/Ai Cost Model + +Phase 12 evaluated whether a real shared-A/Ai design is credible enough to +prototype after the C32 slab and QS-early shortcut rejections. 
+ +Cost-model doc: + +- `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md` + +Metadata artifact: + +- `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt` + +Model dimensions: + +| Model | GDN layers | H | S_v | Metadata basis | +|-------|------------|---|-----|----------------| +| MoE | 30 inferred | 32 inferred | 128 | `ssm.inner_size=4096`, `ssm.state_size=128` | +| Dense | 48 inferred | 48 inferred | 128 | `ssm.inner_size=6144`, `ssm.state_size=128` | + +Dynamic-smem result for `S_v=128`: + +| Shape | Bytes | KiB | Fits GB10 dynamic smem? | +|-------|-------|-----|-------------------------| +| C16 full-width | 93,376 | 91.19 | yes | +| C32 full-width | 127,360 | 124.38 | no | +| C32 slab64 + U staging | 94,592 | 92.38 | yes | + +Ai scratch result at `npp=2048,npl=32,BT=32,f32`: + +| Model | Ai scratch MiB | 3x Ai traffic MiB | +|-------|----------------|-------------------| +| MoE | 256.0 | 768.0 | +| Dense | 384.0 | 1152.0 | + +Decision: + +- GO for a default-off Phase 13 global-Ai32 prototype. +- Constraints: `BT=32`, f32 Ai, two `dv_tile=64` slabs, `GDN_GLOBAL_AI32=1`. +- The prototype must be rejected if it is flat or slower; do not iterate into + f16/BF16 Ai unless f32 proves the schedule can win. + +## Phase 13 GDN Global-Ai32 Prototype Rejection + +Phase 13 implemented the Phase 12 design in the llama.cpp fork as a default-off +prototype behind `GDN_GLOBAL_AI32=1`. + +Implementation summary: + +- Added an f32 Ai precompute kernel. +- Added C32, `dv_tile=64` slab consumption through the chunked GDN path. +- Allocated Ai scratch from the ggml CUDA pool only for supported calls. +- Kept the default C16 M5 path unchanged. 
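The Phase 12 table values can be cross-checked with a few lines of arithmetic. The formula below is a reconstruction that happens to reproduce the table, not something taken from `GDN_SHARED_AI_COST_MODEL.md`: it assumes one `BT x BT` f32 Ai block per chunk, per head, per sequence, for a single GDN layer, and takes `H` from the inferred model dimensions. The function names are hypothetical.

```python
MIB = 1024 * 1024


def ai_scratch_bytes(npp, npl, heads, bt=32, elem_bytes=4):
    """Bytes for one BT x BT Ai block per chunk, per head, per sequence."""
    chunks_per_seq = -(-npp // bt)  # ceiling division
    return chunks_per_seq * npl * heads * bt * bt * elem_bytes


def to_kib(n_bytes):
    """Convert a byte count to KiB."""
    return n_bytes / 1024


if __name__ == "__main__":
    for name, heads in (("MoE", 32), ("Dense", 48)):
        scratch = ai_scratch_bytes(npp=2048, npl=32, heads=heads)
        print(f"{name}: Ai scratch {scratch / MIB:.1f} MiB, "
              f"3x traffic {3 * scratch / MIB:.1f} MiB")
    for shape, n_bytes in (("C16 full-width", 93376),
                           ("C32 full-width", 127360),
                           ("C32 slab64 + U staging", 94592)):
        print(f"{shape}: {to_kib(n_bytes):.2f} KiB")
```

Under these assumptions the scratch grows linearly with `npp`, `npl`, and `H`, so the dense model's larger head count accounts for its 1.5x larger scratch.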
+ +Correctness artifacts: + +- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gated_delta_net_default.txt` +- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gated_delta_net_global_ai32.txt` +- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gate_moe_default.md5` +- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gate_dense_default.md5` +- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gate_moe_global_ai32.md5` +- `/home/mudler/bench/phase13_gdn_global_ai32/gates/gate_dense_global_ai32.md5` + +Correctness result: + +- Default and Global-Ai32 paths matched canonical md5 exactly: + - MoE `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense `5951a5b4d624ce891e22ab5fca9bc439`. +- KL was not needed. + +Performance artifacts: + +- `/home/mudler/bench/phase13_gdn_global_ai32/ab/moe_base.txt` +- `/home/mudler/bench/phase13_gdn_global_ai32/ab/moe_global_ai32.txt` +- `/home/mudler/bench/phase13_gdn_global_ai32/ab/dense_base.txt` +- `/home/mudler/bench/phase13_gdn_global_ai32/ab/dense_global_ai32.txt` + +Performance A/B: + +| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s | +|-------|------|----|----|---|----------|----------|-------| +| MoE | M5 base | 512 | 4 | 32 | 2325.86 | 396.05 | 2241.21 | +| MoE | Global Ai32 | 512 | 4 | 32 | 2106.50 | 398.55 | 2038.78 | +| MoE | M5 base | 2048 | 4 | 32 | 2425.10 | 389.63 | 2400.66 | +| MoE | Global Ai32 | 2048 | 4 | 32 | 2097.76 | 388.40 | 2079.92 | +| Dense | M5 base | 512 | 4 | 32 | 970.62 | 149.89 | 931.10 | +| Dense | Global Ai32 | 512 | 4 | 32 | 876.51 | 149.29 | 844.62 | +| Dense | M5 base | 2048 | 4 | 32 | 1016.14 | 182.16 | 1007.15 | +| Dense | Global Ai32 | 2048 | 4 | 32 | 918.19 | 183.00 | 911.05 | + +Rejected diff: + +- `/home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff` + +Conclusion: + +- Do not ship Phase 13 Global-Ai32 as implemented. +- The global scratch split is correctness-safe but slower than shipped C16 M5. 
+- Per the Phase 12/13 decision rule, stop GDN kernel work on GB10. The remaining + vLLM GDN advantage requires a fuller FLA-style blocked solve or hardware + assumptions that do not fit this GB10 patch stack without a regression. + +## Phase 8 Ragged MoE Dispatch Safety Rerun + +Phase 8 had already closed the live ragged MoE helper path by profile: +`mm_ids=0.66%`, `gather_mmq=0.42%`, while `mmq_nvfp4=22.36%` and +`act_quant=3.60%`. The only source patch kept from the phase is the test gate +(`0053-test-paged-cover-ragged-MoE-dispatch.patch`); the metadata-only +`LLAMA_MOE_FUSED_DISPATCH` shortcut is rejected. + +Rerun artifacts: + +- `/home/mudler/bench/phase8_ragged_moe_dispatch/ragged_gate_rerun_20260701_035529.txt` +- `/home/mudler/bench/phase8_ragged_moe_dispatch/safety_rerun_20260701_035549/` + +Safety result: + +- `MUL_MAT_ID_RAGGED_MOE`: `6/6` on CUDA0. +- Full `MUL_MAT_ID`: `806/806` on CUDA0. +- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`. +- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`. + +Conclusion: + +- The inferencing gates remain canonical on the unchanged production path. +- Do not add a metadata/helper-only fused-dispatch hook. A future Phase 8 + production candidate must reduce `mmq_nvfp4` or activation movement directly, + stay free of D2H id readback and new stream synchronizations, and then pass + the same md5/op gates before any serving A/B is considered. + +## Phase 18 MTP Shape Trace + +Phase 18 implemented the Phase 17 instrumentation-only recommendation as +patch `0055-feat-server-trace-speculative-batch-shapes.patch`. + +Implementation summary: + +- Added default-off `LLAMA_SPEC_SHAPE_TRACE=1` logging in + `server_slot::handle_last_sampled_token()`. +- Normal decode logs one row/output per slot. +- MTP verification logs `K + 1` rows/outputs per speculative slot, including + draft length and `slot.spec_i_batch` range. +- No scheduler, graph-key, KV, logits, acceptance, or rollback behavior changed. 
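The `K + 1` relationship is straightforward to check mechanically once trace lines are captured. The parser below is a hypothetical sketch, not part of patch 0055: it assumes the space-separated `key=value` layout seen in the green trace sample recorded for this phase, and the function names are invented for illustration.

```python
def parse_spec_shape(line):
    """Parse one 'spec shape:' trace line into a dict, or return None."""
    prefix = "spec shape:"
    line = line.strip()
    if not line.startswith(prefix):
        return None
    fields = {}
    for token in line[len(prefix):].split():
        key, sep, value = token.partition("=")
        if not sep:
            continue  # ignore tokens that are not key=value pairs
        fields[key] = int(value) if value.isdigit() else value
    return fields


def is_k_plus_one(fields):
    """True when a verify step has rows == draft length K plus one."""
    rows, draft = fields.get("rows"), fields.get("draft")
    return (fields.get("kind") == "verify"
            and isinstance(rows, int) and isinstance(draft, int)
            and rows == draft + 1)


if __name__ == "__main__":
    sample = [
        "spec shape: kind=verify batch_before=0 rows=4 outputs=4 draft=3 "
        "spec_i_first=0 spec_i_last=3 pos0=5 slot_tokens=5",
        "spec shape: kind=verify batch_before=0 rows=3 outputs=3 draft=2 "
        "spec_i_first=0 spec_i_last=2 pos0=9 slot_tokens=9",
        "unrelated server log line",
    ]
    for line in sample:
        fields = parse_spec_shape(line)
        if fields is not None:
            print(fields["rows"], fields["draft"], is_k_plus_one(fields))
```

Lines without the `spec shape:` prefix are skipped, so the same function can be run over a full server log.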
+ +Red/green trace artifacts: + +- Red check before patch: `/home/mudler/bench/phase18_mtp_shape_trace_red` +- Green check after patch: `/home/mudler/bench/phase18_mtp_shape_trace_green` + +Green trace sample: + +```text +spec shape: kind=verify batch_before=0 rows=4 outputs=4 draft=3 spec_i_first=0 spec_i_last=3 pos0=5 slot_tokens=5 +spec shape: kind=verify batch_before=0 rows=4 outputs=4 draft=3 spec_i_first=0 spec_i_last=3 pos0=6 slot_tokens=6 +spec shape: kind=verify batch_before=0 rows=3 outputs=3 draft=2 spec_i_first=0 spec_i_last=2 pos0=9 slot_tokens=9 +``` + +Disabled-env check: + +- `LLAMA_SPEC_SHAPE_TRACE` unset emitted no `spec shape:` lines. + +Inference gate artifact: + +- `/home/mudler/bench/phase18_mtp_shape_trace_green/gate_after` + +Safety result: + +- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`. +- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`. +- Full `MUL_MAT_ID`: `806/806` on CUDA0. + +Conclusion: + +- Patch 0055 is safe instrumentation and does not break inferencing on the + canonical gated paths. +- The trace confirms per-step MTP verification shape variation even in a tiny + request (`rows=4` and `rows=3`). +- A follow-up scheduler experiment is not yet justified. First use this trace + under real serving load to measure draft-length bucket entropy. + +## Phase 19 MTP Serving Shape Entropy + +Phase 19 ran Phase 18's shape trace under the direct serving harness with +`LLAMA_SPEC_SHAPE_TRACE=1`, `NPL="8 32 128"`, `GEN=64`, and `PTOK=128`. + +Artifact: + +- `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534` + +Pre/post gate result: + +- Pre-gate and post-gate both passed. +- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`. +- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`. +- Full `MUL_MAT_ID`: `806/806` on CUDA0. 
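Draft-length bucket entropy, the quantity Phase 18 asked for, can be computed from a per-slot draft histogram. The sketch below uses Shannon entropy in bits, which is an assumption: the harness summaries may define entropy differently. The histograms are the per-slot draft counts reported for this phase, and the function names are hypothetical.

```python
import math


def draft_entropy_bits(counts):
    """Shannon entropy (bits) of a {draft_length: slot_count} histogram."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("histogram is empty")
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)


def top_share(counts):
    """Fraction of slots that fall in the most common draft-length bucket."""
    total = sum(counts.values())
    if total <= 0:
        raise ValueError("histogram is empty")
    return max(counts.values()) / total


if __name__ == "__main__":
    windows = {
        "n8": {1: 4, 2: 2, 3: 156},
        "n32": {1: 8, 2: 11, 3: 591},
        "n128": {1: 40, 2: 49, 3: 2264},
    }
    for name, counts in windows.items():
        print(f"{name}: slots={sum(counts.values())} "
              f"top_share={top_share(counts):.1%} "
              f"entropy={draft_entropy_bits(counts):.3f} bits")
```

For comparison, three equally likely draft lengths would give `log2(3)`, about 1.585 bits, so values far below that indicate a draft length that is already nearly constant.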
+ +Serving A/B: + +| n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms | +|---|---------------------|----------------|----------------|------------------|-------------| +| 8 | 245.0 | 95.7 | 39.1% | 1147.2 | 1633.4 | +| 32 | 409.2 | 110.0 | 26.9% | 2710.0 | 4471.5 | +| 128 | 697.2 | 154.0 | 22.1% | 7601.5 | 20310.4 | + +Shape entropy summaries: + +- `shape_entropy_summary.tsv` +- `step_shape_summary.tsv` + +Per-slot draft distribution: + +| window | verify slots | draft counts | top draft share | unique `batch_before` | +|--------|--------------|--------------|-----------------|-----------------------| +| n8 | 162 | `{1: 4, 2: 2, 3: 156}` | 96.3% | 15 | +| n32 | 610 | `{1: 8, 2: 11, 3: 591}` | 96.9% | 96 | +| n128 | 2353 | `{1: 40, 2: 49, 3: 2264}` | 96.2% | 479 | + +Per-step aggregate shape: + +| window | steps | unique total rows | top full-shape rows | +|--------|-------|-------------------|---------------------| +| n8 | 26 | 12 | `32` rows for 14 steps | +| n32 | 32 | 20 | `128` rows for 13 steps | +| n128 | 37 | 34 | `512` rows for 4 steps | + +Decision: + +- Do not implement the Phase 20 group/defer-by-draft scheduler shortcut on this + evidence. +- Draft length is already stable (`draft=3` is >96% of verify slots), yet MTP + still regresses decode throughput hard and worsens TTFT. +- The residual shape churn is dominated by active-slot/tail churn and the MTP + `K + 1` verification-row expansion, not mixed draft lengths. +- Any future MTP parity work needs a deeper target-verify graph/state design, + not a small server scheduling shortcut. + +## Phase 20 Current-Stack Serving Snapshot + +Phase 20 refreshed the MoE paged-vs-vLLM serving baseline on the current clean +DGX mirror after the MTP investigation. 
+ +Artifact: + +- `/home/mudler/bench/phase20_current_snapshot/20260701_050621` + +Current source: + +- `/home/mudler/llama-phase6-source` +- `f2521ab12 feat(server): trace speculative batch shapes` + +Pre/post gate result: + +- Pre-gate and post-gate both passed. +- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`. +- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`. +- Full `MUL_MAT_ID`: `806/806` on CUDA0. + +Serving snapshot: + +| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg | +|---|------------------|-----------------|-------------------|-----------|----------|----------------| +| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% | +| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% | +| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% | + +Latency/prefill snapshot: + +| n | paged TTFT ms | vLLM TTFT ms | paged/vLLM TTFT | paged prefill_tps | vLLM prefill_tps | +|---|---------------|--------------|------------------|--------------------|------------------| +| 8 | 783.6 | 271.8 | 2.88x | 1669.9 | 4371.5 | +| 32 | 2630.6 | 783.8 | 3.36x | 1712.8 | 5358.3 | +| 128 | 7678.7 | 2465.7 | 3.11x | 1660.4 | 5242.9 | + +Decision: + +- The latest clean stack is still not at vLLM serving parity on GB10. +- The user-visible gap is dominated by prefill/TTFT and e2e serving throughput; + no MTP or scheduler shortcut remains open that would close it. +- Keep MTP scheduler work closed. The next credible parity path is either a + datacenter-Blackwell rerun or a larger fused-kernel project outside the + low-conflict GB10 patch stack. + +## Phase 21 Current-Stack Serving Harness + +Phase 21 made the Phase 20 current-stack serving snapshot repeatable from the +LocalAI backend tree. 
+ +New script: + +- `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + +Purpose: + +- targets the clean `~/llama-phase6-source` mirror by default; +- rejects busy docker, `local-ai-worker`, GPU compute, or owned GPU-lock state; +- builds the current llama.cpp targets; +- runs pre/post `paged-inference-gates.sh`; +- runs paged and vLLM serving arms with the same h2h client; +- writes paged/vLLM ratio summaries. + +Verification: + +- local `bash -n` passed; +- local `--help` passed; +- DGX `DRY_RUN=1` validated required paths and preflight without launching + servers. + +Dry-run artifact: + +- `/home/mudler/bench/phase21_harness_dryrun/20260701_051757` + +Decision: + +- Use `paged-current-serving-snapshot.sh` for future current-stack GB10 serving + snapshots. +- Do not use stale DGX `~/bench/combined_definitive.sh` without porting it to + `~/llama-phase6-source` and the owner-file lock discipline. + +## Phase 22 Patch-Series Mirror Invariant + +Phase 22 verified that the LocalAI on-disk paged patch series still reconstructs +the canonical llama.cpp fork tree after patch `0055`. + +Method: + +- Create a fresh worktree at Makefile pin + `0ed235ea2c17a19fc8238668653946721ed136fd`. +- Apply every `backend/cpp/llama-cpp-localai-paged/patches/paged/0*.patch` with + strict `git apply`, matching the LocalAI build path. +- Stage the result and compare `git write-tree` with the fork branch HEAD tree. + +Result: + +```text +base=0ed235ea2c17a19fc8238668653946721ed136fd +applied_tree=5bdbf8ea3d750fe6fa1f85175fd6357d36222edb +fork_tree=5bdbf8ea3d750fe6fa1f85175fd6357d36222edb +``` + +Decision: + +- The patch series is drift-free against fork branch `localai-paged` at + `fb9402661 feat(server): trace speculative batch shapes`. + +## Phase 24 Snapshot Hardware Report + +Phase 24 made the current-stack serving harness record hardware identity before +any server starts. 
This keeps GB10/workstation Blackwell evidence separate from +future datacenter-Blackwell reruns. + +Script change: + +- `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` now + writes `hardware.txt` after preflight and before the `DRY_RUN=1` exit. + +Recorded fields: + +- `nvidia-smi -L`; +- `nvidia-smi --query-gpu=name,driver_version,memory.total,compute_cap`, with + fallback to name/driver/memory if `compute_cap` is unavailable; +- `gpu_name`; +- `hardware_class`; +- parity note for that hardware class. + +Verification: + +- local `bash -n` passed; +- local `--help` passed; +- DGX `DRY_RUN=1` validated preflight and wrote `hardware.txt` without launching + servers. + +Dry-run artifact: + +- `/home/mudler/bench/phase24_hardware_report_dryrun/20260701_052741` + +DGX hardware result: + +```text +GPU 0: NVIDIA GB10 +driver=580.159.03 +compute_cap=12.1 +hardware_class=gb10_or_workstation_blackwell +``` + +Decision: + +- Future snapshot artifacts are self-describing enough to prevent accidental + GB10-to-datacenter generalization. +- The Phase 20 GB10 closure still applies to `gb10_or_workstation_blackwell`; + datacenter Blackwell needs a fresh run of the same methodology. + +## Phase 25 Snapshot Gate Summary + +Phase 25 made current-stack serving artifacts self-auditing for the inference +gates that protect the paged path. + +Script change: + +- `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` now + writes `gate_summary.tsv` after the post gate in a full run. +- The script also supports `--summarize-gates ART` to generate the same summary + from existing `gate_pre/` and `gate_post/` artifacts without launching + servers. + +Recorded rows: + +- pre/post MoE transcript md5 versus + `8cb0ce23777bf55f92f63d0292c756b0`; +- pre/post dense transcript md5 versus + `5951a5b4d624ce891e22ab5fca9bc439`; +- pre/post backend op rows, currently `MUL_MAT_ID`, with the parsed passed/total + count. 
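The recorded rows can be pictured as a small pass/fail table builder. The sketch below is hypothetical and is not the logic in `paged-current-serving-snapshot.sh`: the function names, the `fail` status string, and the sample op log line are assumptions, while the expected md5 values are the canonical ones quoted in this guide.

```python
import re

# Canonical transcript md5 values quoted throughout this guide.
EXPECTED_MD5 = {
    "moe_md5": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense_md5": "5951a5b4d624ce891e22ab5fca9bc439",
}


def md5_row(phase, check, actual):
    """Compare an observed transcript md5 with the canonical value."""
    status = "ok" if actual == EXPECTED_MD5[check] else "fail"
    return (phase, check, status, actual)


def op_row(phase, op_name, log_text):
    """Parse the first 'passed/total' pair found in an op-test log."""
    match = re.search(r"(\d+)/(\d+)", log_text)
    if match is None:
        return (phase, f"op_{op_name}", "fail", "n/a")
    passed, total = int(match.group(1)), int(match.group(2))
    status = "ok" if total > 0 and passed == total else "fail"
    return (phase, f"op_{op_name}", status, f"{passed}/{total}")


if __name__ == "__main__":
    rows = []
    for phase in ("pre", "post"):
        rows.append(md5_row(phase, "moe_md5", EXPECTED_MD5["moe_md5"]))
        rows.append(md5_row(phase, "dense_md5", EXPECTED_MD5["dense_md5"]))
        # Hypothetical log line; only the N/M pair matters to the parser.
        rows.append(op_row(phase, "MUL_MAT_ID", "806/806 tests passed"))
    for row in rows:
        print("\t".join(row))
```

A full run would yield six rows, three for the pre gate and three for the post gate, which matches the shape of the backfilled summary recorded for this phase.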
+ +Verification: + +- Red check: Phase 20 initially had gate artifacts but no `gate_summary.tsv`. +- local `bash -n` passed; +- local `--help` passed; +- DGX `--summarize-gates` against Phase 20 wrote six green rows; +- DGX `DRY_RUN=1` validated the normal path still preflights and writes + `hardware.txt` without launching servers or writing a gate summary before + gates exist. + +Artifacts: + +- Backfilled summary: + `/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv` +- Dry run: + `/home/mudler/bench/phase25_gate_summary_dryrun/20260701_053353` + +Backfilled Phase 20 gate summary: + +```text +pre moe_md5 ok 8cb0ce23777bf55f92f63d0292c756b0 +pre dense_md5 ok 5951a5b4d624ce891e22ab5fca9bc439 +pre op_MUL_MAT_ID ok 806/806 +post moe_md5 ok 8cb0ce23777bf55f92f63d0292c756b0 +post dense_md5 ok 5951a5b4d624ce891e22ab5fca9bc439 +post op_MUL_MAT_ID ok 806/806 +``` + +Decision: + +- Future full serving snapshots carry compact proof that inference md5/op gates + stayed green before and after the paged-vs-vLLM run. +- Treat `gate_summary.tsv` plus `hardware.txt` as the quick audit surface before + accepting a parity snapshot. + +## Phase 26 Audited Current-Stack Serving Snapshot + +Phase 26 ran a full current-stack paged-vs-vLLM MoE serving snapshot with the +Phase 24/25 audit files enabled. 
+ +Artifact: + +- `/home/mudler/bench/phase26_audited_snapshot/20260701_053650` + +Current source: + +- `/home/mudler/llama-phase6-source` +- `f2521ab12 feat(server): trace speculative batch shapes` + +Hardware report: + +- `hardware_class=gb10_or_workstation_blackwell` +- `GPU 0: NVIDIA GB10` +- driver `580.159.03` +- compute capability `12.1` + +Pre/post gate summary: + +| phase | check | status | actual | +|-------|-------|--------|--------| +| pre | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| pre | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| pre | `MUL_MAT_ID` | ok | `806/806` | +| post | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| post | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| post | `MUL_MAT_ID` | ok | `806/806` | + +Serving snapshot: + +| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg | +|---|------------------|-----------------|-------------------|-----------|----------|----------------| +| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% | +| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% | +| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% | + +Latency/prefill snapshot: + +| n | paged TTFT ms | vLLM TTFT ms | paged/vLLM TTFT | paged prefill_tps | vLLM prefill_tps | +|---|---------------|--------------|------------------|--------------------|------------------| +| 8 | 778.6 | 271.1 | 2.87x | 1679.9 | 4485.6 | +| 32 | 2607.4 | 749.4 | 3.48x | 1698.8 | 5427.8 | +| 128 | 7569.6 | 2534.3 | 2.99x | 1668.7 | 5122.0 | + +vLLM startup notes: + +- vLLM selected the expected GB10 backend mix: FlashInfer FP8 projection + kernels, Triton/FLA GDN prefill, FlashAttention, and MARLIN NVFP4 MoE. +- Startup was long because the server loaded three checkpoint shards, loaded + cached torch-compile graphs, ran FlashInfer fp8 GEMM autotuning, and captured + CUDA graphs before the API became ready. 
+ +Decision: + +- The audited current stack still is not at vLLM serving parity on GB10. +- The Phase 20 conclusion is reproduced with stronger audit artifacts: + `hardware.txt`, `gate_summary.tsv`, pre/post full gates, and same-session + paged/vLLM ratios. +- Current paged/vLLM decode ratios remain about `81.5%` at n8, `69.0%` at n32, + and `65.7%` at n128; e2e aggregate ratios remain about `70.6%`, `54.6%`, + and `49.4%`. + +## Phase 27 Graph-Node-Traced Current-Stack Serving Profile + +Phase 27 re-profiled the current clean llama.cpp serving path with CUDA graph +node tracing enabled. This checks the Phase 8 bucket picture against the decode +profiling rule: serving/decode profiles must use `--cuda-graph-trace=node`. + +Artifact: + +- `/home/mudler/bench/phase27_graph_node_serving/20260701_055519` + +Source and hardware: + +- `/home/mudler/llama-phase6-source` +- `f2521ab12 feat(server): trace speculative batch shapes` +- `GPU 0: NVIDIA GB10`, driver `580.159.03`, compute capability `12.1` +- Nsight Systems `2025.3.2.474-253236389321v0` + +Safety gates: + +| phase | check | status | actual | +|-------|-------|--------|--------| +| pre | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| pre | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| pre | `MUL_MAT_ID` | ok | `806/806` | +| post retry | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| post retry | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| post retry | `MUL_MAT_ID` | ok | `806/806` | + +The first immediate post-gate attempt raced with Nsight teardown and rejected +the run because it detected one compute process even though `nvidia-smi` already +printed no running processes. The post-gate retry started from `docker=0`, +`local_ai_worker=0`, `compute=0`, and a `FREE` owner file. 
+ +Serving sample (`n=128`, `PTOK=128`, `GEN=64`): + +| agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | TTFT mean ms | +|---------|----------------|--------------------|-------------|--------------| +| 319.9 | 675.5 | 3.9 | 1671.1 | 8363.4 | + +This matches Phase 26's n128 paged decode rate (`673.4` decode_agg_tps) closely +enough to treat the profile as representative for bucket direction. + +Graph-node-traced kernel buckets: + +| macro bucket | time ms | share | +|--------------|---------|-------| +| GDN | 6706.33 | 33.47% | +| MoE/FFN-GEMM | 5871.92 | 29.31% | +| bf16-proj | 2725.07 | 13.60% | +| layout-copy | 1309.99 | 6.54% | +| ew-mul(weight/norm/GDN) | 724.29 | 3.61% | +| act-quant | 697.75 | 3.48% | +| norms/residual | 405.29 | 2.02% | +| ew-add(resid/MoE-fanin) | 361.81 | 1.81% | +| MoE-dispatch | 275.99 | 1.38% | +| FA | 271.03 | 1.35% | + +Fine buckets: + +- `gdn_core`: `5929.85 ms` (`29.59%`) +- `mmq_nvfp4`: `5697.79 ms` (`28.44%`) +- `cublas_bf16_gemm`: `1892.81 ms` (`9.45%`) +- `act_quant`: `697.75 ms` (`3.48%`) +- `mm_ids`: `121.99 ms` (`0.61%`) +- `gather_mmq`: `73.88 ms` (`0.37%`) +- `argsort_topk`: `80.11 ms` (`0.40%`) + +Decision: + +- The graph-node-traced current-stack profile confirms the Phase 8 source + shortcut decision. Metadata/helper work is still too small: `mm_ids`, + `gather_mmq`, and `argsort_topk` together are about `1.38%`. +- A credible GB10 source patch would have to reduce `gdn_core` or + `mmq_nvfp4`/bf16 projection work directly. The low-conflict helper-dispatch + path still should not be reopened. +- The serving profile does not change the Phase 26 parity verdict: n128 paged + decode remains about `675 tok/s`, far below vLLM's same-session `1025 tok/s`. + +## Phase 28 NVFP4 MMQ Occupancy Build-Knob A/B + +Phase 28 tested the remaining small, additive grouped-MMQ occupancy knobs +already present in the llama.cpp fork. This was a build-vs-build A/B only; no +source change was promoted. 
+ +Artifact: + +- `/home/mudler/bench/phase28_mmq_occupancy/20260701_040450` + +Source and hardware: + +- `/home/mudler/llama-phase6-source` +- `f2521ab12 feat(server): trace speculative batch shapes` +- `GPU 0: NVIDIA GB10`, driver `580.159.03`, compute capability `12.1` + +Build/gate results: + +| variant | build result | MoE md5 | dense md5 | `MUL_MAT_ID` | +|---------|--------------|---------|-----------|--------------| +| baseline | existing `build-cuda` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `806/806` | +| `GGML_CUDA_FP4_MINBLOCKS=2` | built | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `806/806` | +| `GGML_CUDA_FP4_MMQ_Y=64` | compile-time reject | n/a | n/a | n/a | + +`GGML_CUDA_FP4_MMQ_Y=64` fails the NVFP4 writeback invariant: +`static_assert(nwarps*tile_C::I == mmq_y)`. That also rejects combined +`MMQ_Y=64+MINBLOCKS=2` as a source of evidence. `MMQ_Y=96` is not a valid +low-conflict shortcut for the same row-tile specialization reason, so it was +not promoted to a serving A/B. + +Same-session n128 serving A/B (`PTOK=128`, `GEN=64`, two reps per arm): + +| arm | reps | agg_tps | decode_agg_tps | decode_perseq_tps | prefill_tps | TTFT mean ms | +|-----|------|---------|----------------|--------------------|-------------|--------------| +| baseline | 2 | 328.8 | 705.1 | 3.970 | 1607.4 | 7868.8 | +| `MINBLOCKS=2` | 2 | 326.4 | 689.9 | 3.905 | 1644.9 | 7778.1 | +| ratio | 2 | 0.9927 | 0.9784 | 0.9836 | 1.0233 | 0.9885 | + +Post-serving variant gate remained green: + +| phase | check | status | actual | +|-------|-------|--------|--------| +| post serving | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| post serving | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| post serving | `MUL_MAT_ID` | ok | `806/806` | + +Decision: + +- `GGML_CUDA_FP4_MINBLOCKS=2` is inference-safe but does not clear the serving + A/B gate; it regressed n128 decode aggregate by about `2.2%`. 
+- `GGML_CUDA_FP4_MMQ_Y` is not a valid additive shortcut without deeper NVFP4 + writeback retile work. +- Do not promote either knob or add a LocalAI patch. The grouped-MMQ bucket + still needs a structural kernel change, not a launch-bounds/row-tile tweak. + +## Phase 29 Default-Off MoE MMQ Shape Trace + +Phase 29 added evidence-only instrumentation for the structural grouped-MMQ +path that remains after Phase 28. The trace is default-off and lives at the +host-side grouped-MMQ selector so it does not read `expert_bounds` back from the +device or add a synchronization. + +Patch and artifact: + +- Fork commit: `20a99518a feat(cuda): trace moe mmq batch shapes` +- LocalAI patch: `0056-feat-cuda-trace-moe-mmq-batch-shapes.patch` +- Artifact: `/home/mudler/bench/phase29_mmq_shape_trace/20260701_042428` + +TDD/build checks: + +| check | result | +|-------|--------| +| RED | `test-cuda-mmq-shape-trace` first failed on missing `ggml-cuda/mmq-shape-trace.h` | +| local GREEN | `cmake --build build --target test-cuda-mmq-shape-trace -j 4 && ./build/bin/test-cuda-mmq-shape-trace` | +| DGX CUDA build | `cmake --build build-cuda --target llama-completion test-backend-ops test-cuda-mmq-shape-trace` | + +Safety gates: + +| gate | MoE md5 | dense md5 | `MUL_MAT_ID` | trace lines | +|------|---------|-----------|--------------|-------------| +| default-off | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `806/806` | `0` | +| `LLAMA_MOE_MMQ_SHAPE_TRACE=4` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `806/806` | `4` | + +Example trace line: + +```text +[LLAMA_MOE_MMQ_SHAPE] type=40 moe=1 ncols_dst=104 nchannels_x=256 ncols_max=13 n_active_est=104 density=1 mmq_x_max=128 mmq_x_lim=64 mmq_x_best=16 mmq_y=128 stream_k=1 +``` + +Decision: + +- This is not a speed patch and should not be counted as parity progress by + itself. 
+- It gives a bounded, md5-safe way to collect live serving grouped-MMQ shape + evidence before designing the next structural kernel. + +## Phase 30 Live MoE MMQ Shape Distribution + +Phase 30 used patch `0056` under the n128 h2h serving workload to collect the +first 4096 grouped-MMQ selector shapes. This is a measurement-only phase. + +Artifact: + +- `/home/mudler/bench/phase30_mmq_shape_serving/20260701_043300` + +Run: + +- Source: `dgx:~/llama-phase6-source`, commit `826c97a05` +- Env: `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 LLAMA_MOE_MMQ_SHAPE_TRACE=4096` +- Workload: h2h `n=128`, `PTOK=128`, `GEN=64` +- Throughput while tracing: `decode_agg_tps=645.8`, `agg_tps=313.3`, + `prefill_tps=1597.9`, `TTFT mean=8192.3 ms` + +Trace summary: + +| bucket | total traced calls | dominant `mmq_x_best` | density range | `ncols_max` range | +|--------|--------------------|-----------------------|---------------|-------------------| +| decode-like (`ncols_max <= 128`) | 1200 | `64` (480), `32` (360), `40` (240), `48` (120) | 1-4 | 26-111 | +| prefill-like (`ncols_max > 128`) | 2896 | `128` (1816), `64` (720), `112` (240), `48` (120) | 5-16 | 132-512 | + +Overall first-4096 distribution: + +| metric | notable values | +|--------|----------------| +| `mmq_x_best` | `128`: 1816, `64`: 1200, `32`: 360, `40`: 240, `48`: 240, `112`: 240 | +| `density` | `16`: 1680, `2`: 480, `1`: 360, `6`: 360, `4`: 240, `5`: 240 | +| `stream_k` | `1`: 4096 | + +Post-run gates: + +| check | status | actual | +|-------|--------|--------| +| MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| `MUL_MAT_ID` | ok | `806/806` | + +Decision: + +- Decode serving really is feeding grouped-MMQ small-M tiles: in this trace, + decode-like calls stay at density `1-4` and `mmq_x_best <= 64`. 
+- Prefill-like calls mostly select `mmq_x_best=128` and density `16`, so a + decode-only structural kernel should not be generalized to prefill without a + separate A/B. +- Every traced call used stream-k, so a replacement kernel must account for the + current stream-k/fixup behavior rather than only conventional tiling. + +## Phase 31 Live MoE MMQ Launch Shape Distribution + +Phase 31 added patch `0057`, a default-off launch trace paired with the Phase 29 +selector trace. It records the actual launch policy after `ntiles_dst`, +`tiles_efficiency_percent`, `stream_k_blocks`, and `fixup_needed` are known. + +Artifact: + +- `/home/mudler/bench/phase31_mmq_launch_trace/20260701_064424` + +Run: + +- Fork commit: `/home/mudler/_git/llama.cpp` `c78e537b5` +- DGX mirror commit: `dgx:~/llama-phase6-source` `8b75905e9` +- Env: `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 LLAMA_MOE_MMQ_SHAPE_TRACE=4096` +- Workload: h2h `n=128`, `PTOK=128`, `GEN=64` +- Throughput while tracing: `decode_agg_tps=691.0`, `agg_tps=337.0`, + `prefill_tps=1500.4`, `TTFT mean=7671.0 ms` + +Launch summary: + +| bucket | launch lines | `fixup=1` | `stream_k_blocks == ntiles_dst` | tile efficiency | `ncols_max` range | +|--------|--------------|-----------|----------------------------------|-----------------|-------------------| +| decode-like (`ncols_max <= 128`) | 4800 | 0 | 4800 | 96-99 | 12-128 | +| prefill-like (`ncols_max > 128`) | 4920 | 0 | 4920 | 99-100 | 129-510 | + +Gates: + +| check | status | actual | +|-------|--------|--------| +| default-off MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| default-off dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| trace-enabled MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| trace-enabled dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| post-serving MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| post-serving dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| `MUL_MAT_ID` | ok | `806/806` in all three gate 
runs | + +Decision: + +- Do not pursue a no-fixup/no-stream-k shortcut for n128 serving: the measured + launch path already uses `stream_k_blocks == ntiles_dst` and never runs fixup. +- The remaining grouped-MMQ work is structural small-M kernel work, not launch + overhead. A follow-up should target the decode-like `mmq_x <= 64`, low-density + kernel shape directly and keep the prefill `mmq_x=128` path separate. + +## Phase 32 Small-M MoE MMQ Candidate Classifier + +Phase 32 added patch `0058`, a default-off small-M candidate trace. It does not +change tile selection or launch behavior; it only logs +`[LLAMA_MOE_MMQ_SMALL_M]` lines when the grouped-MMQ selector has produced a +decode-like low-density MoE shape. + +Artifact: + +- `/home/mudler/bench/phase32_small_m_classifier/20260701_070127` + +Run: + +- Fork commit: `/home/mudler/_git/llama.cpp` `2a9964d29` +- DGX mirror commit: `dgx:~/llama-phase6-source` `024f494d0` +- Env: `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 LLAMA_MOE_MMQ_SMALL_M_TRACE=4096` +- Workload: h2h `n=128`, `PTOK=128`, `GEN=64` +- Throughput while tracing: `decode_agg_tps=689.0`, `agg_tps=343.9`, + `prefill_tps=1566.5`, `TTFT mean=7849.0 ms` + +Candidate summary: + +| metric | notable values | +|--------|----------------| +| total candidates | 4096 | +| `mmq_x_best` | `64`: 1800, `48`: 1096, `40`: 360, `32`: 360, `16`: 360, `24`: 120 | +| density | `4`: 1440, `3`: 1336, `1`: 840, `2`: 480 | +| `ncols_max` | `84`: 600, `128`: 360, `70`: 240, `12`: 240, `97`: 240, `126`: 240 | + +Gates: + +| check | status | actual | +|-------|--------|--------| +| default-off MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| default-off dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| trace-enabled MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| trace-enabled dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| post-serving MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| post-serving dense md5 | ok | 
`5951a5b4d624ce891e22ab5fca9bc439` | +| `MUL_MAT_ID` | ok | `806/806` in all three gate runs | + +Decision: + +- There is enough live candidate coverage to justify a default-off tile-policy + A/B in Phase 33. +- Start with a small-M MoE-only `mmq_x=16` cap, and consider `8` only if it + compiles and preserves the existing NVFP4 tile invariants. + +## Phase 33 Small-M MoE MMQ Tile Policy A/B + +Phase 33 added patch `0059`, default-off `LLAMA_MOE_SMALL_M_TILE=`, to cap +only the Phase 32 classified small-M MoE grouped-MMQ calls. This tested whether +a vLLM-like smaller M block could improve n128 decode without rewriting the +kernel. + +Artifact: + +- `/home/mudler/bench/phase33_small_m_tile_policy/20260701_071136` + +Gates: + +| mode | MoE md5 | dense md5 | `MUL_MAT_ID` | +|------|---------|-----------|--------------| +| default-off | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `806/806` | +| `LLAMA_MOE_SMALL_M_TILE=16` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `806/806` | +| `LLAMA_MOE_SMALL_M_TILE=8` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `806/806` | +| post-serving | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `806/806` | + +Same-session n128 serving: + +| mode | decode_agg_tps | agg_tps | prefill_tps | ratio vs baseline | +|------|----------------|---------|-------------|-------------------| +| baseline | 672.1 | 339.5 | 1511.4 | 1.000x | +| `LLAMA_MOE_SMALL_M_TILE=16` | 640.3 | 328.9 | 1522.2 | 0.953x | +| `LLAMA_MOE_SMALL_M_TILE=8` | 583.2 | 307.4 | 1442.6 | 0.868x | + +Decision: + +- Reject simple smaller `mmq_x` caps for classified n128 small-M calls. They are + inference-safe but slower. +- A future grouped-MMQ kernel must change the work shape more deeply than the + host-side tile cap, or pivot to a different bucket. 
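The shape evidence behind this decision comes from the `[LLAMA_MOE_MMQ_SHAPE]` trace lines introduced in Phase 29 and the `ncols_max <= 128` decode-like split used in Phase 30. A minimal sketch of how such lines could be parsed and bucketed offline is shown here; it assumes every field is a `key=value` token as in the Phase 29 example line, and the function names are hypothetical rather than part of any shipped tooling.

```python
from collections import Counter

PREFIX = "[LLAMA_MOE_MMQ_SHAPE]"


def _coerce(value: str):
    """Prefer int, then float, then the raw string (field types are assumed)."""
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value


def parse_shape_line(line: str):
    """Parse one trace line into a dict; return None for unrelated lines."""
    line = line.strip()
    if not line.startswith(PREFIX):
        return None
    fields = {}
    for token in line[len(PREFIX):].split():
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = _coerce(value)
    return fields


def bucket(fields: dict) -> str:
    """Phase 30 split: decode-like when ncols_max <= 128, else prefill-like."""
    return "decode-like" if fields["ncols_max"] <= 128 else "prefill-like"


def summarize(lines) -> dict:
    """Histogram of mmq_x_best per bucket over an iterable of log lines."""
    hist = {"decode-like": Counter(), "prefill-like": Counter()}
    for line in lines:
        fields = parse_shape_line(line)
        if fields is not None:
            hist[bucket(fields)][fields["mmq_x_best"]] += 1
    return hist


if __name__ == "__main__":
    example = (
        "[LLAMA_MOE_MMQ_SHAPE] type=40 moe=1 ncols_dst=104 nchannels_x=256 "
        "ncols_max=13 n_active_est=104 density=1 mmq_x_max=128 mmq_x_lim=64 "
        "mmq_x_best=16 mmq_y=128 stream_k=1"
    )
    print(summarize([example, "unrelated server log line"]))
```

Lines that do not carry the prefix are skipped, so the same function can be pointed at a full server log.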
+ +## Phase 34 MoE MMID Dispatch Route Trace + +Phase 34 added patch `0060`, a default-off `LLAMA_MOE_MMID_ROUTE_TRACE=` +diagnostic around `MUL_MAT_ID` dispatch. It does not alter routing; it logs the +existing route decision as `mmvq`, `mmvf`, grouped `mmq`, `mmf`, or host-sync +`fallback`. + +Artifact: + +- `/home/mudler/bench/phase34_mmid_route_trace/20260701_072737` + +Run: + +- Fork commit: `/home/mudler/_git/llama.cpp` `6c332094c` +- DGX mirror commit: `dgx:~/llama-phase6-source` `34a256d14` +- Env: `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 LLAMA_MOE_MMID_ROUTE_TRACE=4096` +- Workload: staggered n128 `llama-server`, `GEN=64` + +Route summary: + +| metric | value | +|--------|-------| +| traced `MUL_MAT_ID` calls | 4096 | +| grouped MMQ | 2776 | +| MMVQ | 1320 | +| host-sync fallback | 0 | +| top shapes | `mmq ne2=12`: 1096, `mmq ne2=18`: 480, `mmvq ne2=8`: 360 | + +Gates: + +| check | status | actual | +|-------|--------|--------| +| default-off MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| default-off dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| trace-enabled MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| trace-enabled dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| post-serving MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| post-serving dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| `MUL_MAT_ID` | ok | `806/806` in all three gate runs | + +Decision: + +- The current n128 serving path is not hitting the host-sync fallback in traced + `MUL_MAT_ID` calls. The route is graph-safe MMVQ for very small widths and + grouped MMQ above that. +- Do not scope the next parity phase around avoiding fallback dispatch. Scope it + around grouped-MMQ small-M kernel partitioning or another measured bucket. + +## Phase 35 Regular MUL_MAT Route Trace + +Phase 35 added patch `0061`, a default-off `LLAMA_MUL_MAT_ROUTE_TRACE=` +diagnostic around regular `MUL_MAT` dispatch. 
It does not alter routing; it logs +the existing route decision for projection-heavy calls. + +Artifact: + +- `/home/mudler/bench/phase35_mul_mat_route_trace/20260701_074359` + +Run: + +- Fork commit: `/home/mudler/_git/llama.cpp` `486c28c63` +- DGX mirror commit: `dgx:~/llama-phase6-source` `18f7ad005` +- Env: `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 LLAMA_MUL_MAT_ROUTE_TRACE=8192` +- Workload: staggered n128 `llama-server`, `GEN=64` + +Route summary: + +| route | count | +|-------|-------| +| `mat_f` | 2888 | +| `op_cublas` | 2292 | +| `mmq` | 1328 | +| `vec_q` | 1214 | +| `vec_f` | 470 | + +Type summary: + +| type | meaning | count | +|------|---------|-------| +| 30 | BF16 | 3965 | +| 40 | NVFP4 | 2542 | +| 0 | F32 | 1685 | + +Top BF16 route/shape counts: + +| route | shape | count | +|-------|-------|-------| +| `mat_f` | `ne1=12 ne11=12 ne12=1 ne13=1` | 775 | +| `op_cublas` | `ne1=18 ne11=18 ne12=1 ne13=1` | 760 | +| `mat_f` | `ne1=8 ne11=8 ne12=1 ne13=1` | 570 | +| `op_cublas` | `ne1=36 ne11=36 ne12=1 ne13=1` | 380 | +| `mat_f` | `ne1=2 ne11=2 ne12=1 ne13=1` | 380 | + +Gates: + +| check | status | actual | +|-------|--------|--------| +| default-off MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| default-off dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| trace-enabled MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| trace-enabled dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| post-serving MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| post-serving dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| `MUL_MAT` | ok | `1146/1146` in all three gate runs | +| `MUL_MAT_ID` | ok | `806/806` in all three gate runs | + +Decision: + +- The first 8192 regular `MUL_MAT` calls in n128 serving are dominated by BF16 + direct `mat_f` and generic `op_cublas`, not batched cuBLAS. +- Next projection work should either add a cuBLAS/MMF subroute trace or test a + bounded BF16 route policy for the `op_cublas` shapes. 
Do not chase batched + cuBLAS for this measured serving slice. + +## Phase 36 cuBLAS Subroute Trace + +Phase 36 added patch `0062`, a default-off `LLAMA_CUBLAS_ROUTE_TRACE=` +diagnostic around the generic cuBLAS `MUL_MAT` path. It does not alter branch +behavior; it classifies existing calls as `nvfp4_bf16_tc`, `bf16_tc`, +`f16_tc_32f`, `f16_tc_16f`, or `sgemm`. + +Artifact: + +- `/home/mudler/bench/phase36_cublas_route_trace/20260701_081228` + +Run: + +- Fork commit: `/home/mudler/_git/llama.cpp` `38c4ef2e4` +- DGX mirror commit: `dgx:~/llama-phase6-source` `e0224393a` +- Env: `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 LLAMA_CUBLAS_ROUTE_TRACE=8192` +- Workload: staggered n128 `llama-server` diagnostic trace + +Route summary: + +| route | count | +|-------|------:| +| `bf16_tc` | 5681 | +| `sgemm` | 2511 | + +Top shapes: + +| route | shape | count | +|-------|-------|------:| +| `bf16_tc` | `type=30 row_diff=32 src1_ncols=510 ne00=2048 ne10=2048` | 360 | +| `bf16_tc` | `type=30 row_diff=8192 src1_ncols=510 ne00=2048 ne10=2048` | 240 | +| `bf16_tc` | `type=30 row_diff=2048 src1_ncols=510 ne00=4096 ne10=4096` | 240 | +| `sgemm` | `type=0 row_diff=256 src1_ncols=510 ne00=2048 ne10=2048` | 240 | +| `sgemm` | `type=0 row_diff=1 src1_ncols=510 ne00=2048 ne10=2048` | 240 | + +Gates: + +| check | status | actual | +|-------|--------|--------| +| default-off MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| default-off dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| trace-enabled MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| trace-enabled dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| post-serving MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| post-serving dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| `MUL_MAT` | ok | `1146/1146` default, trace, post-serving | +| `MUL_MAT_ID` | ok | `806/806` default, trace, post-serving | + +Decision: + +- Phase 35's generic `op_cublas` bucket is BF16 tensor-core plus F32 SGEMM in + this 
serving slice. It is not NVFP4 cuBLAS and not batched cuBLAS. +- The next projection phase should identify whether the `type=0` SGEMM shapes + are expected glue tensors or a missed BF16 route. Do not change routing until + a separately gated policy proves md5/op safety. + +## Phase 37 cuBLAS Tensor-Name Trace + +Phase 37 added patch `0063`, extending the default-off +`LLAMA_CUBLAS_ROUTE_TRACE=` diagnostic with `src0`, `src1`, and `dst` tensor +names. It is instrumentation only. + +Artifact: + +- `/home/mudler/bench/phase37_cublas_name_trace/20260701_083227` + +Run: + +- Fork commit: `/home/mudler/_git/llama.cpp` `2d590d770` +- DGX mirror commit: `dgx:~/llama-phase6-source` `2cbb61969` +- Env: `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 LLAMA_CUBLAS_ROUTE_TRACE=4096` +- Workload: staggered n128 `llama-server` diagnostic trace + +Route summary: + +| route | count | +|-------|------:| +| `bf16_tc` | 2884 | +| `sgemm` | 1212 | + +Named bucket summary: + +| route | tensor pattern | +|-------|----------------| +| `bf16_tc` | `blk.N.attn_gate.weight -> z-N` | +| `bf16_tc` | `blk.N.ssm_out.weight -> linear_attn_out-N` | +| `sgemm` | `blk.N.ffn_gate_inp.weight -> ffn_moe_logits-N` | +| `sgemm` | `blk.N.ffn_gate_inp_shexp.weight -> shared_expert_gate-N` | + +Gates: + +| check | status | actual | +|-------|--------|--------| +| default-off MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| default-off dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| trace-enabled MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| trace-enabled dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| post-serving MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| post-serving dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| `MUL_MAT` | ok | `1146/1146` default, trace, post-serving | +| `MUL_MAT_ID` | ok | `806/806` default, trace, post-serving | + +Decision: + +- The Phase 36 F32 SGEMM bucket is mainly MoE gate logits and shared-expert gate + projections, not an 
anonymous missed dense projection route. +- Do not blindly force these calls to BF16. First inspect the model-load tensor + types for `ffn_gate_inp*`; if changing weight dtype or graph routing is + considered, require md5/op gates and KL validation. + +## Phase 38 Gate Projection Policy + +Phase 38 is a safety and scope checkpoint before any `ffn_gate_inp*` route +change. It makes the reusable inference gate stricter by default and records why +the Phase 37 SGEMM bucket should not be treated as a missed BF16 route. + +Artifact: + +- `/home/mudler/bench/phase38_gate_baseline/20260701_084410` + +Preflight: + +| check | actual | +|-------|--------| +| GPU | `NVIDIA GB10, 580.159.03` | +| docker containers | `0` | +| `local-ai-worker` containers | `0` | +| GPU compute apps | `0` | +| GPU lock owner | `FREE phase33-small-m-tile-policy-done 1782883234` | + +Fresh baseline gates against the current Phase37 build: + +| check | status | actual | +|-------|--------|--------| +| MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| `MUL_MAT` | ok | `1146/1146` | +| `MUL_MAT_ID` | ok | `806/806` | + +Source comparison: + +- `qwen35moe.cpp` creates `ffn_gate_inp.weight` as `[n_embd, n_expert]` and + `ffn_gate_inp_shexp.weight` as `[n_embd]`. +- `llama-graph.cpp` computes router logits with `build_lora_mm(gate_inp, cur)` + and labels the result `ffn_moe_logits`. +- vLLM Qwen3-Next constructs both gates as `ReplicatedLinear(..., + quant_config=None)`, and its fused-MoE runner can concatenate router and + shared-expert gate weights for one fused-gate forward path. + +Decision: + +- The `sgemm` bucket is router/shared-expert gate math kept unquantized by both + engines. It is expected F32 policy, not an accidental cuBLAS fallback. +- Do not force BF16 or NVFP4 for `ffn_gate_inp*`. +- A future optimization can test a default-off fused gate projection that + preserves F32 math and split semantics. 
Gate it with MoE/dense md5, + `MUL_MAT`, `MUL_MAT_ID`, and KL validation if either md5 changes before any + serving benchmark. + +## Phase 39 Gate Fusion Feasibility + +Phase 39 checked whether the Phase38 follow-up should be a quick graph-time +fused gate projection. + +Artifacts: + +- `/home/mudler/bench/phase37_cublas_name_trace/20260701_083227` +- `/home/mudler/bench/phase27_graph_node_serving/20260701_055519` +- `/home/mudler/bench/phase39_gate_sgemm_profile/phase27_reanalysis` + +Evidence: + +| source | result | +|--------|--------| +| Phase37 route trace | `sgemm=1212`, with per-layer `ffn_gate_inp.weight -> ffn_moe_logits` and `ffn_gate_inp_shexp.weight -> shared_expert_gate` entries | +| Phase27 serving profile | total kernel time `20.0372s` | +| Phase27 serving profile | `concat_layout=459.84ms` (`2.29%`, `2250` instances) | +| Phase27 serving profile | `cublas_bf16_gemm=1892.81ms` (`9.45%`) and `cutlass_bf16_gemm=684.01ms` (`3.41%`) | + +Decision: + +- Reject the quick graph-time fused gate shortcut based on `ggml_concat()` of + the two gate weights. `concat_layout` is already a measurable serving bucket, + so adding graph-time weight concatenation risks moving work into an existing + bottleneck before removing enough SGEMM overhead. +- The only acceptable future fused-gate design is a persistent/load-time F32 + combined gate weight, split by output views after one matmul. It must be + default-off, keep gate weights in F32, avoid graph-time weight concat, and + pass MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` gates before any serving + benchmark. If md5 changes, run KL first and reject on KL regression. + +## Phase 40 Max-Concurrency C1 Check + +Phase 40 tested the remaining C1 hypothesis from the lever map: use paged KV's +lower memory footprint to run a higher-concurrency serving point where vLLM +falls behind or fails to fit. 
+ +Artifacts: + +- `/home/mudler/bench/phase40_max_concurrency_dryrun/20260701_090002` +- `/home/mudler/bench/phase40_max_concurrency/20260701_090012` + +Preflight: + +| check | actual | +|-------|--------| +| GPU | `NVIDIA GB10, 580.159.03` | +| docker containers | `0` | +| `local-ai-worker` containers | `0` | +| GPU compute apps | `0` | +| GPU lock owner | `FREE phase39-gate-sgemm-profile-done 1782888737` | + +Harness change: + +- `paged-current-serving-snapshot.sh` now accepts `BUILD_DIR` and defaults + `BIN` from that same directory. This keeps the benchmark build step and runtime + binaries pointed at the same CMake tree. +- Phase 40 used `BUILD_DIR=$HOME/llama-phase6-source/build-phase36`, + `BIN=$HOME/llama-phase6-source/build-phase36/bin`, + `OPS=MUL_MAT,MUL_MAT_ID`, `PARALLEL=256`, `CTX=262144`, `PTOK=128`, + `GEN=64`, `NPL="128 192 256"`. + +Pre/post inference gates: + +| phase | check | status | actual | +|-------|-------|--------|--------| +| pre | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| pre | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| pre | `MUL_MAT` | ok | `1146/1146` | +| pre | `MUL_MAT_ID` | ok | `806/806` | +| post | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| post | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| post | `MUL_MAT` | ok | `1146/1146` | +| post | `MUL_MAT_ID` | ok | `806/806` | + +Serving result: + +| arm | n | agg t/s | decode agg t/s | decode per-seq t/s | prefill t/s | TTFT mean ms | +|-----|---|---------|----------------|--------------------|-------------|--------------| +| paged | 128 | `326.3` | `671.8` | `3.97` | `1695.2` | `8182.3` | +| paged | 192 | `318.3` | `679.9` | `2.50` | `1605.2` | `11151.6` | +| paged | 256 | `337.1` | `829.9` | `2.09` | `1525.7` | `15065.7` | +| vLLM | 128 | `654.4` | `1013.3` | `6.72` | `5206.0` | `2582.6` | +| vLLM | 192 | `697.7` | `1185.2` | `4.88` | `4787.1` | `3690.6` | +| vLLM | 256 | `714.1` | `1306.1` | `3.90` | `4471.0` | `5124.2` | + 
+Ratios: + +| n | paged decode / vLLM | paged per-seq / vLLM | paged agg / vLLM | paged TTFT / vLLM | +|---|---------------------|----------------------|------------------|-------------------| +| 128 | `0.6630` | `0.5908` | `0.4986` | `3.1682` | +| 192 | `0.5737` | `0.5123` | `0.4562` | `3.0216` | +| 256 | `0.6354` | `0.5359` | `0.4721` | `2.9401` | + +Decision: + +- C1 does not close GB10 parity for this workload. Paged safely serves `n=256` + with canonical md5/op gates green before and after the run, but vLLM also + fits and remains materially faster. +- Do not claim a GB10 parity win from higher max concurrency at + `PTOK=128`, `GEN=64`, `n<=256`. +- The next GB10 work should stay on the profile-validated root causes: + prefill GDN, prefill MoE GEMM, and low-concurrency/full-step graph capture. + Any future C1 rerun must push beyond this tested point and keep the same + md5 plus `MUL_MAT`/`MUL_MAT_ID` gates. + +## Phase 41 Low-Concurrency Serving Check + +Phase 41 measured the opposite serving regime after Phase 40 rejected the tested +max-concurrency shortcut: low concurrency and latency-sensitive decode. This is +the regime where any remaining host/scheduler gap should be most visible.
+ +Artifacts: + +- `/home/mudler/bench/phase41_low_concurrency_dryrun/20260701_091429` +- `/home/mudler/bench/phase41_low_concurrency/20260701_091437` + +Preflight: + +| check | actual | +|-------|--------| +| GPU | `NVIDIA GB10, 580.159.03` | +| docker containers | `0` | +| `local-ai-worker` containers | `0` | +| GPU compute apps | `0` | +| GPU lock owner | `FREE released-by-codex-current-serving-snapshot 1782889704` | + +Run shape: + +- `BUILD_DIR=$HOME/llama-phase6-source/build-phase36` +- `BIN=$HOME/llama-phase6-source/build-phase36/bin` +- `OPS=MUL_MAT,MUL_MAT_ID` +- `PARALLEL=32`, `CTX=32768`, `PTOK=128`, `GEN=64`, `NPL="1 8 32"` + +Pre/post inference gates: + +| phase | check | status | actual | +|-------|-------|--------|--------| +| pre | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| pre | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| pre | `MUL_MAT` | ok | `1146/1146` | +| pre | `MUL_MAT_ID` | ok | `806/806` | +| post | MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| post | dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| post | `MUL_MAT` | ok | `1146/1146` | +| post | `MUL_MAT_ID` | ok | `806/806` | + +Serving result: + +| arm | n | agg t/s | decode agg t/s | decode per-seq t/s | prefill t/s | TTFT mean ms | +|-----|---|---------|----------------|--------------------|-------------|--------------| +| paged | 1 | `50.6` | `56.5` | `55.61` | `1221.5` | `131.8` | +| paged | 8 | `159.5` | `222.9` | `26.72` | `1438.8` | `835.9` | +| paged | 32 | `240.1` | `393.9` | `11.15` | `1615.7` | `2784.4` | +| vLLM | 1 | `67.5` | `75.4` | `74.14` | `1720.4` | `95.3` | +| vLLM | 8 | `251.8` | `296.5` | `36.12` | `4558.8` | `266.0` | +| vLLM | 32 | `454.6` | `592.4` | `17.43` | `5376.5` | `818.6` | + +Ratios: + +| n | paged decode / vLLM | paged per-seq / vLLM | paged agg / vLLM | paged TTFT / vLLM | +|---|---------------------|----------------------|------------------|-------------------| +| 1 | `0.7493` | `0.7501` | `0.7496` | `1.3830` 
| +| 8 | `0.7518` | `0.7398` | `0.6334` | `3.1425` | +| 32 | `0.6649` | `0.6397` | `0.5282` | `3.4014` | + +Decision: + +- The low-concurrency gap is real, but Phase 41 does not reopen D1/full-step graph + capture. Patch `0043` already ships that behavior default-on, and Phase 34 + route tracing found `host_sync=0/4096` for the current n128 serving path. + Paged is about `0.75x` vLLM decode at `n=1/8` and `0.665x` at `n=32`. +- TTFT is the bigger user-visible low-concurrency gap, especially at `n=8/32`; + prefill GDN and MoE GEMM work therefore still matters even in a decode-focused + serving discussion. +- Do not fund another D1 graph-capture patch on GB10 unless a fresh route trace + first proves a host-sync fallback or graph-disable condition has returned. The + next implementation target should be a measured non-D1 bucket, gated by the + same md5 plus `MUL_MAT`/`MUL_MAT_ID` checks. + +## Phase 42 D1/GDN/GEMM Target Reconciliation + +Phase 42 challenged the Phase 41 wording against the patch stack and read-only +subagent analysis. It resolves the next-target decision before any source work.
+ +Evidence: + +| track | evidence | decision | +|-------|----------|----------| +| D1/full-step graph capture | Patch `0043` is default-on for grouped MMQ decode and opt-out via `LLAMA_MOE_NO_FORCE_GRAPHS=1`; Phase34 route trace found `host_sync=0/4096`; `VLLM_PARITY_FINAL.md` marks D1 shipped and the host-sync premise refuted | closed on current GB10 path | +| S3 decode-shape-stable scheduling | Patch `0041` is shipped default-off after end-to-end A/B showed worse TTFT and lower throughput despite better per-step decode metrics | keep opt-in only | +| GDN prefill | Patches `0046`/`0047` are the shipped GB10 GDN wins; C32 slab, QS-early, and Global-Ai32 were md5-clean but slower | do not add another low-conflict GB10 GDN reorder | +| W4A16 / prefill GEMM | Patches `0033`/`0034`/`0035` are default-off; `0048`-`0050` improved forced W4A16 only marginally and did not beat default MMQ | do not add another small W4A16 body/metadata tweak | + +Next target: + +- The only small incremental candidate left from the current evidence is the + persistent/load-time F32 combined gate projection scoped in Phase38/39: + combine `ffn_gate_inp.weight` and `ffn_gate_inp_shexp.weight` once, run one + F32 gate matmul, and split/view the output. Do not use graph-time + `ggml_concat()`. +- It must be default-off, fork-first, and validated with MoE/dense md5, + `MUL_MAT`, `MUL_MAT_ID`, and KL if either md5 changes before any serving + benchmark. + +## Phase 43 Persistent Gate Fusion Feasibility + +Phase 43 checked whether the Phase42 "small source candidate" can really be +implemented as a low-conflict persistent/load-time combined gate tensor. 
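The combined-gate idea scoped in Phase 38/39 amounts to appending the shared-expert gate vector as one extra output column of the router weight, running one matmul, and splitting the result. The toy below illustrates only that algebra in pure Python, using the `[n_embd, n_expert]` and `[n_embd]` shapes named in this document. It is not ggml or llama.cpp code, the function names are hypothetical, and Python floats are doubles, so exact equality here says nothing about md5 equality of an F32 GPU kernel.

```python
# Toy illustration of "combine once, one matmul, split the output".
# Weights are stored as weight[i][j]: input feature i, output column j.


def matvec(weight, x):
    """Compute out[j] = sum_i weight[i][j] * x[i]."""
    n_out = len(weight[0])
    return [sum(weight[i][j] * x[i] for i in range(len(x))) for j in range(n_out)]


def combine_gates(router_w, shexp_w):
    """Build the [n_embd, n_expert + 1] weight once (load time in the proposal)."""
    return [row + [shexp_w[i]] for i, row in enumerate(router_w)]


def fused_gate(combined_w, x, n_expert):
    """One matmul, then split into router logits and the shared-expert gate."""
    out = matvec(combined_w, x)
    return out[:n_expert], out[n_expert]


if __name__ == "__main__":
    router_w = [[1.0, 2.0, 3.0], [0.5, -1.0, 0.0]]  # n_embd=2, n_expert=3
    shexp_w = [2.0, -0.5]                            # n_embd=2
    x = [1.0, 2.0]

    separate_logits = matvec(router_w, x)
    separate_gate = matvec([[w] for w in shexp_w], x)[0]

    logits, gate = fused_gate(combine_gates(router_w, shexp_w), x, n_expert=3)
    print(logits == separate_logits, gate == separate_gate)
```

In this toy each output column is an independent dot product, which is why the split results match the separate matmuls. A real kernel may tile and accumulate differently, which is the reason the document requires md5 gates and KL validation for any such change.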
+ +Source facts: + +| path | finding | +|------|---------| +| `src/models/qwen35moe.cpp` | `ffn_gate_inp.weight` is loaded as `[n_embd, n_expert]`; `ffn_gate_inp_shexp.weight` is loaded separately as `[n_embd]` | +| `src/models/qwen35moe.cpp` | the routed gate is consumed inside `build_moe_ffn(...)`; the shared-expert gate is consumed later as a separate `build_lora_mm(ffn_gate_inp_shexp, cur)` | +| `src/llama-model-loader.cpp` | `create_tensor(...)` duplicates tensors from GGUF metadata and allocates backend buffers before `load_all_data(...)`; it has `create_tensor_as_view(...)` for views of existing GGUF tensors, not for new persistent derived tensors | +| `src/llama-model.cpp` | backend buffers are allocated from loader contexts before tensor data is loaded; adding a new persistent derived weight requires a new derived-weight allocation/materialization path, not a local Qwen graph change | + +Decision: + +- Reject persistent/load-time fused gate projection as a "small" GB10 shortcut. + It is only low-conflict if the combined weight already exists in the GGUF, or + if llama.cpp gains a general derived-weight facility. Neither is true in the + current fork. +- Do not fall back to graph-time `ggml_concat()`; Phase39 already rejected that + because `concat_layout` is measurable in serving. +- Do not implement a Qwen-only loader hack that reads both tensors back to host, + allocates an extra backend weight buffer, and patches layer pointers after + load. That is high conflict surface for a gate-only SGEMM bucket and would need + new lifetime/state-management tests across mmap, offload, split buffers, and + MTP blocks. +- The remaining GB10 parity work is no longer a shortcut patch. It is either a + larger funded kernel/loader effort with its own design, or a hardware pivot + benchmark. Any future implementation still needs the canonical MoE/dense md5, + `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates before benchmarking. 
+ +## Phase 44 Hardware-Pivot Harness Readiness + +Phase 44 prepares the audited current-stack serving snapshot for hardware-pivot +runs without editing the harness between hosts. This is a harness-only change: +it does not modify llama.cpp inference code, patch-series source, md5 gates, op +gates, or any benchmark result. + +New vLLM serving overrides: + +| variable | default | vLLM flag | +|----------|---------|-----------| +| `VLLM_GPU_MEMORY_UTILIZATION` | `0.85` | `--gpu-memory-utilization` | +| `VLLM_MAX_MODEL_LEN` | `4096` | `--max-model-len` | +| `VLLM_MAX_NUM_SEQS` | `256` | `--max-num-seqs` | +| `VLLM_TENSOR_PARALLEL_SIZE` | `1` | `--tensor-parallel-size` | +| `VLLM_EXTRA_ARGS` | empty | whitespace-split args appended to `vllm serve` | + +Verification scope: + +- Red help-text check first proved `VLLM_MAX_NUM_SEQS` was absent from + `paged-current-serving-snapshot.sh --help`. +- Red DGX dry-run check first proved the harness did not print + `VLLM_MAX_NUM_SEQS=512` when the override was supplied. +- Green checks after the patch included `bash -n`, help-text grep, and DGX + `DRY_RUN=1` preflight with the override values printed before any server + starts. Artifact: + `/home/mudler/bench/phase44_hardware_pivot_harness_dryrun/20260701_094038`. + +Decision: + +- Use the same audited harness for a future datacenter-Blackwell or other + non-GB10 parity snapshot by overriding vLLM limits in the environment instead + of editing the script. +- This does not reopen GB10 shortcut work and does not claim parity. A real + hardware-pivot benchmark still needs the normal preflight, `hardware.txt`, + pre/post MoE/dense md5 gates, `MUL_MAT`/`MUL_MAT_ID` checks, and + KL-if-md5-changes before interpreting throughput. + +## Phase 45 Inference Gate Guard + +Phase 45 answers the inference-safety question after the harness-only Phase44 +change by running the canonical paged inference gates on DGX. 
This is a +gate-only phase: it does not benchmark serving throughput and does not change +inference code. + +Artifact: + +- `/home/mudler/bench/phase45_inference_gate_guard/20260701_094320` + +Preflight: + +- Docker containers: `0` +- `local-ai-worker` containers: `0` +- GPU compute apps: `0` +- GPU lock owner: `FREE released-by-codex-current-serving-snapshot 1782890417` + +Gate command: + +```bash +BIN=$HOME/llama-phase6-source/build-phase36/bin \ +ART=$HOME/bench/phase45_inference_gate_guard/20260701_094320 \ +OPS=MUL_MAT,MUL_MAT_ID \ +~/paged-inference-gates.sh +``` + +Results: + +| check | result | +|-------|--------| +| MoE paged md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| Dense paged md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| `MUL_MAT` backend op | `1146/1146`, `Backend CUDA0: OK` | +| `MUL_MAT_ID` backend op | `806/806`, `Backend CUDA0: OK` | + +Decision: + +- Current DGX phase36 build still passes the canonical inference md5/op gates. +- Phase44 did not touch inference code; Phase45 provides the post-change guard + artifact for future handoff and comparison. + +## Phase 46 Served-Model-Name Harness Readiness + +Phase 46 removes the remaining hardcoded `q36` model name from the audited +serving snapshot harness. This is a harness-only hardware-pivot readiness +change: it does not change llama.cpp inference code, patch-series source, md5 +gates, op gates, or any throughput result. + +New override: + +| variable | default | used for | +|----------|---------|----------| +| `SERVED_MODEL_NAME` | `q36` | vLLM `--served-model-name`, vLLM readiness check, and h2h `--model` requests for both paged and vLLM arms | + +Verification: + +- Red help-text check first proved `SERVED_MODEL_NAME` was absent from + `paged-current-serving-snapshot.sh --help`. +- Red DGX dry-run check first proved the harness did not print + `SERVED_MODEL_NAME=dense-q36` when supplied. 
+- Green checks after the patch included `bash -n`, help-text grep, a source grep + proving no hardcoded `q36` serve/request names remain in the harness, and DGX + `DRY_RUN=1` preflight with the override value printed before any server + starts. Artifact: + `/home/mudler/bench/phase46_served_model_name_dryrun/20260701_094849`. + +Decision: + +- Future dense, MoE, or hardware-pivot snapshots can keep the same audited + harness while setting model paths and the served OpenAI model name from the + environment. +- This does not claim a new parity result. Full runs still require the normal + preflight, `hardware.txt`, pre/post md5 gates, `MUL_MAT`/`MUL_MAT_ID`, and + KL-if-md5-changes gates before interpreting throughput. + +## Phase 47 Dense Serving Snapshot Attempt + +Phase 47 attempted to use the Phase46 model-name override for a dense +paged-vs-vLLM serving snapshot. The first full attempt is incomplete and must +not be used as a dense parity result. + +Artifacts: + +- Dry-run: `/home/mudler/bench/phase47_dense_serving_dryrun/20260701_095141` +- Incomplete full attempt: + `/home/mudler/bench/phase47_dense_serving/20260701_095151` + +Run shape: + +- `MODEL=$HOME/bench/q36-27b-nvfp4.gguf` +- `VLLM_MODEL=$HOME/bench/q36-27b-nvfp4-vllm` +- `SERVED_MODEL_NAME=dense-q36` +- `NPL="1 8 32 128"`, `PARALLEL=128`, `CTX=131072`, `PTOK=128`, `GEN=64` +- `OPS=MUL_MAT,MUL_MAT_ID` + +Completed before failure: + +- Preflight was clean: docker `0`, `local-ai-worker` `0`, GPU compute `0`. +- Pre-gates were green: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense + md5 `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, + `MUL_MAT_ID` `806/806`. 
+- Paged dense arm completed through `n=128`: + +| n | paged decode agg t/s | paged per-seq t/s | paged agg t/s | paged TTFT ms | +|---|----------------------|-------------------|----------------|---------------| +| 1 | `13.3` | `13.14` | `12.5` | `312.3` | +| 8 | `85.5` | `10.35` | `62.5` | `2068.5` | +| 32 | `198.1` | `5.44` | `105.1` | `7608.5` | +| 128 | `361.8` | `1.89` | `143.0` | `20501.7` | + +Failure/root cause: + +- vLLM dense startup exceeded the old fixed `240` one-second readiness budget. + The server log showed weight loading alone took about `199.43s`, followed by + compile, autotune, CUDA graph capture, and multimodal warmup before the server + began listening. +- `vllm/models.json` is empty and `models.json.err` contains an initial + connection failure, so no vLLM result JSONs were produced. +- Cleanup then waited on the vLLM server PID after `SIGTERM`; manual cleanup was + required. DGX was returned to idle with owner + `FREE released-by-codex-phase47-cleanup 1782892962`. + +Decision: + +- Treat this artifact as a harness failure investigation, not a benchmark. +- Retry Phase47 only after the Phase48 readiness/cleanup hardening is present. + +## Phase 47 Dense Serving Snapshot Retry + +After Phase48 hardening, Phase47 was retried and completed successfully. 
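The two harness behaviors this retry depended on, a budgeted readiness loop and a bounded teardown, are described under Phase 48. A minimal sketch of both follows. It is not the harness source: the name `wait_ready`, the argument order, and the configurable grace period are assumptions. Only `stop_server_pid`, `VLLM_READY_ATTEMPTS`, `curl --max-time 2 -fsS`, the 30-second wait, and `kill -9` come from the Phase 48 notes.

```bash
# Sketch only; see the assumptions listed above.

# Poll a URL until it answers or the attempt budget runs out.
# Each probe is bounded so a single hung connection cannot stall the loop.
wait_ready() {
    local url=$1
    local attempts=${2:-${VLLM_READY_ATTEMPTS:-600}}
    local interval=${3:-1}      # seconds between probes (assumed parameter)
    local i=0
    while [ "$i" -lt "$attempts" ]; do
        if curl --max-time 2 -fsS "$url" >/dev/null 2>&1; then
            return 0
        fi
        sleep "$interval"
        i=$((i + 1))
    done
    return 1
}

# Stop a background server: SIGTERM, bounded wait, SIGKILL if needed, then reap.
stop_server_pid() {
    local pid=$1
    local grace=${2:-30}        # seconds to wait before SIGKILL
    local waited=0
    kill -TERM "$pid" 2>/dev/null || return 0   # process already gone
    while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$grace" ]; do
        sleep 1
        waited=$((waited + 1))
    done
    if kill -0 "$pid" 2>/dev/null; then
        kill -9 "$pid" 2>/dev/null
    fi
    wait "$pid" 2>/dev/null || true
}
```

With a one-second interval, `VLLM_READY_ATTEMPTS=700` budgets roughly 700 seconds of startup plus probe time, which leaves headroom over the dense vLLM load that exceeded the old 240-attempt budget.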
+ +Artifact: + +- `/home/mudler/bench/phase47_dense_serving_retry/20260701_100811` + +Run shape: + +- `MODEL=$HOME/bench/q36-27b-nvfp4.gguf` +- `VLLM_MODEL=$HOME/bench/q36-27b-nvfp4-vllm` +- `SERVED_MODEL_NAME=dense-q36` +- `NPL="1 8 32 128"`, `PARALLEL=128`, `CTX=131072`, `PTOK=128`, `GEN=64` +- `OPS=MUL_MAT,MUL_MAT_ID`, `VLLM_READY_ATTEMPTS=700` + +Pre/post gates: + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Results: + +| arm | n | agg t/s | decode agg t/s | decode per-seq t/s | prefill t/s | TTFT ms | +|-----|---|---------|-----------------|---------------------|-------------|---------| +| paged | 1 | `12.5` | `13.3` | `13.11` | `515.1` | `312.5` | +| vLLM | 1 | `9.6` | `9.9` | `9.72` | `983.6` | `166.7` | +| paged | 8 | `61.8` | `85.2` | `10.39` | `579.5` | `2201.4` | +| vLLM | 8 | `67.6` | `73.7` | `9.04` | `2147.7` | `544.0` | +| paged | 32 | `105.9` | `198.7` | `5.44` | `595.8` | `7442.7` | +| vLLM | 32 | `171.7` | `219.9` | `6.49` | `2094.4` | `2041.9` | +| paged | 128 | `139.6` | `360.8` | `1.86` | `608.1` | `21177.2` | +| vLLM | 128 | `275.3` | `456.0` | `2.89` | `1889.6` | `6615.7` | + +Ratios: + +| n | paged decode / vLLM | paged per-seq / vLLM | paged agg / vLLM | paged TTFT / vLLM | +|---|---------------------|----------------------|------------------|-------------------| +| 1 | `1.3434` | `1.3488` | `1.3021` | `1.8746` | +| 8 | `1.1560` | `1.1493` | `0.9142` | `4.0467` | +| 32 | `0.9036` | `0.8382` | `0.6168` | `3.6450` | +| 128 | `0.7912` | `0.6436` | `0.5071` | `3.2011` | + +Decision: + +- Dense decode is ahead of vLLM at low concurrency (`n=1/8`) but falls behind + at `n=32/128`; this mirrors the broader conclusion that low-N decode can be + strong while 
prefill/TTFT and higher-concurrency serving remain gaps. +- Dense TTFT remains much worse than vLLM at all tested concurrency points, so + dense serving does not change the GB10 conclusion or reopen closed shortcut + work. + +## Phase 48 Serving Harness Readiness Hardening + +Phase 48 fixes the harness behavior exposed by the failed dense snapshot +attempt. It is a harness reliability change, not an inference change. + +Changes: + +- Add `LLAMA_READY_ATTEMPTS` (default `240`) and `VLLM_READY_ATTEMPTS` (default + `600`) so slow vLLM model load/compile paths can be pre-budgeted. +- Bound each HTTP readiness probe with `curl --max-time 2` so a single probe + cannot hang the readiness loop. +- Replace direct `kill` plus unbounded `wait` with `stop_server_pid`, which + sends `SIGTERM`, waits up to 30 seconds, then sends `SIGKILL` before `wait`. +- Use the bounded cleanup helper for normal paged teardown, normal vLLM + teardown, and error-path `release_lock`. + +Verification: + +- Red checks first proved `VLLM_READY_ATTEMPTS`, bounded curl, and hard-kill + cleanup were absent. +- Green checks after the patch included `bash -n`, help-text grep, grep for + `curl --max-time 2 -fsS "$url"`, grep for `kill -9 "$SERVER_PID"`, and a DGX + dense dry-run with `VLLM_READY_ATTEMPTS=700`. +- DGX dry-run artifact: + `/home/mudler/bench/phase48_readiness_harness_dryrun/20260701_100533`. + +## Phase 49 vLLM Env Hygiene + +Phase 49 cleans up benchmark log noise observed during the Phase47 retry. vLLM +warned about harness-owned environment variables such as `VLLM_READY_ATTEMPTS` +and `VLLM_MODEL` because they were inherited by the `vllm serve` process. + +Change: + +- Wrap `vllm serve` with `env -u` for harness-owned variables: + `VLLM_MODEL`, `VLLM_BIN`, `VLLM_READY_ATTEMPTS`, + `VLLM_GPU_MEMORY_UTILIZATION`, `VLLM_MAX_MODEL_LEN`, `VLLM_MAX_NUM_SEQS`, + `VLLM_TENSOR_PARALLEL_SIZE`, and `VLLM_EXTRA_ARGS`. +- Keep intentional vLLM runtime variables such as `VLLM_LOGGING_LEVEL`. 
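The scrub above can be sketched as a small wrapper. The variable list is the one given in this section; the function name `run_scrubbed` is an assumption, and the demonstration runs a harmless `sh -c` instead of `vllm serve`.

```bash
# Sketch: run a command with harness-owned VLLM_* variables removed from its
# environment, leaving intentional runtime variables untouched.
run_scrubbed() {
    env \
        -u VLLM_MODEL -u VLLM_BIN -u VLLM_READY_ATTEMPTS \
        -u VLLM_GPU_MEMORY_UTILIZATION -u VLLM_MAX_MODEL_LEN \
        -u VLLM_MAX_NUM_SEQS -u VLLM_TENSOR_PARALLEL_SIZE \
        -u VLLM_EXTRA_ARGS \
        "$@"
}

# Demonstration: the child sees VLLM_LOGGING_LEVEL but not VLLM_MODEL.
export VLLM_MODEL=/tmp/example VLLM_LOGGING_LEVEL=INFO
run_scrubbed sh -c 'echo "model=${VLLM_MODEL:-unset} log=${VLLM_LOGGING_LEVEL:-unset}"'
# prints: model=unset log=INFO
```

Because `env -u` only affects the child process, the harness itself can keep reading the same variables after the server has been launched.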
+ +Verification: + +- Red grep first proved the scrub was absent. +- Green checks after the patch included `bash -n`, grep for `-u VLLM_MODEL`, + and a DGX dense dry-run with `VLLM_READY_ATTEMPTS=700`. +- DGX dry-run artifact: + `/home/mudler/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138`. + +## Phase 50 Dense True Decode Profile + +Phase 50 separates dense high-concurrency decode from the Phase47 h2h serving +window. The Phase47 h2h `decode_agg_tps` metric can count tokens generated by +early requests while later requests are still in prefill, then divide by a +window that starts at the last first-token. That is useful serving telemetry, +but it is not a pure steady-decode measurement. + +Artifact: + +- `/home/mudler/bench/phase50_dense_true_decode/20260701_103120` + +Preflight: + +- Docker containers: `0` +- `local-ai-worker`: `0` +- GPU compute apps: `0` +- GPU: `NVIDIA GB10`, driver `580.159.03` + +Inference gates: + +| phase | build | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|-------|---------|-----------|-----------|--------------| +| pre | `build-phase36` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| pre | `build-cuda` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `build-cuda` | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +`build-phase36/bin` had the completion and op-test binaries but not +`llama-batched-bench`, so the actual profiled llama.cpp decode binary came from +`~/llama-phase6-source/build-cuda/bin`. That build was gated before and after +the profile. + +Profile method: + +- Shape: dense Qwen3.5, `npl=128`, `npp=128`, `ntg=16` and `ntg=64`. +- Paged command: `llama-batched-bench` with `LLAMA_KV_PAGED=1`, + `LLAMA_MOE_FORCE_GRAPHS=1`, `-c 131072 -b 2048 -ub 512 -ngl 99 -fa on`. 
+- vLLM command: in-process `LLM.generate`, `max_model_len=4096`, + `max_num_seqs=256`, `gpu_memory_utilization=0.85`, prefix caching disabled. +- Both profiles used `nsys --cuda-graph-trace=node`. +- Difference method: `(ntg64 tokens - ntg16 tokens) / (ntg64 wall - ntg16 wall)`. + +Results: + +| engine | ntg16 wall s | ntg64 wall s | delta tokens | delta wall s | true decode t/s | +|--------|--------------|--------------|--------------|--------------|-----------------| +| paged | `5.754` | `21.768` | `6144` | `16.014` | `383.66` | +| vLLM | `13.041` | `27.165` | `6144` | `14.124` | `435.00` | +| ratio | | | | | `0.8820` | + +Interpretation: + +- Dense true decode at `n=128` is about `88.2%` of vLLM, not the `79.1%` + implied by Phase47 h2h aggregate decode. The Phase47 serving window therefore + includes real scheduler/accounting effects in addition to GPU decode speed. +- There is still a real dense GPU-steady decode gap of about `12%`, but it is + not large enough to explain Phase47 aggregate serving (`50.7%` of vLLM) or + TTFT (`3.20x` vLLM) by itself. +- The next low-conflict code phase should add an opt-in serving + batch-composition/admission trace around `server_context::pre_decode()` to + measure decode tokens admitted, prompt tokens admitted, waiting prompt slots, + graph reuse, and prefill starvation. Do not start with another GDN or GEMM + rewrite unless that trace rules the scheduler out. + +## Phase 51 Serving Admission Trace + +Phase 51 implements the Phase50 next step in the llama.cpp fork. This is a +trace-only change, gated behind `LLAMA_SERVING_TRACE=1`; default inference and +batch scheduling are unchanged. + +Fork commit: + +- `/home/mudler/_git/llama.cpp` `localai-paged` +- `c6cb8460e feat(server): trace serving admission batches` + +Change: + +- Add `tools/server/server-admission-trace.h` with a small accumulator and + formatter. +- Add `tests/test-server-admission-trace.cpp` and CMake target coverage. 
+- Wire counters into `server_context_impl::pre_decode()` for: + decode tokens already in the batch, prompt tokens admitted, waiting prompt + slots, started/continued prompt slots, decode-only steps, `n_batch`, + `n_ubatch`, `prefill_budget_step`, and `prefill_cap_per_slot`. +- Print one aggregate summary when the server context is destroyed, only when + `LLAMA_SERVING_TRACE=1` and at least one scheduler step was observed. + +Verification: + +- Red test first: `test-server-admission-trace` failed to build before + `server-admission-trace.h` existed. +- Local fork: `test-server-admission-trace` built and passed, `llama-server` + built, and `ctest --test-dir build -R '^test-server-admission-trace$'` + passed. +- DGX artifact: + `/home/mudler/bench/phase51_serving_admission_trace/20260701_110130` +- DGX `build-cuda`: `test-server-admission-trace` and `llama-server` built; + CTest passed. +- DGX inference gates on the patched `build-cuda` build passed: MoE md5 + `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 + `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and + `MUL_MAT_ID` `806/806`. + +Mirror status: + +- The fork commit is local and DGX-gated. +- The LocalAI `patches/paged/` series is not regenerated yet because the + handoff requires pushing the fork branch first, and pushes require explicit + approval. + +## Phase 52 Dense Admission Trace + +Phase 52 uses the Phase51 trace to capture the actual dense `n=128` serving +admission shape. The Phase51 patch was applied temporarily to the clean DGX +mirror, built, gated, used for the trace, and then reverted from the mirror. 
+ +Artifact: + +- `/home/mudler/bench/phase52_dense_admission_trace/20260701_111017` + +Pre/post gates: + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Clean run shape: + +- Dense GGUF: `~/bench/q36-27b-nvfp4.gguf` +- `LLAMA_SERVING_TRACE=1` +- `N=128`, `PTOK=128`, `GEN=64` +- `CTX=131072`, `PARALLEL=128`, `BATCH=2048`, `UBATCH=512` + +H2H result: + +| n | agg t/s | decode agg t/s | decode per-seq t/s | prefill t/s | TTFT mean ms | wall s | +|---|---------|-----------------|---------------------|-------------|--------------|--------| +| 128 | `139.0` | `360.5` | `1.93` | `629.5` | `23171.5` | `58.921` | + +Admission trace: + +| steps | decode-only steps | decode tokens | prompt tokens | waiting prompt slots | max waiting prompt slots | started prompt slots | continued prompt slots | +|-------|-------------------|---------------|---------------|----------------------|--------------------------|----------------------|------------------------| +| `76` | `0` | `8064` | `22785` | `267` | `35` | `128` | `139` | + +Derived values: + +- `prompt_tokens` matched h2h `prompt_tok_total` exactly: `22785`. +- `decode_tokens` were `128` fewer than h2h `gen_total`, which is expected for + one first-token transition per request. +- Average prompt tokens per scheduler step: `299.8`. +- Average decode tokens per scheduler step: `106.11`. +- Average waiting prompt slots per scheduler step: `3.51`. +- `prefill_budget_step=0` and `prefill_cap_per_slot=0`, confirming the default + stock n-batch-only prompt admission path. + +Decision: + +- The default dense `n=128` scheduler emits no pure decode steps + (`decode_only_steps=0`) and admits prompt work across mixed steps. 
That + explains why Phase47 h2h serving decode can lag the Phase50 true-decode ratio: + serving is shaped by mixed prompt/decode admission and TTFT, not just dense + decode kernels. +- The next code phase should be a small, default-off scheduler A/B or a richer + per-step histogram trace to test whether prefill chunking/admission can reduce + TTFT without regressing aggregate throughput. Do not move to another GDN/GEMM + rewrite until this scheduler hypothesis is tested. + +## Phase 53 Admission Budget Sweep + +Phase 53 tests the existing default-off admission knobs exposed by patch 0016: +`LLAMA_MAX_BATCH_TOKENS` and `LLAMA_PREFILL_CAP`. The question was whether a +simple smaller token budget improves dense `n=128` TTFT or aggregate throughput. + +Artifact: + +- `/home/mudler/bench/phase53_dense_admission_budget_sweep/20260701_111915` + +Pre/post gates: + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Results: + +| variant | agg t/s | decode agg t/s | decode per-seq t/s | prefill t/s | TTFT mean ms | wall s | steps | max waiting prompt slots | +|---------|---------|-----------------|---------------------|-------------|--------------|--------|-------|--------------------------| +| default Phase52 | `139.0` | `360.5` | `1.93` | `629.5` | `23171.5` | `58.921` | `76` | `35` | +| `LLAMA_MAX_BATCH_TOKENS=1536 LLAMA_PREFILL_CAP=512` | `134.4` | `376.7` | `1.82` | `607.0` | `22263.7` | `60.968` | `81` | `26` | +| `LLAMA_MAX_BATCH_TOKENS=1024 LLAMA_PREFILL_CAP=512` | `130.0` | `392.4` | `1.82` | `565.2` | `23234.3` | `63.003` | `89` | `16` | + +Decision: + +- Smaller admission budgets reduce the maximum number of waiting prompt slots + and raise the h2h `decode_agg_tps` metric, but 
they reduce aggregate + throughput and prefill throughput. +- `T=1536` gave only a small TTFT improvement (`23171.5 -> 22263.7 ms`) while + worsening wall time and aggregate throughput. +- `T=1024` worsened TTFT and aggregate throughput despite the highest + `decode_agg_tps`. +- Do not promote simple budget shrinkage as a parity lever. The next useful + scheduler work is a richer per-step histogram trace or a targeted first-token + admission policy, not a static lower `LLAMA_MAX_BATCH_TOKENS`. + +## Phase 54 Admission Histogram Trace + +Phase 54 extends the Phase51 trace with compact per-step histograms for prompt +tokens, decode tokens, and waiting prompt slots. This is still trace-only and +default-off behind `LLAMA_SERVING_TRACE=1`; it does not change scheduling or +inference. + +Fork commits: + +- `c6cb8460e feat(server): trace serving admission batches` +- `bd7b2e952 feat(server): add admission trace histograms` + +Artifact: + +- `/home/mudler/bench/phase54_admission_hist_trace/20260701_113201` + +Pre/post gates: + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Focused test/build: + +- Red test first: histogram assertions failed before implementation. +- Local fork: `test-server-admission-trace` passed, CTest passed, and + `llama-server` built. +- DGX `build-cuda`: `test-server-admission-trace` passed under CTest after the + temporary Phase51+Phase54 patch stack was applied. 
+ +Phase52-aligned dense trace: + +- Dense GGUF: `~/bench/q36-27b-nvfp4.gguf` +- `LLAMA_SERVING_TRACE=1` +- `N=128`, `PTOK=168`, `GEN=64` +- `CTX=131072`, `PARALLEL=128`, `BATCH=2048`, `UBATCH=512` + +H2H result: + +| n | prompt tokens | agg t/s | decode agg t/s | decode per-seq t/s | prefill t/s | TTFT mean ms | wall s | +|---|---------------|---------|-----------------|---------------------|-------------|--------------|--------| +| 128 | `22913` | `138.1` | `360.2` | `1.92` | `626.7` | `23393.2` | `59.303` | + +Trace: + +```text +serving admission trace: steps=76 decode_only_steps=0 decode_tokens=8064 prompt_tokens=22913 waiting_prompt_slots=267 max_waiting_prompt_slots=34 started_prompt_slots=128 continued_prompt_slots=139 last_n_batch=2048 last_n_ubatch=512 last_prefill_budget_step=0 last_prefill_cap_per_slot=0 prompt_hist=0:63,1-64:1,513+:12 decode_hist=0:3,1-63:10,64-127:10,128-255:53 waiting_hist=0:63,1-7:1,8-15:2,16-31:9,32-63:1 +``` + +Interpretation: + +- The Phase54 run matches the Phase52 serving envelope: same `76` steps, same + `8064` trace decode tokens, same `267` waiting prompt slots, and throughput + within noise. +- `63/76` steps have `prompt_tokens=0` and `waiting_prompt_slots=0`. +- Prompt admission is concentrated in a small number of very large chunks: + `prompt_hist=513+:12`. +- Decode is mostly full-width during active decode: + `decode_hist=128-255:53`. +- The scheduler still emits no pure decode-only steps for this shape. + +Decision: + +- The histogram strengthens the Phase53 rejection of static lower batch + budgets. The issue is not a uniformly oversized prompt budget every step; + prompt work arrives in a few large chunks and first-token latency remains high. +- The next scheduler A/B should be a targeted first-token admission or prompt + front-loading policy that changes when first prompt chunks are admitted, while + keeping md5/op gates unchanged. Do not reduce `LLAMA_MAX_BATCH_TOKENS` globally + as the next parity lever. 
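The histogram fields in the trace summary can be sanity-checked mechanically: each histogram should account for every scheduler step. A minimal sketch follows, assuming the `key=value` layout shown in the trace line above. The helper names `trace_field` and `hist_total` are hypothetical, and the sample line is abridged to the fields used.

```bash
# Sketch: extract fields from a trace summary line and sum histogram buckets.

trace_field() {   # trace_field <line> <key> -> prints the value for key
    local tok
    for tok in $1; do
        case $tok in
            "$2"=*) printf '%s\n' "${tok#*=}"; return 0 ;;
        esac
    done
    return 1
}

hist_total() {    # hist_total "0:63,1-64:1,513+:12" -> prints the bucket sum
    local sum=0 bucket
    local IFS=,
    for bucket in $1; do
        sum=$((sum + ${bucket##*:}))   # count is the text after the last ':'
    done
    printf '%s\n' "$sum"
}

# Abridged copy of the Phase54 trace line.
line='steps=76 decode_only_steps=0 prompt_hist=0:63,1-64:1,513+:12 decode_hist=0:3,1-63:10,64-127:10,128-255:53 waiting_hist=0:63,1-7:1,8-15:2,16-31:9,32-63:1'

steps=$(trace_field "$line" steps)
for key in prompt_hist decode_hist waiting_hist; do
    echo "$key total=$(hist_total "$(trace_field "$line" "$key")") steps=$steps"
done
```

For this line all three histograms total `76`, matching `steps=76`, and the `0:63` prompt bucket is the `63/76` figure quoted in the interpretation.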
+ +Mirror status: + +- Both trace commits are local and DGX-gated. +- The LocalAI `patches/paged/` series is not regenerated yet because the + handoff requires pushing the fork branch first, and pushes require explicit + approval. + +## Phase 55 TTFT Prefill-First Scheduler A/B + +Phase 55 tests the first targeted scheduler policy after Phase53 rejected static +budget shrinkage. The policy is default-off behind +`LLAMA_TTFT_PREFILL_FIRST=1`: while any prompt is still waiting for first-token +admission, defer token 2+ decode rows from already-started streams. This shifts +early compute toward prompt admission without lowering `LLAMA_MAX_BATCH_TOKENS`. + +Fork commits in the local stack: + +- `c6cb8460e feat(server): trace serving admission batches` +- `bd7b2e952 feat(server): add admission trace histograms` +- `8a97629a4 feat(server): add TTFT prefill-first scheduler mode` + +Artifact: + +- `/home/mudler/bench/phase55_ttft_prefill_first/20260701_114929` + +Pre/post/after-A-B gates: + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| after A/B | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +Focused test/build: + +- Red test first: `test-server-admission-policy` failed because + `server-admission-policy.h` did not exist. +- Local fork: `test-server-admission-policy` and + `test-server-admission-trace` passed, CTest passed, and `llama-server` built. +- DGX `build-cuda`: policy and trace tests passed under CTest after the + temporary Phase51+Phase54+Phase55 patch stack was applied. 
+ +Dense A/B shape: + +- Dense GGUF: `~/bench/q36-27b-nvfp4.gguf` +- `LLAMA_SERVING_TRACE=1` +- `N=128`, `PTOK=168`, `GEN=64` +- `CTX=131072`, `PARALLEL=128`, `BATCH=2048`, `UBATCH=512` + +H2H result: + +| variant | agg t/s | decode agg t/s | decode per-seq t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | +|---------|---------|-----------------|---------------------|-------------|--------------|-------------|--------| +| default | `138.2` | `361.3` | `1.91` | `626.0` | `23231.9` | `36599.5` | `59.272` | +| `LLAMA_TTFT_PREFILL_FIRST=1` | `142.9` | `336.9` | `1.86` | `694.2` | `21520.8` | `33008.2` | `57.323` | + +Delta: + +- Aggregate throughput: `+3.4%` +- Prefill throughput: `+10.9%` +- Mean TTFT: `-7.4%` +- Max TTFT: `-9.8%` +- Wall time: `-3.3%` +- h2h decode-agg: `-6.8%` + +Default trace: + +```text +serving admission trace: steps=76 decode_only_steps=0 decode_tokens=8064 prompt_tokens=22913 waiting_prompt_slots=267 max_waiting_prompt_slots=34 started_prompt_slots=128 continued_prompt_slots=139 ttft_deferred_decode_slots=0 last_n_batch=2048 last_n_ubatch=512 last_prefill_budget_step=0 last_prefill_cap_per_slot=0 prompt_hist=0:63,1-64:1,513+:12 decode_hist=0:3,1-63:10,64-127:10,128-255:53 waiting_hist=0:63,1-7:1,8-15:2,16-31:9,32-63:1 +``` + +Opt-in trace: + +```text +serving admission trace: steps=76 decode_only_steps=0 decode_tokens=8064 prompt_tokens=22913 waiting_prompt_slots=267 max_waiting_prompt_slots=35 started_prompt_slots=128 continued_prompt_slots=139 ttft_deferred_decode_slots=660 last_n_batch=2048 last_n_ubatch=512 last_prefill_budget_step=0 last_prefill_cap_per_slot=0 prompt_hist=0:63,1-64:1,257-512:1,513+:11 decode_hist=0:13,128-255:63 waiting_hist=0:63,1-7:1,8-15:3,16-31:8,32-63:1 +``` + +Decision: + +- Keep Phase55 as a promising default-off scheduler A/B. It improves TTFT and + aggregate throughput on the dense `n=128` serving shape while all md5/op gates + remain green. 
+- The drop in h2h `decode_agg_tps` is expected because the policy intentionally + defers token 2+ decode rows during prompt backlog. It should not be treated as + a correctness or kernel regression without a true decode profile. +- Next phase should test the same opt-in policy on the MoE serving shape and at + another concurrency point before any default-on discussion. + +Mirror status: + +- The Phase55 fork commit is local and DGX-gated. +- The LocalAI `patches/paged/` series is not regenerated yet because the fork + branch still requires explicit push approval first. + +## Phase 56 TTFT Prefill-First Validation + +Phase 56 validates the Phase55 opt-in policy outside dense `n=128`. It makes no +code changes; the same Phase51+Phase54+Phase55 stack was applied temporarily to +the clean DGX mirror and reverted after the run. + +Artifact: + +- `/home/mudler/bench/phase56_ttft_prefill_first_validation/20260701_115852` + +Pre/post gates: + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +MoE `n=128`, `ptok=128`, `gen=64`: + +| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred decode slots | +|---------|---------|-----------------|-------------|--------------|-------------|--------|-----------------------| +| default | `341.1` | `651.2` | `1555.9` | `7168.1` | `11435.5` | `24.015` | `0` | +| `LLAMA_TTFT_PREFILL_FIRST=1` | `339.9` | `623.8` | `1622.7` | `7615.3` | `10964.4` | `24.098` | `441` | + +MoE deltas: + +- Aggregate throughput: `-0.4%` +- Prefill throughput: `+4.3%` +- Mean TTFT: `+6.2%` +- Max TTFT: `-4.1%` +- Wall time: `+0.3%` + +Dense `n=32`, `ptok=168`, `gen=64`: + +| variant | agg t/s | decode agg t/s | prefill t/s | TTFT 
mean ms | TTFT max ms | wall s | deferred decode slots | +|---------|---------|-----------------|-------------|--------------|-------------|--------|-----------------------| +| default | `104.3` | `197.1` | `617.2` | `7687.7` | `9234.4` | `19.627` | `0` | +| `LLAMA_TTFT_PREFILL_FIRST=1` | `106.7` | `193.5` | `662.1` | `7284.3` | `8609.1` | `19.194` | `34` | + +Dense `n=32` deltas: + +- Aggregate throughput: `+2.3%` +- Prefill throughput: `+7.3%` +- Mean TTFT: `-5.2%` +- Max TTFT: `-6.8%` +- Wall time: `-2.2%` + +Decision: + +- Keep `LLAMA_TTFT_PREFILL_FIRST=1` as an opt-in A/B only. It helps dense + `n=128` and dense `n=32`, but MoE `n=128` regresses mean TTFT and slightly + regresses aggregate throughput. +- Do not make this policy default-on or promote it as a universal parity lever. + The next scheduler work should either narrow the policy to dense/non-MoE + shapes or add a more selective condition that avoids the MoE mean-TTFT + regression. + +## Phase 57 TTFT Prefill-First Cap Sweep + +Phase 57 adds an optional per-step cap to the Phase55 opt-in policy: +`LLAMA_TTFT_PREFILL_FIRST_MAX_DEFER`. Unset or `0` preserves the Phase55 +unlimited behavior. The goal was to keep some first-token relief while avoiding +the MoE `n=128` mean-TTFT regression from Phase56. 
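The cap semantics described above can be modeled in a few lines of shell. This is a sketch of the decision only, not the fork's C++ implementation: the function name `deferred_rows`, its two arguments, and the reading of "per-step cap" as a maximum number of decode rows deferred in one scheduler step are assumptions.

```bash
# Sketch: how many token-2+ decode rows would be deferred in one step.
# deferred_rows <waiting_prompt_slots> <decode_ready_rows>
deferred_rows() {
    local waiting=$1 ready=$2
    local enabled=${LLAMA_TTFT_PREFILL_FIRST:-0}
    local cap=${LLAMA_TTFT_PREFILL_FIRST_MAX_DEFER:-0}

    # Policy is opt-in and only active while prompts wait for a first token.
    if [ "$enabled" != 1 ] || [ "$waiting" -le 0 ]; then
        echo 0
        return
    fi
    # Unset or 0 means unlimited deferral (the Phase55 behavior).
    if [ "$cap" -gt 0 ] && [ "$ready" -gt "$cap" ]; then
        echo "$cap"
    else
        echo "$ready"
    fi
}

LLAMA_TTFT_PREFILL_FIRST=1
LLAMA_TTFT_PREFILL_FIRST_MAX_DEFER=32
deferred_rows 20 96   # backlog present, cap applies: prints 32
deferred_rows 0 96    # no prompt backlog: prints 0
```

Under this reading, the `cap16`/`cap32`/`cap64` variants in the tables below differ only in how many ready decode rows may be held back per step while prompts are still queued.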
+ +Fork commit: + +- `3b6ab5fa8 feat(server): cap TTFT prefill-first decode deferral` + +Artifact: + +- `/home/mudler/bench/phase57_ttft_cap_sweep/20260701_120830` + +Pre/post gates: + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +MoE `n=128`, `ptok=128`, `gen=64`: + +| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred | +|---------|---------|-----------------|-------------|--------------|-------------|--------|----------| +| default | `337.1` | `652.0` | `1516.1` | `7425.5` | `11735.7` | `24.299` | `0` | +| cap16 | `330.2` | `611.5` | `1559.6` | `7589.4` | `11407.9` | `24.802` | `111` | +| cap32 | `335.3` | `624.6` | `1572.4` | `6994.0` | `11315.5` | `24.429` | `236` | +| cap64 | `327.1` | `589.6` | `1596.9` | `7533.2` | `11141.5` | `25.025` | `339` | + +Dense `n=128`, `ptok=168`, `gen=64`: + +| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred | +|---------|---------|-----------------|-------------|--------------|-------------|--------|----------| +| default | `141.4` | `360.6` | `650.8` | `22423.5` | `35209.6` | `57.925` | `0` | +| cap32 | `139.7` | `340.1` | `663.1` | `20346.5` | `34556.0` | `58.645` | `322` | +| cap64 | `136.3` | `333.4` | `645.2` | `22461.1` | `35511.7` | `60.081` | `490` | + +Decision: + +- Reject capped TTFT defer as a parity lever. MoE cap32 improves mean TTFT + versus same-window default (`7425.5 -> 6994.0 ms`) but still loses aggregate + throughput and wall time. Dense caps improve or preserve TTFT only by losing + aggregate throughput and wall time. +- Keep the cap as an A/B knob only; do not promote it as a default or parity + path. 
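The cap32 trade-off in the decision above can be re-derived from the MoE table as relative deltas. This is a small worked calculation; `pct_delta` is a helper name introduced here for illustration, and the inputs are the `default` and `cap32` rows of the MoE `n=128` table.

```python
def pct_delta(baseline, variant):
    """Relative change of variant versus baseline, in percent."""
    return (variant - baseline) / baseline * 100.0

# MoE n=128 rows from the Phase 57 table: (default, cap32)
rows = {
    "agg t/s": (337.1, 335.3),
    "TTFT mean ms": (7425.5, 6994.0),
    "wall s": (24.299, 24.429),
}
deltas = {name: pct_delta(*pair) for name, pair in rows.items()}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")
```

Mean TTFT improves by roughly 5.8%, while aggregate throughput and wall time each move about 0.5% in the wrong direction, which is why the cap stays an A/B knob rather than a parity lever.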
+ +## Phase 58 TTFT Prefill-First Waiting Threshold + +Phase 58 adds `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING`, a prompt-backlog threshold +for the Phase55 opt-in policy. Unset or `0` preserves prior behavior. The goal +was to activate decode deferral only during high prompt-backlog windows instead +of for the entire prompt backlog lifetime. + +Fork commit: + +- `8759213e3 feat(server): gate TTFT defer by prompt backlog` + +Artifact: + +- `/home/mudler/bench/phase58_ttft_waiting_sweep/20260701_122052` + +Pre/post gates: + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +MoE `n=128`, `ptok=128`, `gen=64`: + +| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred | +|---------|---------|-----------------|-------------|--------------|-------------|--------|----------| +| default | `339.0` | `648.4` | `1542.9` | `7743.1` | `11532.5` | `24.167` | `0` | +| min24 | `339.9` | `619.3` | `1637.0` | `7326.6` | `10868.8` | `24.095` | `323` | +| min32 | `341.9` | `635.0` | `1609.6` | `7420.1` | `11054.6` | `23.950` | `220` | +| min32+cap32 | `331.2` | `631.8` | `1512.1` | `7829.2` | `11767.1` | `24.733` | `140` | + +Dense `n=128`, `ptok=168`, `gen=64`: + +| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred | +|---------|---------|-----------------|-------------|--------------|-------------|--------|----------| +| default | `140.3` | `362.7` | `639.8` | `21407.3` | `35811.6` | `58.399` | `0` | +| min24 | `140.4` | `347.6` | `658.7` | `22078.2` | `34783.3` | `58.353` | `420` | +| min32 | `139.7` | `350.2` | `650.1` | `21221.5` | `35246.3` | `58.642` | `386` | + +Decision: + +- Keep 
`LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` as the best selective + scheduler A/B so far, but still opt-in. On MoE `n=128`, min32 improved + aggregate throughput (`339.0 -> 341.9`), mean TTFT (`7743.1 -> 7420.1 ms`), + max TTFT (`11532.5 -> 11054.6 ms`), and wall time (`24.167 -> 23.950 s`). +- Dense `n=128` min32 was mixed: mean/max TTFT improved slightly, but aggregate + and wall regressed slightly. Do not default-on yet. +- Next step should repeat the MoE min32 result and run the matching vLLM h2h + comparison before treating this as real parity progress rather than run noise. + +## Phase 59 MoE Min32 Repeat and vLLM H2H + +Phase 59 repeats the Phase58 MoE min32 point and compares it to a matching vLLM +serving run. The Phase51+Phase54+Phase55+Phase57+Phase58 stack was applied +temporarily to the clean DGX mirror for the llama.cpp runs, then reverted before +the vLLM run. + +Artifact: + +- `/home/mudler/bench/phase59_moe_min32_repeat_vllm/20260701_123147` + +Pre/post llama gates: + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post llama | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +MoE `n=128`, `ptok=128`, `gen=64`: + +| engine / variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred | +|------------------|---------|-----------------|-------------|--------------|-------------|--------|----------| +| llama default | `336.6` | `646.7` | `1525.1` | `7798.5` | `11666.8` | `24.334` | `0` | +| llama min32 | `336.9` | `632.0` | `1567.1` | `7167.8` | `11353.4` | `24.316` | `279` | +| vLLM | `601.3` | `938.8` | `3648.7` | `2968.1` | `4871.6` | `13.563` | n/a | + +Llama min32 repeat versus llama default: + +- Aggregate throughput: `+0.1%` +- Mean TTFT: `-8.1%` +- Max TTFT: `-2.7%` +- Wall 
time: `-0.1%` +- Prefill throughput: `+2.8%` +- Decode aggregate throughput: `-2.3%` + +Llama min32 versus vLLM: + +- Aggregate throughput ratio: `0.560` +- Mean TTFT: llama is `2.415x` slower +- Wall time: llama is `1.793x` slower +- Prefill throughput ratio: `0.430` +- Decode aggregate throughput ratio: `0.673` + +Decision: + +- The min32 repeat confirms a real, inference-gated llama.cpp scheduler QoS + improvement for MoE `n=128`: mean TTFT drops without material aggregate or + wall-time loss. +- It does not close parity with vLLM. vLLM remains much faster on the same + request shape, especially prefill throughput and TTFT. +- Keep `LLAMA_TTFT_PREFILL_FIRST=1` plus + `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` opt-in. Do not default it yet. +- Treat this as latency tuning, not the next parity track. The larger gap is + still prefill / MoE compute. + +## Phase 60 Current W4A16 Prefill Profile + +Phase 60 re-profiles the current clean W4A16 grouped MoE prefill path after the +Phase1-5 W4A16 work, to decide whether another low-conflict W4A16 patch is +justified. 
+ +Artifact: + +- `/home/mudler/bench/phase60_w4a16_current_profile/20260701_104915` + +Pre/post gates: + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +MoE `llama-batched-bench`, `npl=32`, `ntg=4`: + +| path | PP | S_PP t/s | T_PP s | S_TG t/s | total S t/s | +|------|----|----------|--------|----------|-------------| +| default FP4-MMQ | `512` | `2327.69` | `7.039` | `399.87` | `2243.83` | +| default FP4-MMQ | `2048` | `2423.20` | `27.045` | `391.58` | `2398.94` | +| forced W4A16 | `512` | `1451.00` | `11.291` | `319.32` | `1412.21` | +| forced W4A16 | `2048` | `1482.76` | `44.199` | `303.40` | `1471.61` | + +Forced W4A16 remains `0.623x` default FP4-MMQ at `npp=512` and `0.612x` at +`npp=2048`. + +`npp=512` profile: + +| path | top bucket | time % | total time | +|------|------------|--------|------------| +| default FP4-MMQ | `mul_mat_q` | `39.2%` | `2.712s` | +| default FP4-MMQ | `quantize_mmq_nvfp4` | `4.5%` | `0.314s` | +| forced W4A16 | `w4a16_grouped_kernel<32,128,1,4,2>` | `42.5%` | `4.142s` | +| forced W4A16 | `k_get_rows_float` | `11.2%` | `1.094s` | +| forced W4A16 | `w4a16_cast_act_f32_bf16` | `5.3%` | `0.517s` | +| forced W4A16 | residual `quantize_mmq_nvfp4` | `1.4%` | `0.132s` | + +Decision: + +- Reject another small W4A16 body/metadata/cast tweak as the next parity phase. + The current forced path avoids most activation quantization, but the grouped + W4A16 kernel itself is `1.53x` slower than default MMQ's main `mul_mat_q` + bucket at `npp=512`, and sorted activation gathers add another `1.094s`. +- Eliminating the cast kernel entirely would recover only `5.3%` of the forced + W4A16 profile, not the `37-39%` end-to-end S_PP loss. 
+- Any future W4A16 parity work must be a larger redesign that improves the + grouped kernel body and removes or fuses the sorted activation gather. Do not + reopen the low-conflict micro-patch track. + +## W4A16 Direct-Activation Phase61 Result + +Phase61 tested the larger W4A16 direct-activation redesign. It passed default +inference gates and opt-in direct-A correctness: + +- Default gates artifact: + `/home/mudler/bench/phase61_direct_default_gates/20260701_132057` +- A/B artifact: `/home/mudler/bench/phase61_direct_ab/20260701_132237` +- Default MoE md5: `8cb0ce23777bf55f92f63d0292c756b0` +- Default dense md5: `5951a5b4d624ce891e22ab5fca9bc439` +- `MUL_MAT`: `1146/1146` +- `MUL_MAT_ID`: `806/806` +- Forced W4A16 and direct-A MoE md5: + `07db32c2bcb78d17a43ed18bc22705cd` + +The direct path had to mirror `get_rows_cuda` flat-row source addressing. A +token/slot decode of `ids_to_sorted` failed `b=1` NVFP4 op cases; flat +`src_row*nb11` addressing fixed the gate. + +MoE prefill A/B (`npl=32`, `ntg=4`): + +| path | npp512 S_PP | npp2048 S_PP | +|------|-------------|--------------| +| default FP4-MMQ | `2325.45` | `2423.18` | +| forced W4A16 | `1471.05` | `1502.46` | +| forced W4A16 direct-A | `1566.30` | `1605.82` | + +Decision: reject. Direct-A improved forced W4A16 by only `+6.5%` and `+6.9%`, +and still reached only `0.67x` / `0.66x` of default FP4-MMQ. The rejected direct +kernel diff was saved to `/tmp/phase61-w4a16-direct-a-rejected.diff` and not +committed. Do not continue W4A16 body tuning on GB10 as the next parity lever. + +## MTP Verify-Cost Phase62 Result + +Phase62 is recorded in +`docs/superpowers/plans/2026-07-01-mtp-verify-cost-phase62.md`. +It was a measurement and decision phase, not an MTP enablement phase. 
+ +The phase exists because the current code already has the required speculative +acceptance telemetry: + +- server summary lines report draft acceptance, mean acceptance length, and + acceptance rate per position, +- `common_speculative_print_stats()` reports generated and accepted draft tokens, + mean acceptance length, and per-position acceptance, +- `LLAMA_SPEC_SHAPE_TRACE=1` reports decode and verify row shapes. + +Phase62 must preserve default inference with pre/post gates: + +- MoE paged md5 `8cb0ce23777bf55f92f63d0292c756b0`, +- dense md5 `5951a5b4d624ce891e22ab5fca9bc439`, +- `MUL_MAT_ID` `806/806`. + +Artifact: + +- `/home/mudler/bench/phase62_mtp_verify_cost/20260701_134125` + +Pre/post gates passed: + +```text +moe md5 OK: 8cb0ce23777bf55f92f63d0292c756b0 +dense md5 OK: 5951a5b4d624ce891e22ab5fca9bc439 + 806/806 tests passed + Backend CUDA0: OK +paged inference gates OK +``` + +Serving result: + +| n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms | +|---|---------------------|----------------|----------------|------------------|-------------| +| 8 | 248.5 | 104.4 | 42.0% | 1150.4 | 1682.9 | +| 32 | 411.8 | 112.8 | 27.4% | 2607.9 | 4444.7 | +| 128 | 696.5 | 148.1 | 21.3% | 7425.2 | 20155.8 | + +MTP acceptance was not the blocker: + +```text +#gen tokens = 9340 +#acc tokens = 7372 +acceptance = 0.78929 +#mean acc len = 3.33 +#acc rate/pos = (0.877, 0.767, 0.691) +graphs reused = 1 +``` + +Shape trace: + +```text +rows total 3212; rows=4: 3070 (95.6%) +draft total 3212; draft=3: 3070 (95.6%) +batch_after total 3212; unique values 495 +``` + +Decision: + +- Reject another MTP implementation phase for now. +- Phase62 kept default inference green with md5/op gates, but MTP remains + rejected unless a later design removes target-verify/output-row graph cost. +- Do not tune `spec-draft-n-max` blindly. 
Phase15, Phase19, and Phase62 all + showed high acceptance with poor serving throughput, so the remaining question + is verify cost, not whether MTP can draft. + +## Prefill Bucket Attribution Phase63 Result + +Phase63 is recorded in +`docs/superpowers/plans/2026-07-01-prefill-bucket-attribution-phase63.md`. +It was a measurement and decision phase, not a source patch phase. + +Artifact: + +- `/home/mudler/bench/phase63_prefill_bucket/20260701_140127` + +Pre/post inference gates passed: + +| gate | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +llama.cpp MoE prefill, `npl=32`, `ntg=4`: + +| npp | S_PP | MoE/FFN-GEMM | GDN | bf16-proj | layout-copy | act-quant | MoE-dispatch | gather | FA | +|-----|------|--------------|-----|-----------|-------------|-----------|--------------|--------|----| +| 512 | `2248.20` | `40.48%` | `18.00%` | `10.19%` | `7.82%` | `4.47%` | `1.94%` | `1.26%` | `0.71%` | +| 2048 | `2385.22` | `41.06%` | `16.15%` | `9.97%` | `7.96%` | `4.61%` | `2.12%` | `1.36%` | `1.18%` | + +vLLM MoE prefill, `NSEQ=32`, `GEN=1`, `NREP=3`, eager profile path: + +| PT | S_PP | ew/glue | GDN | FA | bf16-proj | MoE-dispatch | top unclassified | +|----|------|---------|-----|----|-----------|--------------|------------------| +| 512 | `5315.6` | `32.97%` | `18.34%` | `0.73%` | `3.41%` | `1.37%` | Marlin MoE `1940.99ms`, FP8 projection `565.74ms` | +| 2048 | `5384.4` | `33.48%` | `18.00%` | `1.75%` | `1.06%` | `0.49%` | Marlin MoE `7745.84ms`, FP8 projection `3047.75ms` | + +Decision: + +- Reject a Phase63 paged FlashAttention mask/block-table source patch. 
llama.cpp + FA is only `1.18%` of prefill GPU kernel time at `npp=2048`, which triggers the `<5%` + reject rule and is far below the `8%` source-funding threshold. +- The `npp=2048` FA cost is about `4.9 us/tok` for llama.cpp and `3.1 us/tok` + for vLLM, so the cross-engine FA delta is only about `1.7 us/tok`, below the + `15 us/tok` funding threshold. +- The dominant remaining llama.cpp buckets are still MoE/FFN GEMM, GDN, + bf16 projections, layout copies, and activation quantization. Phase63 did not + identify a new low-conflict source patch that can move GB10 parity without + reopening already-rejected W4A16/GDN/MTP/small-M work. +- No llama.cpp source files were modified. Default inference stayed green with + the canonical md5/op gates. + +## Layout Trace Phase64 Result + +Phase64 is recorded in +`docs/superpowers/plans/2026-07-01-layout-trace-phase64.md`. +It added default-off CUDA layout attribution to the llama.cpp fork: + +- Fork commit: `fa944bb5f feat(cuda): trace layout tensor names` +- Env gate: `LLAMA_LAYOUT_TRACE=` +- Traced runtime routes: `GET_ROWS`, `CPY`, `CONT`, `DUP`, `CONCAT` +- DGX artifact: `/home/mudler/bench/phase64_layout_trace/20260701_142519` + +Patched build gates passed: + +| check | value | +|-------|-------| +| MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| `MUL_MAT` | `1146/1146` | +| `MUL_MAT_ID` | `806/806` | + +Bounded `npp=512` trace distribution: + +| route | lines | +|-------|------:| +| `get_rows` | `7268` | +| `cpy` | `2008` | +| `cont` | `1734` | +| `concat` | `990` | + +Top traced layout sources: + +- `concat conv_states_reshaped-N + qkv_mixed_transposed-N -> conv_input-N` +- `cpy conv_state_last-N -> conv_state_update-N` +- `get_rows cache_r_lN -> conv_states-N` +- `get_rows ffn_moe_probs-N -> ffn_moe_weights-N` +- `get_rows node_* with ffn_moe_topk-N` for expert fan-in weights +- attention mask/KV reshapes and f32-to-f16 copies for paged full-attention layers + +Decision: + 
+- Keep the instrumentation in the fork as a default-off diagnostic patch. +- Do not fund a Phase64 layout optimization yet. The trace points at GDN + conv-state materialization, MoE top-k fan-in gathers, and paged-attention + mask/KV reshapes, not a single clean projection/layout shortcut. +- Any Phase65 source work must either remove a named repeated layout chain with + md5/op gates, or close as another measured no-go. + +## Quant Trace Phase65 Result + +Phase65 is recorded in +`docs/superpowers/plans/2026-07-01-quant-trace-phase65.md`. +It added default-off activation-quant route attribution to the llama.cpp fork: + +- Fork commit: `afc2c7030 feat(cuda): trace activation quant routes` +- Env gate: `LLAMA_QUANT_TRACE=` +- DGX mirror commit: `7863194bd feat(cuda): trace activation quant routes` +- DGX artifact: `/home/mudler/bench/phase65_quant_trace/20260701_143729` + +Patched build gates passed: + +| check | value | +|-------|-------| +| MoE md5 | `8cb0ce23777bf55f92f63d0292c756b0` | +| dense md5 | `5951a5b4d624ce891e22ab5fca9bc439` | +| `MUL_MAT` | `1146/1146` | +| `MUL_MAT_ID` | `806/806` | + +Bounded MoE `npp=512`, `ntg=4`, `npl=32` quant trace: + +| route | lines | +|-------|------:| +| `mmq_dense` | `4444` | +| `mmq_moe_dedup_unique` | `2960` | +| `mmq_moe_gather` | `2960` | +| `mmq_moe_flat` | `1480` | + +Dominant default-path shapes: + +| count | route | source family | K | rows | ne12 | +|------:|-------|---------------|---:|-----:|-----:| +| `2560` | `mmq_moe_dedup_unique` | gate/up experts | `2048` | `512` | `512` | +| `2560` | `mmq_moe_gather` | gate/up experts | `2048` | `4096` | `512` | +| `2560` | `mmq_dense` | shared expert gate/up | `2048` | `512` | `1` | +| `1280` | `mmq_moe_flat` | down experts | `512` | `4096` | `512` | +| `1280` | `mmq_dense` | shared expert down | `512` | `512` | `1` | + +Decision: + +- Keep the instrumentation in the fork as a default-off diagnostic patch. +- Do not fund a quantization optimization from route counts alone. 
The trace + confirms the activation-quant bucket is concentrated in MoE gate/up dedup plus + gather, MoE down flat quantization, and shared-expert dense quantization, but + it does not prove which sub-kernel is material. +- Phase66 should time `quantize_mmq_nvfp4` versus `gather_mmq_fp4` with + nsys/NVTX before changing source behavior. + +## Quant Kernel Timing Phase66 Result + +Phase66 is recorded in +`docs/superpowers/plans/2026-07-01-quant-kernel-timing-phase66.md`. +It used the Phase65-gated binary and Nsight Systems to time the activation-quant +candidate kernels directly. + +- DGX artifact: `/home/mudler/bench/phase66_quant_kernel_timing/20260701_144256` +- Profile: `quant_npp512.nsys-rep` +- Kernel summary: `quant_npp512_kern_sum_cuda_gpu_kern_sum.csv` +- Shape: MoE `npp=512`, `ntg=4`, `npl=32` + +Observed total GPU kernel time: `7108388986 ns`. + +| kernel | time | instances | share | +|--------|-----:|----------:|------:| +| `quantize_mmq_nvfp4` | `317205504 ns` | `8884` | `4.46%` | +| `gather_mmq_fp4` | `45374880 ns` | `2960` | `0.64%` | +| combined | `362580384 ns` | - | `5.10%` | + +Decision: + +- Reject a Phase66 gather/quant source optimization. `gather_mmq_fp4` is not a + material standalone target, and `quantize_mmq_nvfp4 + gather_mmq_fp4` is below + the `8%` source-funding threshold for this shape. +- Do not reopen W4A16/no-activation-quant from this evidence. Earlier W4A16 + phases already rejected that rewrite; Phase66 only rules out a smaller + gather/quant shortcut. + +## BF16 cuBLAS F32 Output Phase67 Result + +Phase67 is recorded in +`docs/superpowers/plans/2026-07-01-bf16-cublas-f32-output-phase67.md`. 
+It added a default-off BF16 projection shortcut: + +- Fork commit: `ea0875d14 feat(cuda): gate BF16 cuBLAS F32 output` +- Env gate: `LLAMA_BF16_CUBLAS_F32_OUT=1` +- DGX mirror commit: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output` +- DGX artifact: `/home/mudler/bench/phase67_bf16_f32_out/20260701_144909` + +Default and opt-in gates passed: + +| mode | MoE md5 | dense md5 | `MUL_MAT` | +|------|---------|-----------|-----------| +| default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | +| opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | + +Same-window MoE prefill A/B: + +| npp | default S_PP | opt-in S_PP | change | +|-----|-------------:|------------:|-------:| +| `512` | `2347.41` | `2402.34` | `+2.34%` | +| `2048` | `2440.18` | `2456.54` | `+0.67%` | + +Opt-in `npp=512` nsys profile: + +| row | value | +|-----|------:| +| total GPU kernel time | `7020867757 ns` | +| `convert_unary<__nv_bfloat16, float>` | `0 ns`, `0` instances | +| `convert_unary` | `159651026 ns`, `6840` instances, `2.27%` | + +Decision: + +- Keep the patch as a default-off opt-in shortcut. It is md5/op clean and + removes the profiled BF16-to-F32 conversion row for this shape. +- Do not make it default-on yet. The gain is modest and needs dense plus serving + A/B before a default policy change. + +## BF16 F32 Output Dense Serving Phase68 Result + +Phase68 is recorded in +`docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md`. +It reused the Phase67 source commit and did not change llama.cpp source. 
+ +- Fork commit under test: `ea0875d14 feat(cuda): gate BF16 cuBLAS F32 output` +- DGX mirror commit under test: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output` +- Env gate: `LLAMA_BF16_CUBLAS_F32_OUT=1` +- DGX artifact: `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710` +- Serving A/B artifact: `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710/serving_ab_20260701_150249` + +Correctness basis for this exact source commit remains the Phase67 default and +opt-in gates: + +| mode | MoE md5 | dense md5 | `MUL_MAT` | +|------|---------|-----------|-----------| +| default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | +| opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | + +Dense same-window prefill A/B: + +| npp | default S_PP | opt-in S_PP | change | +|-----|-------------:|------------:|-------:| +| `512` | `973.13` | `975.52` | `+0.25%` | +| `2048` | `1019.88` | `1021.39` | `+0.15%` | + +MoE serving A/B, `N=128`, prompt `128`, generation `128`, `--parallel 128`: + +| metric | default | opt-in | change | +|--------|--------:|-------:|-------:| +| `agg_tps` | `409.8` | `415.0` | `+1.27%` | +| `decode_agg_tps` | `615.3` | `627.2` | `+1.93%` | +| `decode_perseq_tps` | `4.15` | `4.16` | `+0.24%` | +| `prefill_tps` | `1630.2` | `1648.0` | `+1.09%` | +| `ttft_mean_ms` | `8574.7` | `8085.9` | `-5.70%` | +| `wall_s` | `39.978` | `39.480` | `-1.25%` | + +Decision: + +- Keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off. The dense prefill gain is + positive but too small to justify a default policy change. +- The opt-in is now worth carrying forward: MoE prefill, dense prefill, and the + small MoE serving window all moved in the right direction without changing the + Phase67 md5/op correctness gates. +- Next default-on consideration requires regenerating the LocalAI patch series + from the fork and rerunning the broader current serving snapshot gates. 
Do not + default it from Phase68 alone. + +## Patch Series Mirror Readiness Phase69 Result + +Phase69 is recorded in +`docs/superpowers/plans/2026-07-01-patch-series-mirror-readiness-phase69.md`. +It did not change llama.cpp source and did not edit generated LocalAI patch +files. It verified that the current LocalAI series is still drift-free at the +Phase37 tip, then dry-ran the additive patches needed to mirror the current +local fork HEAD. + +Current committed series: + +| check | value | +|-------|-------| +| base | `0ed235ea2c17a19fc8238668653946721ed136fd` | +| patch count | `54` | +| applied tree | `dedb1182910eafe9f6875588dc8285bfb544cce5` | +| Phase37 fork-tip tree | `dedb1182910eafe9f6875588dc8285bfb544cce5` | +| current fork HEAD tree | `fcf5720b659c5e1e2b487ccf3c8f7289bb12b9c4` | +| committed series matches Phase37 tip | `yes` | +| committed series matches current fork HEAD | `no` | + +Dry-run export from `2d590d770..ea0875d14` produced ten additive source-only +candidate patches: + +| projected patch | source commit | +|-----------------|---------------| +| `0064-feat-server-trace-serving-admission-batches.patch` | `c6cb8460e` | +| `0065-feat-server-add-admission-trace-histograms.patch` | `bd7b2e952` | +| `0066-feat-server-add-TTFT-prefill-first-scheduler-mode.patch` | `8a97629a4` | +| `0067-feat-server-cap-TTFT-prefill-first-decode-deferral.patch` | `3b6ab5fa8` | +| `0068-feat-server-gate-TTFT-defer-by-prompt-backlog.patch` | `8759213e3` | +| `0069-test-cuda-cover-W4A16-direct-activation-policy.patch` | `41be3da5b` | +| `0070-feat-cuda-route-W4A16-direct-activation-stub.patch` | `7967ad47f` | +| `0071-feat-cuda-trace-layout-tensor-names.patch` | `fa944bb5f` | +| `0072-feat-cuda-trace-activation-quant-routes.patch` | `afc2c7030` | +| `0073-feat-cuda-gate-BF16-cuBLAS-F32-output.patch` | `ea0875d14` | + +Projected mirror check: + +| check | value | +|-------|-------| +| current patches | `54` | +| missing patches | `10` | +| projected patches | `64` | +| 
applied plus missing tree | `fcf5720b659c5e1e2b487ccf3c8f7289bb12b9c4` | +| fork HEAD tree | `fcf5720b659c5e1e2b487ccf3c8f7289bb12b9c4` | +| projected series matches fork HEAD | `yes` | + +Decision: + +- The Phase68 BF16 F32 opt-in would become projected patch `0073` and has a + conflict-free path into the LocalAI series. +- Do not commit generated patches yet. The fork branch is `26` commits ahead of + `fork/localai-paged`, and the repo workflow requires pushing the fork before + regenerating the LocalAI patch series. Push still requires explicit approval. +- After push approval, regenerate `0064..0073`, repeat the tree hash check, and + only then run broader serving gates for any default-on BF16 policy decision. + +## BF16 F32 Output Broader Serving Phase70 Result + +Phase70 is recorded in +`docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md`. +It did not change llama.cpp source and did not edit generated LocalAI patches. +It also creates the running benchmark ledger at +`backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md`. 
+ +- DGX artifact: `/home/mudler/bench/phase70_bf16_broader_serving/20260701_151500` +- Source under test: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output` +- Shape: MoE serving, `NPL=8 32 128`, prompt `128`, generation `64`, + `PARALLEL=128`, `CTX=131072` + +Pre/post gates passed: + +| gate | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|------|---------|-----------|-----------|--------------| +| pre default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| pre opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | not run | +| post default | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post opt-in | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | not run | + +Serving A/B and vLLM comparison: + +| n | default agg | opt-in agg | vLLM agg | default decode | opt-in decode | vLLM decode | +|---:|------------:|-----------:|---------:|---------------:|--------------:|------------:| +| `8` | `178.5` | `158.8` | `260.9` | `242.6` | `218.3` | `299.5` | +| `32` | `250.1` | `247.9` | `465.3` | `418.7` | `417.6` | `608.4` | +| `128` | `322.5` | `324.8` | `659.9` | `706.2` | `697.9` | `1020.4` | + +Ratios: + +| n | opt/default agg | opt/default decode | opt/default TTFT | default decode/vLLM | opt decode/vLLM | +|---:|----------------:|-------------------:|-----------------:|--------------------:|----------------:| +| `8` | `0.8896` | `0.8998` | `1.1247` | `0.8100` | `0.7289` | +| `32` | `0.9912` | `0.9974` | `1.0320` | `0.6882` | `0.6864` | +| `128` | `1.0071` | `0.9882` | `0.9852` | `0.6921` | `0.6839` | + +Decision: + +- Reject default-on for `LLAMA_BF16_CUBLAS_F32_OUT=1`. +- Keep the shortcut as default-off only. It is correctness-clean, but the + broader serving window regressed `n=8` materially and slightly widened the + vLLM decode gap at `n=32` and `n=128`. 
+- The next parity phase should not spend more time on this default policy. Use + the benchmark ledger for every following attempt. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_REOPEN_SPEC.md b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_REOPEN_SPEC.md new file mode 100644 index 000000000000..5fa4353978d4 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_REOPEN_SPEC.md @@ -0,0 +1,376 @@ +# GB10 vLLM Parity Reopen Spec + +Status: scoped follow-up. This document intentionally challenges the current +`VLLM_PARITY_FINAL.md` conclusion that GB10 parity is closed. The final record is +still useful as a baseline, but the follow-up work must treat it as a hypothesis +to test, not as a proof of impossibility. + +## Goal + +Determine whether llama.cpp / ggml can close the remaining GB10 parity gap for +Qwen3.6 NVFP4 hybrid gated-DeltaNet models by porting or adapting concrete vLLM +implementation ideas, while preserving LocalAI's hard correctness gates. + +Success means one of two outcomes: + +1. A measured, source-backed path improves paged llama.cpp materially toward vLLM + parity on GB10. +2. The remaining gap is rejected with clean provenance: clean source, clean DGX + host state, artifact-pinned A/B results, and explicit correctness gates. + +## Non-goals + +- Do not accept a "closed" conclusion based only on existing docs. +- Do not run long builds or benchmarks without a recorded DGX preflight. +- Do not edit `patches/paged/*.patch` directly. Kernel changes land fork-first in + `mudler/llama.cpp:localai-paged`, then the LocalAI patch series is regenerated. +- Do not treat a standalone PoC as a result. Every performance claim requires an + in-backend A/B. +- Do not ship lossy paths default-on. Non-byte-identical paths require KL gates. + +## Required Preflight + +Before any DGX build, benchmark, or profile: + +1. `docker ps` must show no running containers, especially no `local-ai-worker`. +2. 
`nvidia-smi --query-compute-apps=pid` must show zero compute apps. +3. `~/gpu_bench_lock/owner` must be absent or `FREE*`. +4. Record hostname, git SHA, dirty status, build arch, binary mtimes, model paths, + benchmark command, and environment variables. + +Use `~/_git/llama.cpp` as the local source of truth. DGX source trees are allowed +for builds and artifact inspection, but dirty DGX checkouts must not be treated as +canonical source. + +## Evidence From Subagent Audit + +Four read-only subagents audited the current state: + +- llama.cpp / ggml source audit. +- vLLM source and installed package audit. +- LocalAI patch and docs audit. +- DGX artifact and profile audit. + +Their shared conclusion: the final docs are a useful snapshot, but several +claims are broader than the available evidence. + +Key findings: + +- The strongest unresolved implementation target is W4A16 grouped MoE prefill. + vLLM uses Marlin W4A16 on GB10, and llama.cpp already has a correct but untuned + scaffold in `ggml/src/ggml-cuda/w4a16-gemm.cu`. +- The existing W4A16 rejection is a first-implementation failure, not a proof of + impossibility. The patch header names fixable costs: f32 to bf16 cast pre-pass, + host tile-map setup, small copies, scalar dequant, and ragged tile waste. +- The 924 t/s paged GPU-steady decode figure is artifact-backed, but the vLLM + 1078 t/s true GPU-steady figure was not found as a self-contained + ntg16/ntg64 difference-method artifact. Reproduce before relying on the 86% + claim. +- GDN M5 is real, but M5/M8 provenance is muddy because CDEF records a dirty + dev-tree M8 commit while docs describe production M5 defaults. +- S3 fixed-period scheduling and fixed-slot padding were rejected, but adaptive + scheduling remains unproven. + +## Candidate Workstreams + +### A. Provenance And Baseline Reproduction + +Purpose: make later claims defensible. 
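Because every later claim depends on a clean host, the three host-state conditions from the Required Preflight section can be sketched as a pure check. This is a hedged illustration, not a script from the repository: `preflight_ok` and its arguments are hypothetical, the caller is assumed to have already collected the command output, and `FREE*` is read here as "starts with `FREE`".

```python
def preflight_ok(running_containers, compute_app_pids, lock_owner):
    """Return (ok, reasons) for the three host-state preflight conditions.

    running_containers: names of running containers (from `docker ps`)
    compute_app_pids:   PIDs from `nvidia-smi --query-compute-apps=pid`
    lock_owner:         contents of ~/gpu_bench_lock/owner, or None if absent
    """
    reasons = []
    if running_containers:
        reasons.append("containers running: " + ", ".join(running_containers))
    if compute_app_pids:
        reasons.append(f"{len(compute_app_pids)} GPU compute app(s) active")
    if lock_owner is not None and not lock_owner.strip().startswith("FREE"):
        reasons.append("GPU bench lock held by: " + lock_owner.strip())
    return (not reasons, reasons)
```

Returning the list of reasons alongside the boolean makes it easy to store the preflight outcome in the same artifact directory as the benchmark, next to the hostname, git SHA, and command record.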
+ +Tasks: + +- Build from clean `~/_git/llama.cpp` `localai-paged` source, or a clean DGX clone + generated from that source. +- Re-run canonical md5 gates for paged MoE and dense: + - paged MoE: `8cb0ce23777bf55f92f63d0292c756b0` + - dense: `5951a5b4d624ce891e22ab5fca9bc439` +- Re-run a short prefill baseline for MoE and dense at `npp=512,2048`. +- Re-run graph-node-traced decode for paged and vLLM using the same + difference-method shape: `ntg=16` and `ntg=64` at N=128 or N=256. + +Gate: + +- No implementation work starts until the baseline artifact names, source SHAs, + and commands are recorded. + +### B. W4A16 Grouped MoE Prefill Attack + +Purpose: port the vLLM Marlin W4A16 advantage into ggml's in-backend MoE prefill +path. + +Current hooks: + +- `ggml/src/ggml-cuda/w4a16-gemm.cu` +- `ggml/src/ggml-cuda/w4a16-gemm.cuh` +- `ggml/src/ggml-cuda/ggml-cuda.cu` around `ggml_cuda_mul_mat_id` +- `ggml/src/ggml-cuda/mmq.cu` around `LLAMA_W4A16_PREFILL_M` + +Known current costs: + +- Separate f32 to bf16 activation cast pass. +- Host-built tile metadata and H2D copies. +- Scalar in-register FP4 to bf16 dequant. +- 4-byte weight staging. +- Ragged expert tile waste. +- Interaction with the generic token-sorting fallback. + +Phased experiments: + +1. Reconfirm current 0035 W4A16 performance with clean provenance. +2. Remove or fuse the f32 to bf16 activation cast pre-pass. +3. Move tile metadata generation device-side or cache it across repeated shapes. +4. Improve weight staging width and shared-memory layout. +5. Tune tile shapes for ragged per-expert M distribution. +6. Compare against FP4-MMQ and vLLM Marlin buckets with nsys. + +Correctness gate: + +- `test-backend-ops MUL_MAT_ID` forced W4A16. +- Greedy md5 for unaffected default-off path. +- KL gate for engaged W4A16 path: `KLD(W4A16||f16) <= KLD(FP4-MMQ||f16)`. +- Decode path unchanged when `LLAMA_W4A16_PREFILL_M=0`. + +Benchmark gate: + +- Beat default FP4-MMQ on MoE `S_PP` at `npp=512` and `npp=2048`. 
+- No material peak-memory increase. +- No decode regression in the default path. + +### C. Native Ragged Grouped FP4-MMA Prefill + +Purpose: test whether patch 0034's native FP4-MMA PoC failed due to integration, +not due to the core kernel idea. + +Current hooks: + +- `ggml/src/ggml-cuda/fp4-gemm.cu` +- `ggml/src/ggml-cuda/fp4-gemm.cuh` +- `LLAMA_FP4_PREFILL_M` + +Experiment: + +- Build a graph-safe ragged grouped FP4-MMA MoE prefill kernel that avoids the + per-expert host-sync loop. + +Correctness gate: + +- Same KL and op gates as W4A16. +- Explicit proof that the per-expert host fallback is not on the hot path. + +Benchmark gate: + +- Beat current FP4-MMQ or lose decisively enough to close this branch. + +### D. GDN Chunked Scan Follow-up + +Purpose: compare vLLM's in-tree FLA-derived GDN path against the current M5 +implementation without relying on muddy dev-tree artifacts. + +Current hooks: + +- `ggml/src/ggml-cuda/gated_delta_net.cu` +- `GDN_TC` +- `GDN_CHUNK_MIN` +- existing M5 tensor-core ladder + +Phased experiments: + +1. Clean A/B: current production M5 against sequential and against recorded M8 + dev-tree behavior. +2. C=32 and C=64 variants. +3. dv slab variants. +4. cp.async staging variants. +5. Register-state variant only if the lower-risk variants show headroom. + +Correctness gate: + +- `test-backend-ops GATED_DELTA_NET`, including multi-chunk, tail-chunk, + multi-seq, and adversarial decay cases. +- Greedy md5 per path. +- KL gate for any non-byte-identical path. + +Benchmark gate: + +- Beat current M5, not just old sequential. +- Preserve decode behavior by keeping `GDN_CHUNK_MIN > 1`. + +### E. MoE Weighted Fan-in Fusion + +Purpose: remove generic graph-level MoE reduction overhead that vLLM avoids or +amortizes through fused MoE handling. + +Current source: + +- `src/llama-graph.cpp`, MoE down projection and expert reduction. +- `ggml/src/ggml-cuda/ggml-cuda.cu`, CUDA fusion and MoE support. 
+ +Experiment: + +- Add a CUDA-specific fused path for `down_experts * weights -> sum expert_used` + while preserving the current reduction order where required. + +Correctness gate: + +- Bit-exact for supported shapes, or KL-benign if reduction order changes. +- Handles all `n_expert_used` used by Qwen3.6 MoE. + +Benchmark gate: + +- Move MoE prefill or decode wall time by more than noise. If it is only a + 2-3% dispatch bucket, record and deprioritize. + +### F. Adaptive Serving Scheduler + +Purpose: keep S3's decode-window benefit without reproducing its TTFT collapse. + +Current hooks: + +- `tools/server/server-context.cpp` +- `LLAMA_PAGED_DECODE_STABLE` +- `LLAMA_PAGED_PREFILL_PERIOD` +- existing dynamic prefill budget patches. + +Experiment: + +- Replace fixed-period prefill deferral with adaptive admission based on live + decode width, waiting prefill backlog, and TTFT budget. + +Correctness gate: + +- Serving output correctness unchanged. +- No starvation of prefill requests. + +Benchmark gate: + +- Improve aggregate throughput or decode throughput at N=128 or N=256 without + the 2.5x TTFT regression from fixed S3. + +### G. Projection And GDN Glue Fusion + +Purpose: steal vLLM's `prepare_gdn_attention_core_inputs` idea where ggml still +pays small copy, cat, slice, or unpack kernels. + +Current source: + +- `src/models/qwen35.cpp` +- `src/models/qwen35moe.cpp` +- `ggml/src/ggml-cuda/ggml-cuda.cu` + +Experiment: + +- Fuse q/k/v/z unpacking, BA projection preparation, RMSNorm-gated output prep, + and FP4/FP8 quant prep where the graph pattern is stable. + +Correctness gate: + +- Per-op tests for new fusion. +- Greedy md5 for model paths. + +Benchmark gate: + +- Only continue if nsys shows this bucket is material after MoE and GDN work. + +## Subagent Plan + +Use subagents for independent read, implementation, and review slices. Do not use +subagents to edit the same files in parallel. 
+ +Recommended roles by phase: + +- Phase 0 source/provenance agent: owns command capture and source SHA checks. +- Phase 0 artifact agent: owns parsing existing and new benchmark artifacts. +- W4A16 kernel agent: owns `w4a16-gemm.*`. +- W4A16 integration agent: owns `ggml-cuda.cu` and `mmq.cu` dispatch plumbing. +- GDN kernel agent: owns `gated_delta_net.cu`. +- Scheduler agent: owns server scheduling files only. +- Reviewer agent: reviews gates, provenance, and whether measured claims match + artifacts. + +Subagent output requirements: + +- File paths and functions inspected or changed. +- Exact commands run. +- Exact artifacts produced. +- Pass/fail result against the phase gate. +- Any uncertainty labeled explicitly. + +## Phase Order + +### Phase 0 - Reproduce And Correct The Record + +Do first. + +Deliverables: + +- Clean source/build provenance. +- Short prefill baseline. +- Graph-node-traced decode difference-method for paged and vLLM. +- Updated docs if the 86% decode claim or CDEF provenance changes. + +Exit criteria: + +- Baseline is trustworthy enough to judge optimization deltas. + +### Phase 1 - W4A16 MoE Prefill + +Do second. + +Deliverables: + +- Reconfirmed current W4A16 baseline. +- At least one targeted W4A16 overhead removal. +- A/B against default FP4-MMQ. + +Exit criteria: + +- Either W4A16 beats FP4-MMQ and continues, or it is rejected with direct + artifact-backed evidence. + +### Phase 2 - GDN Follow-up + +Do after Phase 1 unless Phase 0 proves decode/GDN is the larger immediate gap. + +Deliverables: + +- Clean M5 vs candidate geometry A/B. +- Correctness gates for all candidate variants. + +Exit criteria: + +- Keep the best variant or close the branch with measured evidence. + +### Phase 3 - MoE Fan-in And Glue Fusions + +Do after kernel work identifies remaining non-kernel buckets. + +Deliverables: + +- nsys-backed bucket selection. +- Fusion implementation only for material buckets. 
+ +Exit criteria: + +- Keep only fusions that move end-to-end numbers beyond noise. + +### Phase 4 - Adaptive Serving + +Do after compute kernels are stable. + +Deliverables: + +- Adaptive scheduling policy. +- Serving A/B at N=8,32,128,256. + +Exit criteria: + +- Improve serving without TTFT collapse. + +## Decision Rules + +- Prefer measured in-backend results over source plausibility. +- Prefer small kill-gate experiments over multi-week rewrites. +- Continue a branch only if it beats the current shipped path, not an obsolete + baseline. +- Document rejected branches with artifact paths so they are not rerun. +- Keep the fork branch canonical and regenerate LocalAI patches from it. + diff --git a/backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md b/backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md new file mode 100644 index 000000000000..404f36abcd0e --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md @@ -0,0 +1,172 @@ +# GDN Shared-A/Ai Cost Model + +Phase 12 decides whether the next GDN prefill attempt should implement a +shared-A/Ai global-scratch prototype or stop GDN kernel work on GB10. 
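The dynamic shared-memory and Ai scratch formulas that this cost model relies on (stated in the Dynamic Shared Memory and Ai Scratch Size sections) can be checked with a short script. This is a minimal sketch, not part of the backend: the function names are hypothetical, and it assumes 4-byte floats and the `dv_tile=64` slab width named in the Phase 13 constraints.

```python
# Sketch of the cost-model arithmetic; function names are illustrative only.

def smem_floats_full(s_v: int, c: int) -> int:
    # Full-width CTA: S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C
    return s_v * s_v + 2 * c * s_v + s_v * c + c * c + 3 * c + 2 * c * c


def smem_floats_slab(s_v: int, c: int, dv_tile: int = 64) -> int:
    # Value-slab CTA with U staging:
    # S_v*64 + 2*C*S_v + 64*C + C*C + 3*C + 2*C*C + 64*C
    return (s_v * dv_tile + 2 * c * s_v + dv_tile * c
            + c * c + 3 * c + 2 * c * c + dv_tile * c)


def ai_scratch_bytes(npl: int, heads: int, npp: int, bt: int,
                     dtype_bytes: int) -> int:
    # npl * H * ceil(npp / BT) * BT * BT * sizeof(dtype)
    chunks = -(-npp // bt)  # integer ceiling division
    return npl * heads * chunks * bt * bt * dtype_bytes


if __name__ == "__main__":
    FLOAT_BYTES = 4  # assumption: f32 shared-memory buffers
    for label, floats in [
        ("C16 full-width", smem_floats_full(128, 16)),
        ("C32 full-width", smem_floats_full(128, 32)),
        ("C32 slab64 + U staging", smem_floats_slab(128, 32)),
    ]:
        size = floats * FLOAT_BYTES
        print(f"{label}: {size} bytes ({size / 1024:.2f} KiB)")
    # MoE, npp=2048, BT=32, f32 Ai
    print(ai_scratch_bytes(32, 32, 2048, 32, 4) / 2**20, "MiB")
```

Running it for `S_v=128` and the `npl=32` benchmark shape gives the byte counts and MiB figures listed in the tables, which makes it easy to re-derive a row if the chunk size or head count changes.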
+ +## Reference Points + +llama.cpp: + +- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` + - `gated_delta_net_chunked_cuda` + - `launch_gdn_chunked` + - `launch_gated_delta_net` + - `ggml_cuda_op_gated_delta_net` + +vLLM/FLA: + +- `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/chunk.py` + - `chunk_gated_delta_rule_fwd` +- `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/solve_tril.py` + - `solve_tril` + - `solve_tril_16x16_kernel` + - `merge_16x16_to_32x32_inverse_kernel` + - `merge_16x16_to_64x64_inverse_kernel` +- `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/wy_fast.py` + - `recompute_w_u_fwd` + +## Metadata + +DGX metadata artifact: + +- `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt` + +GGUF metadata: + +| Model | Arch | Blocks | Full-attn interval | GDN layers | SSM inner | SSM state | GDN heads | +|-------|------|--------|--------------------|------------|-----------|-----------|-----------| +| MoE | `qwen35moe` | 41 | 4 | 30 inferred | 4096 | 128 | 32 inferred | +| Dense | `qwen35` | 64 | 4 | 48 inferred | 6144 | 128 | 48 inferred | + +Notes: + +- `GDN heads = ssm.inner_size / ssm.state_size`. +- MoE has one `nextn` layer; the serving/prefill stack uses the 40 normal + layers, with 30 GDN layers at interval 4. +- Dense has 64 layers, 48 GDN layers at interval 4. + +## Dynamic Shared Memory + +Formula: + +```text +C16 full-width current M5: + floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C + +C32 full-width: + floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C + +C32 slab64 with U staging: + floats = S_v*64 + 2*C*S_v + 64*C + C*C + 3*C + 2*C*C + 64*C +``` + +For `S_v=128`: + +| Shape | Bytes | KiB | Fits GB10 dynamic smem? 
| +|-------|-------|-----|-------------------------| +| C16 full-width | 93,376 | 91.19 | yes | +| C32 full-width | 127,360 | 124.38 | no | +| C32 slab64 + U staging | 94,592 | 92.38 | yes | + +Implication: + +- C32 full-width cannot be a single current-style CTA on GB10. +- C32 only fits by splitting value columns or by changing state residency. +- Splitting value columns must share A/Ai or it repeats the Phase 10 failure. + +## Ai Scratch Size + +Formula: + +```text +Ai scratch bytes = npl * H * ceil(npp / BT) * BT * BT * sizeof(dtype) +``` + +Benchmark shape: `npl=32`, `S_v=128`. + +| Model | H | npp | BT | Ai dtype | Chunks | Ai scratch MiB | 3x Ai traffic MiB | +|-------|---|-----|----|----------|--------|----------------|-------------------| +| MoE | 32 | 512 | 32 | f32 | 16 | 64.0 | 192.0 | +| MoE | 32 | 512 | 32 | f16 | 16 | 32.0 | 96.0 | +| MoE | 32 | 512 | 64 | f32 | 8 | 128.0 | 384.0 | +| MoE | 32 | 512 | 64 | f16 | 8 | 64.0 | 192.0 | +| MoE | 32 | 2048 | 32 | f32 | 64 | 256.0 | 768.0 | +| MoE | 32 | 2048 | 32 | f16 | 64 | 128.0 | 384.0 | +| MoE | 32 | 2048 | 64 | f32 | 32 | 512.0 | 1536.0 | +| MoE | 32 | 2048 | 64 | f16 | 32 | 256.0 | 768.0 | +| Dense | 48 | 512 | 32 | f32 | 16 | 96.0 | 288.0 | +| Dense | 48 | 512 | 32 | f16 | 16 | 48.0 | 144.0 | +| Dense | 48 | 512 | 64 | f32 | 8 | 192.0 | 576.0 | +| Dense | 48 | 512 | 64 | f16 | 8 | 96.0 | 288.0 | +| Dense | 48 | 2048 | 32 | f32 | 64 | 384.0 | 1152.0 | +| Dense | 48 | 2048 | 32 | f16 | 64 | 192.0 | 576.0 | +| Dense | 48 | 2048 | 64 | f32 | 32 | 768.0 | 2304.0 | +| Dense | 48 | 2048 | 64 | f16 | 32 | 384.0 | 1152.0 | + +`3x Ai traffic` means one Ai write plus two Ai reads for two value slabs. + +## Interpretation + +The f32 `BT=32` scratch path is large but plausible: + +- Peak scratch is 256 MiB for MoE and 384 MiB for dense at `npp=2048,npl=32`. +- Ai traffic is 768 MiB for MoE and 1.125 GiB for dense per GDN layer call. 
+- This is not free on LPDDR5x, but it is not automatically worse than + recomputing A/Ai in every value slab. + +The f16/BF16 Ai path halves traffic but should not be first because Phase 10 and +Phase 11 showed correctness must be established before performance. The first +prototype should store Ai in f32, stay default-off, and use md5/KL gates before +trying a lossy Ai dtype. + +## Decision + +GO: Phase 13 should implement a default-off global-Ai scratch prototype. + +Rationale: + +- The only remaining C32 path that addresses Phase 10's failure is sharing A/Ai + across value slabs. +- `BT=32` f32 scratch has acceptable peak memory for the existing GB10 + benchmark shapes. +- The implementation can be default-off and rejected cleanly if global scratch + traffic or extra launch boundaries dominate. + +Phase 13 constraints: + +- Prototype only `BT=32`, f32 Ai, two `dv_tile=64` value slabs. +- Keep decode out via `GDN_CHUNK_MIN > 1`. +- Gate with `GATED_DELTA_NET`, canonical MoE/dense md5, and same-session A/B. +- If md5 changes, run KL before benchmarking. +- If the prototype is flat or slower, reject it and stop GDN kernel work on + GB10; do not iterate into f16 Ai until f32 proves the schedule can win. + +## Phase 13 Result + +Phase 13 implemented the f32 Global-Ai32 prototype and rejected it. + +Correctness: + +- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`. +- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`. + +Performance: + +| Model | Mode | PP | S_PP t/s | +|-------|------|----|----------| +| MoE | M5 base | 2048 | 2425.10 | +| MoE | Global Ai32 | 2048 | 2097.76 | +| Dense | M5 base | 2048 | 1016.14 | +| Dense | Global Ai32 | 2048 | 918.19 | + +Artifacts: + +- `/home/mudler/bench/phase13_gdn_global_ai32/gates/` +- `/home/mudler/bench/phase13_gdn_global_ai32/ab/` +- `/home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff` + +Final decision: + +- Reject Global-Ai32. +- Stop GDN kernel work on GB10. 
The remaining vLLM GDN advantage is not + reachable through the low-conflict C16/C32 patch shapes tested here. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md b/backend/cpp/llama-cpp-localai-paged/docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md new file mode 100644 index 000000000000..3a2f06e6222b --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md @@ -0,0 +1,514 @@ +# Plan: ship the paged llama.cpp as its OWN backend + NVFP4 Qwen3.6 gallery items + +Scoping deliverable only. NOTHING is changed by this document. It is grounded in the +actual repo structure (read 2026-06-26 in worktree feat+paged-attention), not assumptions. + +SHIPPED REALITY (update 2026-06-27): the backend ships CUDA-only. The matrix rows and +the index.yaml meta-backend keep ONLY the CUDA/cublas variants (cuda-12, cuda-13, and +the nvidia-l4t arm64 cuda-12/cuda-13 Jetson rows). The cpu / vulkan / sycl / hipblas / +metal-darwin variants discussed below as optional/phase-2 were NOT shipped (and the +darwin row was removed): off-CUDA the patchset's wins gate off, so it is neutral-to- +negative there and non-CUDA users should use the stock llama-cpp backend (README 4c). + +================================================================================ +0. GROUND TRUTH (what the repo actually does today) +================================================================================ + +The paged patchset is ALREADY integrated into the stock llama-cpp backend in this +worktree. Two mechanisms, both already present: + + (a) BUILD: backend/cpp/llama-cpp/Makefile has `LLAMA_PAGED?=on`. The `llama.cpp:` + target git-applies patches/0*.patch (base series) then, when LLAMA_PAGED != off, + patches/paged/0*.patch (the 0018-0023 paged series + the earlier 0001-0017). + prepare.sh has a fallback `patch`-based apply guarded by a sentinel + (llama.cpp/src/paged-kv-manager.cpp). 
So a stock `make backends/llama-cpp` TODAY + already ships the paged engine compiled in. + + (b) RUNTIME GATING: backend/cpp/llama-cpp/grpc-server.cpp ALREADY carries the option + hooks (lines ~752-842). They only call setenv() before context init: + - option `kv_paged` / `paged_kv` / `paged_attention` -> setenv LLAMA_KV_PAGED=1 + - option `kv_paged_debug` / `paged_kv_debug` -> setenv LLAMA_KV_PAGED_DEBUG=1 + - option `max_prefill_tokens` / `mpt` / `prefill_budget` -> setenv LLAMA_PREFILL_BUDGET + - option `max_batch_tokens` / `mbt` -> setenv LLAMA_MAX_BATCH_TOKENS + - option `prefill_cap` -> setenv LLAMA_PREFILL_CAP + Against UNPATCHED llama.cpp these setenv() calls are inert (nothing reads the env), + so grpc-server.cpp is byte-safe to share between a clean build and a paged build. + The paged engine itself lives entirely inside the patched llama.cpp lib + (paged-kv-manager.cpp etc.), NOT in grpc-server.cpp. + +Conclusion: "stock llama-cpp + paged patchset, runtime-gated" is the CURRENT state of +ONE backend. The task is to SPLIT that into two backends: + - llama-cpp = clean upstream llama.cpp (de-risked: a dep-bump can never break on a + paged hook), grpc-server.cpp keeps the dormant hooks. + - = stock grpc-server.cpp + paged patch series applied + paged on. + +The turboquant backend is the EXACT precedent for "a llama.cpp variant that reuses the +backend/cpp/llama-cpp grpc-server sources via a thin wrapper Makefile + its own Dockerfile ++ its own matrix rows". Copy turboquant's shape, with two simplifications (see section 1). + +CPU_ALL_VARIANTS reuse: backend/cpp/llama-cpp/Makefile already has `llama-cpp-cpu-all` +(one grpc-server + dlopen libggml-cpu-*.so via -DGGML_BACKEND_DL/-DGGML_CPU_ALL_VARIANTS, +SHARED_LIBS=ON make-var). turboquant mirrors it with `turboquant-cpu-all`. The new backend +gets the same single-build CPU target for free by reusing the same Makefile machinery. 
+ +-------------------------------------------------------------------------------- +RECOMMENDED BACKEND NAME: `llama-cpp-paged` (see section 4 for the full rationale) +-------------------------------------------------------------------------------- +Everywhere below, NAME = llama-cpp-paged, DOCKERFILE = Dockerfile.llama-cpp-paged, +SRC DIR = backend/cpp/llama-cpp-paged/, MAKE VAR = BACKEND_LLAMA_CPP_PAGED. +DO NOT use the dotted working name `localai-llama.cpp`: a dot in Dockerfile. and +in the tag-suffix is unprecedented (every sibling is hyphenated: llama-cpp, ik-llama-cpp, +turboquant, ds4) and complicates the changed-backends.js endsWith() suffix matching. + +================================================================================ +1. NEW BACKEND - file by file +================================================================================ + +-------------------------------------------------------------------------------- +1.1 backend/cpp/llama-cpp/Makefile (the ONE necessary touch to stock) +-------------------------------------------------------------------------------- +Change exactly one default so the STOCK image ships clean against upstream: + + -LLAMA_PAGED?=on + +LLAMA_PAGED?=off + +Why: this is the entire point of the split - stock llama-cpp must build clean so an +upstream LLAMA_VERSION bump can never fail on a paged hook. The runtime hooks in +grpc-server.cpp stay (inert). The new backend forces LLAMA_PAGED=on explicitly (1.2), so +it does not depend on this default. NOTE this DOES change stock's shipped artifact (it +currently ships paged-compiled-in-but-gated); that is intended de-risking, call it out in +the PR. If the team prefers stock literally untouched, the alternative is to leave +`?=on` and accept that stock keeps carrying the patch series - but then "clean stock" is +not achieved. Recommendation: flip to off. 
+ +(No other change to backend/cpp/llama-cpp/ - grpc-server.cpp, CMakeLists.txt, prepare.sh, +patches/, patches/paged/ are all reused as-is by the new backend.) + +-------------------------------------------------------------------------------- +1.2 backend/cpp/llama-cpp-paged/Makefile (NEW - thin wrapper, model on turboquant) +-------------------------------------------------------------------------------- +Mirror backend/cpp/turboquant/Makefile, but SIMPLER (two things turboquant needs that we +do NOT): + - turboquant overrides LLAMA_REPO/LLAMA_VERSION to a fork. We use the SAME upstream pin + as stock (it lives in backend/cpp/llama-cpp/Makefile, already auto-bumped). So we do + NOT set LLAMA_VERSION here -> no bump_deps.yaml entry needed (big simplification vs + turboquant). We only force LLAMA_PAGED=on. + - turboquant runs patch-grpc-server.sh (augments the KV-cache type allow-list) and + apply-patches.sh (fork catch-up). We need NEITHER: grpc-server.cpp already has the + paged hooks, and the paged patch series is applied by the copied llama-cpp Makefile's + own `llama.cpp:` target when LLAMA_PAGED=on. 
+ +Shape (one flavor shown; replicate the turboquant flavor set: avx/avx2/avx512/fallback/ +cpu-all/grpc/rpc-server): + + LLAMA_CPP_DIR := $(CURRENT_MAKEFILE_DIR)/../llama-cpp + + define paged-build # $(1)=flavor $(2)=cmake flags $(3)=target + rm -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build + cp -rf $(LLAMA_CPP_DIR) $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build + $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build purge + # clone upstream + apply base AND paged patch series (LLAMA_PAGED=on forces it) + LLAMA_PAGED=on $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build llama.cpp + CMAKE_ARGS="$(CMAKE_ARGS) $(2)" TARGET="$(3)" LLAMA_PAGED=on \ + $(MAKE) -C $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build grpc-server + cp -rfv $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-$(1)-build/grpc-server llama-cpp-paged-$(1) + endef + + llama-cpp-paged-cpu-all: + # identical to turboquant-cpu-all: SHARED_LIBS=ON + GGML_BACKEND_DL + CPU_ALL_VARIANTS + # + --target ggml; then collect ggml-shared-libs/ for package.sh to bundle. + ... LLAMA_PAGED=on SHARED_LIBS=ON \ + EXTRA_CMAKE_ARGS="-DGGML_BACKEND_DL=ON -DGGML_CPU_ALL_VARIANTS=ON" \ + TARGET="--target grpc-server --target ggml" ... + + package: ; bash package.sh + purge: ; rm -rf $(CURRENT_MAKEFILE_DIR)/../llama-cpp-paged-*-build; rm -rf llama-cpp-paged-* package + clean: purge + +Binaries are named llama-cpp-paged-{cpu-all,fallback,grpc,rpc-server,...} so run.sh and +package.sh glob them. + +-------------------------------------------------------------------------------- +1.3 backend/cpp/llama-cpp-paged/run.sh (NEW - copy turboquant/run.sh, rename binaries) +-------------------------------------------------------------------------------- +s/turboquant/llama-cpp-paged/g. Prefers llama-cpp-paged-cpu-all if present, falls back to +llama-cpp-paged-fallback; llama-cpp-paged-grpc when LLAMACPP_GRPC_SERVERS set; Darwin +DYLD_LIBRARY_PATH branch; lib/ld.so launch. 
Keep verbatim otherwise. + +-------------------------------------------------------------------------------- +1.4 backend/cpp/llama-cpp-paged/package.sh (NEW - copy turboquant/package.sh, rename) +-------------------------------------------------------------------------------- +s/turboquant/llama-cpp-paged/g. Copies llama-cpp-paged-* into package/, bundles +ggml-shared-libs/*.so* into package/lib (the CPU_ALL_VARIANTS dlopen set), copies run.sh, +and the per-arch libc/ld.so set (unchanged). + +-------------------------------------------------------------------------------- +1.5 backend/Dockerfile.llama-cpp-paged (NEW - copy Dockerfile.turboquant, swap paths) +-------------------------------------------------------------------------------- +Identical 3-stage structure (builder-fromsource / builder-prebuilt / FROM scratch). Edits: + - bind/run .docker/llama-cpp-paged-compile.sh (new, 1.6) instead of turboquant-compile.sh + - ccache id: id=llama-cpp-paged-ccache-${TARGETARCH}-${BUILD_TYPE} + (OPTIONAL OPTIMIZATION: set id=llama-cpp-ccache-${TARGETARCH}-${BUILD_TYPE} to SHARE + stock llama-cpp's ccache - the paged TUs are mostly byte-identical to stock, so a warm + stock cache would give the paged build near-free object reuse. Trade-off: a regression + in one could surface as a cold miss in the other. Recommend sharing; revisit if noisy.) + - both `make -BC /LocalAI/backend/cpp/llama-cpp-paged package` + - final COPY --from=builder /LocalAI/backend/cpp/llama-cpp-paged/package/. 
./ + +-------------------------------------------------------------------------------- +1.6 .docker/llama-cpp-paged-compile.sh (NEW - copy llama-cpp-compile.sh, swap make targets) +-------------------------------------------------------------------------------- +Identical to .docker/llama-cpp-compile.sh except `cd .../llama-cpp-paged` and call +`make llama-cpp-paged-cpu-all` (BUILD_TYPE empty / CPU) or `make llama-cpp-paged-fallback` +(GPU), then `make llama-cpp-paged-grpc` + `make llama-cpp-paged-rpc-server`. Keep the +arm64 gcc-14 apt step (CPU_ALL_VARIANTS armv9.2 SME needs gcc-14). ccache export unchanged. + +-------------------------------------------------------------------------------- +1.7 Makefile (top-level) - 6 edits, mirror the turboquant lines +-------------------------------------------------------------------------------- + a) .NOTPARALLEL (line 2): append `backends/llama-cpp-paged` + b) Backend def (after BACKEND_TURBOQUANT, line ~1172): + # llama-cpp-paged = stock llama.cpp grpc-server + LocalAI paged-attention patch + # series (LLAMA_PAGED=on). Reuses backend/cpp/llama-cpp sources via a thin wrapper. + BACKEND_LLAMA_CPP_PAGED = llama-cpp-paged|llama-cpp-paged|.|false|false + (lang field `llama-cpp-paged` -> Dockerfile.llama-cpp-paged, matching the + llama-cpp / ik-llama-cpp / turboquant convention where lang==backend name.) + c) generate-docker-build-target eval (after BACKEND_TURBOQUANT, line ~1273): + $(eval $(call generate-docker-build-target,$(BACKEND_LLAMA_CPP_PAGED))) + d) docker-build-backends (line ~1337): append docker-build-llama-cpp-paged + e) test-extra-backend-llama-cpp-paged target (mirror test-extra-backend-turboquant, + line ~673): BACKEND_IMAGE=local-ai-backend:llama-cpp-paged $(MAKE) test-extra-backend + f) (optional) backends/llama-cpp-paged-darwin target if shipping metal (mirror + backends/llama-cpp-darwin at line 1124; see 1.11). 
+ +-------------------------------------------------------------------------------- +1.8 .github/backend-matrix.yml - add rows (mirror every llama-cpp row, swap names) +-------------------------------------------------------------------------------- +For EACH variant you choose to ship (see phased recommendation in section 4), add a row +copied from the corresponding llama-cpp row with: + - backend: "llama-cpp-paged" + - dockerfile: "./backend/Dockerfile.llama-cpp-paged" + - tag-suffix: swap `-llama-cpp` -> `-llama-cpp-paged` + (e.g. -cpu-llama-cpp -> -cpu-llama-cpp-paged; + -gpu-nvidia-cuda-12-llama-cpp -> -gpu-nvidia-cuda-12-llama-cpp-paged; etc.) + - builder-base-image: UNCHANGED - reuse the same base-grpc-* tags as llama-cpp + (this backend compiles the same gRPC + same toolchain; no new base-images.yml variant + is needed, so NO base-images bootstrap step). This is the cheap-variant payoff. + - CPU: TWO per-arch rows (amd64 ubuntu-latest + arm64 ubuntu-24.04-arm) sharing + tag-suffix '-cpu-llama-cpp-paged' so changed-backends.js emits a merge-matrix entry and + backend-merge-jobs assembles the manifest list. Same per-arch native + manifest-merge + pattern as -cpu-llama-cpp. + - Darwin (if shipping): add to includeDarwin: + - backend: "llama-cpp-paged" + tag-suffix: "-metal-darwin-arm64-llama-cpp-paged" + lang: "go" + (omit build-type, exactly like the llama-cpp darwin row at line 4908.) + + REMINDER: the CI path filter only builds a backend on a PR when a file under its dir + changes. The PR that adds this backend touches backend/cpp/llama-cpp-paged/* so it self- + triggers. But also add the cross-trigger in 1.9 so future edits to backend/cpp/llama-cpp/ + (the shared source) retrigger this backend too. 
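The path filter and cross-trigger described in the reminder above can be modelled in a few lines. This is a hedged Python sketch of the behaviour only: the real logic lives in `scripts/changed-backends.js` (JavaScript), and the function name `backend_changed` is hypothetical.

```python
# Model of the CI path filter plus the llama-cpp -> llama-cpp-paged cross-trigger.
# Illustrative only; not the actual changed-backends.js implementation.

def backend_changed(backend: str, backend_dir: str, changed_files: list[str]) -> bool:
    # A backend is rebuilt when any changed file lives under its own directory.
    changed = any(f.startswith(backend_dir) for f in changed_files)
    # Cross-trigger: the paged backend reuses the stock llama-cpp sources,
    # so edits under backend/cpp/llama-cpp/ must rebuild it too.
    if backend == "llama-cpp-paged" and not changed:
        changed = any(f.startswith("backend/cpp/llama-cpp/") for f in changed_files)
    return changed


if __name__ == "__main__":
    shared_edit = ["backend/cpp/llama-cpp/grpc-server.cpp"]
    print(backend_changed("llama-cpp-paged", "backend/cpp/llama-cpp-paged/", shared_edit))
```

Because the prefix ends in a slash, an edit under `backend/cpp/llama-cpp-paged/` does not retrigger the stock `llama-cpp` backend; the dependency runs one way only.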
+ +-------------------------------------------------------------------------------- +1.9 scripts/changed-backends.js - three edits (mirror turboquant exactly) +-------------------------------------------------------------------------------- + a) inferBackendPath(): add BEFORE the generic `endsWith("llama-cpp")` branch (line 56), + next to the turboquant branch (line 45): + if (item.dockerfile.endsWith("llama-cpp-paged")) { + // reuses backend/cpp/llama-cpp sources via a thin wrapper Makefile + return `backend/cpp/llama-cpp-paged/`; + } + ORDER MATTERS: "Dockerfile.llama-cpp-paged".endsWith("llama-cpp") is false today, but + keep the specific branch first regardless (defensive, and returns the right path). + b) inferBackendPathDarwin(): add a case (next to the llama-cpp one at line 66): + if (item.backend === "llama-cpp-paged") { return `backend/cpp/llama-cpp-paged/`; } + c) Per-backend cross-trigger (lines 274-278, mirror the turboquant block): + if (backend === "llama-cpp-paged" && !changed) { + changed = changedFiles.some(file => file.startsWith("backend/cpp/llama-cpp/")); + } + Verify: node -e "... e.dockerfile.endsWith('llama-cpp-paged') ..." per adding-backends.md. + +-------------------------------------------------------------------------------- +1.10 backend/index.yaml - meta + image entries (META-BACKEND - capabilities map, NO uri) +-------------------------------------------------------------------------------- +GOTCHA (project_backend_meta_gotcha): a backend that ships per-platform images MUST be a +meta backend = an anchor with a `capabilities:` map and NO top-level `uri:`; the concrete +per-platform entries carry the uri. Copy the *llamacpp anchor (lines 3-31). 
+ + Step a - meta anchor in `## metas` (after *turboquant, ~line 74): + - &llamacpppaged + name: "llama-cpp-paged" + alias: "llama-cpp-paged" + license: mit + icon: + description: | + LocalAI's paged-attention llama.cpp: on-demand paged KV cache + decode-first + prefill budget. 
Stock llama.cpp grpc-server + the LocalAI paged patch series. + Tuned for NVFP4 dense/MoE on Blackwell/GB10. Reuses the llama-cpp gRPC server. + urls: [ https://github.com/ggerganov/llama.cpp ] + tags: [ text-to-text, LLM, CPU, GPU, CUDA, Metal, paged-attention, nvfp4 ] + capabilities: + default: "cpu-llama-cpp-paged" + nvidia: "cuda12-llama-cpp-paged" + nvidia-cuda-12: "cuda12-llama-cpp-paged" + nvidia-cuda-13: "cuda13-llama-cpp-paged" + nvidia-l4t: "nvidia-l4t-arm64-llama-cpp-paged" + nvidia-l4t-cuda-12: "nvidia-l4t-arm64-llama-cpp-paged" + nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-llama-cpp-paged" + metal: "metal-llama-cpp-paged" + # add amd/intel/vulkan keys ONLY for variants you actually build (section 4) + + Step b - a `-development` meta (mirror llama-cpp-development, line 1611) with the same + capabilities map pointing at the `*-development` image names. + + Step c - concrete image entries at end of file (mirror the llama-cpp block lines + 2106-2200), one latest + one development per variant, each as: + - !!merge <<: *llamacpppaged + name: "cpu-llama-cpp-paged" + uri: "quay.io/go-skynet/local-ai-backends:latest-cpu-llama-cpp-paged" + mirrors: [ localai/localai-backends:latest-cpu-llama-cpp-paged ] + - !!merge <<: *llamacpppaged + name: "cpu-llama-cpp-paged-development" + uri: "quay.io/go-skynet/local-ai-backends:master-cpu-llama-cpp-paged" + mirrors: [ localai/localai-backends:master-cpu-llama-cpp-paged ] + ...repeat for cuda12 / cuda13 / l4t / metal etc. + The `latest-` / `master-` uri prefix + tag-suffix MUST match the matrix tag-suffix exactly. 
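The naming rule in Step c (a `latest-` or `master-` prefix followed by the matrix tag-suffix) can be sketched as a small generator and checker. This is an illustration under stated assumptions: the helper names are hypothetical, and the entry name is passed in explicitly because names such as `cuda12-llama-cpp-paged` are not derivable from the tag-suffix.

```python
# Sketch: derive the latest/development image entries for one variant and
# check that a uri agrees with the matrix tag-suffix. Helper names are
# illustrative; registry and mirror strings are the ones shown in Step c.

REGISTRY = "quay.io/go-skynet/local-ai-backends"
MIRROR = "localai/localai-backends"


def image_entries(name: str, tag_suffix: str) -> list[dict]:
    if not tag_suffix.startswith("-"):
        raise ValueError("tag-suffix must start with '-'")
    entries = []
    # latest-* backs the release entry, master-* backs the -development entry.
    for prefix, entry_name in (("latest", name), ("master", f"{name}-development")):
        tag = f"{prefix}{tag_suffix}"
        entries.append({
            "name": entry_name,
            "uri": f"{REGISTRY}:{tag}",
            "mirrors": [f"{MIRROR}:{tag}"],
        })
    return entries


def uri_matches_suffix(uri: str, tag_suffix: str) -> bool:
    # The image tag is everything after the last ':' in the uri.
    tag = uri.rsplit(":", 1)[-1]
    return tag in (f"latest{tag_suffix}", f"master{tag_suffix}")


if __name__ == "__main__":
    for entry in image_entries("cpu-llama-cpp-paged", "-cpu-llama-cpp-paged"):
        print(entry["name"], entry["uri"])
```

A check like `uri_matches_suffix` could run over every concrete entry during review to catch a uri whose tag drifted from the matrix row.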
+ +-------------------------------------------------------------------------------- +1.11 Darwin (only if shipping metal; the NVFP4 target is CUDA, so metal is optional/phase 2) +-------------------------------------------------------------------------------- +If metal is shipped, also: + - scripts/build/llama-cpp-paged-darwin.sh (copy scripts/build/llama-cpp-darwin.sh; it + drives the 3 CMake variants + otool dylib bundling). Ensure it forces LLAMA_PAGED=on. + - Makefile `backends/llama-cpp-paged-darwin` target (mirror backends/llama-cpp-darwin). + - backend_build_darwin.yml: add the llama-cpp-paged branch (mirror the llama-cpp-specific + step that calls `make backends/llama-cpp-darwin`). + - index.yaml metal-llama-cpp-paged / -development image entries (already in 1.10). + - C++ proto gotcha already handled (reuses llama-cpp CMakeLists.txt with hw_grpc_proto + linking protobuf/grpc++), so no Homebrew-include failure. + +-------------------------------------------------------------------------------- +1.12 Importer / /backends/known dropdown (drop-in, NOT a new importer) +-------------------------------------------------------------------------------- +This backend consumes GGUF exactly like llama-cpp -> extend the EXISTING importer, do not +add a new one (per adding-backends.md rule 2). Edit core/gallery/importers/llama-cpp.go: + - AdditionalBackends() (line 37): append + {Name: "llama-cpp-paged", Modality: "text", + Description: "Paged-attention llama.cpp (on-demand paged KV + decode-first budget)"} + - Import() backend allow-list (line 133): add "llama-cpp-paged" to the switch case so a + preferences.backend == "llama-cpp-paged" is honored: + case "ik-llama-cpp", "turboquant", "llama-cpp-paged": backend = b + - core/gallery/importers/importers_test.go: add a table case asserting the preference + override emits backend: llama-cpp-paged (Ginkgo/Gomega; reuse an existing public GGUF + HF fixture). Run `go test ./core/gallery/importers/...`. 
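The allow-list behaviour described in 1.12 is small enough to state as code. This is a Python sketch of the intended logic only, not the Go importer itself: `resolve_backend` is a hypothetical name, and the fallback to stock `llama-cpp` for non-listed values is an assumption that should be confirmed against `Import()`.

```python
# Backends the importer accepts as an explicit preference (list from 1.12).
ALLOWED_OVERRIDES = {"ik-llama-cpp", "turboquant", "llama-cpp-paged"}


def resolve_backend(preferences: dict, default: str = "llama-cpp") -> str:
    """Return the preferred backend if it is allow-listed, else the default.

    Assumption: anything outside the allow-list falls back to the stock
    backend; the real importer may behave differently.
    """
    requested = preferences.get("backend", "")
    return requested if requested in ALLOWED_OVERRIDES else default
```

The table case added to importers_test.go is expected to assert the first branch: a `llama-cpp-paged` preference comes back unchanged.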
+ +-------------------------------------------------------------------------------- +1.13 Docs +-------------------------------------------------------------------------------- + - docs/content/features/backends.md: add llama-cpp-paged to the text-to-text/LLM list, + one line noting paged KV + NVFP4 Blackwell tuning. (Not an in-house from-scratch engine + -> it is a llama.cpp variant -> do NOT add to the README maintained-engines table.) + +-------------------------------------------------------------------------------- +1.14 Does grpc-server.cpp need the paged hooks? YES - already present, reused unchanged. +-------------------------------------------------------------------------------- +The hooks (kv_paged / max_batch_tokens / prefill_budget / prefill_cap) are already in the +SHARED backend/cpp/llama-cpp/grpc-server.cpp. The paged backend reuses that file verbatim +(via the Makefile copy). No patch-grpc-server.sh step is needed (unlike turboquant). The +hooks are what translate the gallery `options:` (1.10 section 2) into the LLAMA_KV_PAGED / +LLAMA_MAX_BATCH_TOKENS env that the paged llama.cpp lib reads. + +================================================================================ +2. GALLERY ITEMS - NVFP4 Qwen3.6 dense + MoE +================================================================================ + +Add two entries to gallery/index.yaml. Schema (verified against existing GGUF items and +the LocalAI config structs): backend selection via `overrides.backend`; runtime knobs via +either typed config fields (context_size/f16/flash_attention/gpu_layers/batch) or the +`options:` string list (key:value, parsed by grpc-server.cpp set_option). 
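For orientation, the `options:` list is just a sequence of `key:value` strings. A minimal Python sketch of that shape follows; it is not the grpc-server.cpp parser. `parse_options` is a hypothetical name, and rejecting items without a `:` is a choice made for this sketch (the real `set_option` may treat them differently).

```python
def parse_options(options: list) -> dict:
    """Split each "key:value" item on the first ':' only."""
    parsed = {}
    for item in options:
        key, sep, value = item.partition(":")
        if not sep:
            # Sketch-only choice: refuse items that are not key:value.
            raise ValueError(f"not a key:value option: {item!r}")
        parsed[key.strip()] = value.strip()
    return parsed
```

Splitting on the first `:` only keeps values that themselves contain a colon intact.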
+ +-------------------------------------------------------------------------------- +2.1 Benchmark llama-server flags -> LocalAI model-config mapping +-------------------------------------------------------------------------------- + -c 131072 -> context_size: 131072 (LLMConfig.ContextSize, yaml context_size) + -fa on -> flash_attention: "on" (LLMConfig.FlashAttention, yaml flash_attention; string) + -ngl 99 -> gpu_layers: 99 (LLMConfig.NGPULayers, yaml gpu_layers; or omit -> DefaultNGPULayers offloads all) + -b 2048 -> batch: 2048 (schema.PredictionOptions.Batch, yaml batch) [see caveat] + --parallel 128 -> options: ["parallel:128"] (grpc-server.cpp:629; alias n_parallel) + LLAMA_KV_PAGED=1 -> options: ["paged_kv:true"] (grpc-server.cpp:778) + LLAMA_MAX_BATCH_TOKENS=512 -> options: ["max_batch_tokens:512"] (grpc-server.cpp:821; alias mbt) + f16 KV -> f16: true (LLMConfig.F16, yaml f16) + (recommended for paged) -> options: ["kv_unified:false"] (grpc-server.cpp:746 - the per-slot paged + capacity/memory benefit only materializes with a per-sequence cache; + the patch comment explicitly recommends pairing paged with kv_unified:false) + + CAVEAT (-ub 512): LocalAI sets params.n_ubatch = params.n_batch = request->nbatch() + (grpc-server.cpp:528,532). There is NO separate config field for n_ubatch, so the + benchmark's `-b 2048 -ub 512` split is NOT exactly reproducible. Options: + (i) set batch: 512 -> n_batch=n_ubatch=512 (matches -ub; the decode-first + max_batch_tokens=512 budget is the dominant prefill lever anyway, and the + benchmark states decode throughput is budget-independent), OR + (ii) set batch: 2048 -> n_ubatch also 2048 (bigger physical batch, more KV scratch). + RECOMMEND (i) batch: 512 for the shipped gallery config (closest to the measured run + + lighter memory). Flag separately: a tiny grpc-server.cpp option `n_ubatch`/`ubatch` could + be added later to honor -b/-ub independently (not required to ship). 
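The mapping and the -ub caveat can be summarised in a few lines. This Python sketch uses the field and option names from the 2.1 table; `benchmark_config` and `effective_batches` are hypothetical helpers for illustration, and the sketch implements option (i), where `batch` follows the benchmark's `-ub` value.

```python
def benchmark_config(ctx: int, ubatch: int, parallel: int, budget: int) -> dict:
    """Model-config fields for option (i): `batch` follows the benchmark's -ub."""
    return {
        "context_size": ctx,        # -c
        "flash_attention": "on",    # -fa on
        "gpu_layers": 99,           # -ngl 99
        "f16": True,                # f16 KV
        "batch": ubatch,            # single field; see effective_batches()
        "options": [
            "paged_kv:true",                # LLAMA_KV_PAGED=1
            f"max_batch_tokens:{budget}",   # LLAMA_MAX_BATCH_TOKENS
            "kv_unified:false",             # recommended pairing for paged
            f"parallel:{parallel}",         # --parallel
        ],
    }


def effective_batches(batch: int) -> tuple:
    """Return (n_batch, n_ubatch) as the caveat describes: both equal `batch`."""
    return (batch, batch)
```

With `batch: 512` the physical and logical batch are both 512, which is why the benchmark's `-b 2048 -ub 512` split cannot be reproduced exactly from the config alone.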
+ +-------------------------------------------------------------------------------- +2.2 gallery/index.yaml entry - DENSE q36-27b-nvfp4 +-------------------------------------------------------------------------------- +- name: "qwen3.6-27b-nvfp4-paged" + url: "github:mudler/LocalAI/gallery/virtual.yaml@master" + urls: + - https://huggingface.co/<ORG>/Qwen3.6-27B-NVFP4-GGUF # placeholder, section 3 + description: | + Qwen3.6-27B dense, native Blackwell NVFP4 (FP4-MMA) GGUF. Configured for LocalAI's + paged-attention llama.cpp backend: on-demand paged KV + decode-first prefill budget. + Benchmarked on GB10/DGX Spark at 90-117% of vLLM dense decode at 1.5-3x lower memory. + license: "apache-2.0" # confirm vs Qwen license + tags: [ llm, gguf, nvfp4, reasoning ] + icon: https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png + overrides: + backend: llama-cpp-paged + f16: true + flash_attention: "on" + context_size: 131072 + gpu_layers: 99 + batch: 512 # see -ub caveat 2.1; matches the 512 ubatch floor + known_usecases: [ chat ] + options: + - use_jinja:true + - paged_kv:true # LLAMA_KV_PAGED=1 + - max_batch_tokens:512 # LLAMA_MAX_BATCH_TOKENS=512 (decode-first QoS budget) + - kv_unified:false # enables the per-slot paged capacity/memory benefit + - parallel:128 # --parallel 128 serving slots + parameters: + model: llama-cpp/models/Qwen3.6-27B-NVFP4-GGUF/q36-27b-nvfp4.gguf + template: + use_tokenizer_template: true + files: + - filename: llama-cpp/models/Qwen3.6-27B-NVFP4-GGUF/q36-27b-nvfp4.gguf + sha256: + uri: https://huggingface.co/<ORG>/Qwen3.6-27B-NVFP4-GGUF/resolve/main/q36-27b-nvfp4.gguf + +-------------------------------------------------------------------------------- +2.3 gallery/index.yaml entry - MoE q36-35b-a3b-nvfp4 +-------------------------------------------------------------------------------- +Same shape; the MoE is lighter on memory (~3B active).
parallel:128 + budget 256 was the +MoE decode-throughput sweet spot in the sweep, but 512 is fine as a default; if optimizing +purely for saturated MoE decode use max_batch_tokens:256. +- name: "qwen3.6-35b-a3b-nvfp4-paged" + urls: [ https://huggingface.co/<ORG>/Qwen3.6-35B-A3B-NVFP4-GGUF ] + ... + overrides: + backend: llama-cpp-paged + f16: true + flash_attention: "on" + context_size: 131072 + batch: 512 + options: + - use_jinja:true + - paged_kv:true + - max_batch_tokens:512 # or 256 for max saturated MoE decode (sweep winner) + - kv_unified:false + - parallel:128 + parameters: + model: llama-cpp/models/Qwen3.6-35B-A3B-NVFP4-GGUF/q36-35b-a3b-nvfp4.gguf + files: + - filename: llama-cpp/models/Qwen3.6-35B-A3B-NVFP4-GGUF/q36-35b-a3b-nvfp4.gguf + sha256: + uri: https://huggingface.co/<ORG>/Qwen3.6-35B-A3B-NVFP4-GGUF/resolve/main/q36-35b-a3b-nvfp4.gguf + +Note: these are the BENCHMARK serving configs. For an interactive single-user default you +may want a second lighter gallery variant (context_size 16384, parallel 4, drop the budget) +- optional, not required to ship the benchmark reproduction. + +================================================================================ +3. GGUF PUBLISHING (so the gallery uri: resolves) +================================================================================ + +The two GGUFs already exist on the DGX dev box (final_benchmark.csv references +q36-27b-nvfp4.gguf and q36-35b-a3b-nvfp4.gguf; README.md "Models" + "Benchmarks" +document provenance: dense = native Blackwell FP4 unsloth W4A4 lineage; MoE = 241 NVFP4 +tensors from nvidia modelopt weights). To publish: + + 1. HF repos (suggest two, under the org that owns the gallery-referenced weights): + <ORG>/Qwen3.6-27B-NVFP4-GGUF (single q36-27b-nvfp4.gguf) + <ORG>/Qwen3.6-35B-A3B-NVFP4-GGUF (single q36-35b-a3b-nvfp4.gguf) + ORG = localai-org (brand) or mudler (personal); pick per ownership of the conversions. + 2.
Upload each .gguf; compute sha256 (sha256sum) and paste into the gallery `files:` sha256 + (LocalAI verifies it on download). Without sha256 the entry still works but loses the + integrity check - fill it. + 3. Model card metadata: base_model Qwen/Qwen3.6-*, library_name gguf, quantization NVFP4, + pipeline_tag text-generation, license (confirm Qwen3.6 license terms - apache-2.0 vs + Qwen community license), a note that it REQUIRES the llama-cpp-paged backend (NVFP4 + + paged), and the GB10 benchmark table (link README.md "Benchmarks" numbers). + 4. NVFP4 requires a llama.cpp new enough to read the NVFP4 GGUF type. Confirm the pinned + LLAMA_VERSION in backend/cpp/llama-cpp/Makefile supports NVFP4 tensor types (the dev + tree that produced the GGUFs did). If the current pin predates NVFP4 GGUF support, the + backend pin must be bumped OR the paged patch series must carry the NVFP4 reader. THIS + IS A GATING CHECK before the gallery items are usable - verify on a GPU box. + 5. Provenance/licensing: the dense conversion derives from unsloth; the MoE from nvidia + modelopt weights. Ensure redistribution of the converted GGUFs is permitted and + attribute upstream in the card. + +================================================================================ +4. OPEN DECISIONS / BLOCKERS / BUILD COST +================================================================================ + +BACKEND NAME - RECOMMEND `llama-cpp-paged`. + - llama-cpp-paged (RECOMMENDED): descriptive (it IS the paged variant), hyphenated like + every sibling (llama-cpp/ik-llama-cpp/turboquant/ds4), collision-free in the + changed-backends.js endsWith() suffix scheme, self-documenting in the /backends/known + importer dropdown. Reads correctly next to "turboquant" and "ik-llama-cpp". + - localai-llama-cpp (branding alternative, ACCEPTABLE): keeps the LocalAI brand without a + dot; hyphenated and safe. Use this if marketing wants "LocalAI's own llama.cpp" framing. 
+ Slightly less self-explanatory about WHAT differs (paged) in the dropdown. + - localai-llama.cpp (the working name; NOT RECOMMENDED): the dot makes Dockerfile.localai- + llama.cpp and tag-suffix -cpu-localai-llama.cpp the only dotted ones in the repo, and + ".cpp" looks like a file extension to the suffix matcher. Avoid. + +BLOCKERS / GATING CHECKS (cannot be closed read-only, no GPU here): + 1. NVFP4 GGUF read support in the pinned LLAMA_VERSION (section 3.4). Must verify on GPU. + If unsupported, bump the pin (which also affects stock llama-cpp) or carry the reader. + 2. The two GGUFs are not yet on HF (section 3). Gallery uri + sha256 are placeholders + until upload. Blocks gallery validation only, not the backend build. + 3. -ub vs -b split (section 2.1) is not exactly reproducible without a tiny grpc-server + option; shipped config uses batch:512. Minor, not a blocker. + 4. Flipping stock LLAMA_PAGED?=off changes stock's shipped artifact (de-risking, intended) + - get explicit sign-off since it alters a heavily-used backend's build. + +PLATFORM SHIP MATRIX (RECOMMENDED PHASING - the variant is cheap because it reuses the same +base-grpc-* prebuilt bases and the same compile machinery, so each row is just CI minutes): + Phase 1 (the benchmark target - GB10/Blackwell is CUDA): + - cuda12 amd64, cuda13 amd64, cuda13 arm64 (sbsa), l4t-cuda-12 arm64 (NVFP4/paged win) + - cpu-all amd64 + cpu-all arm64 (the single CPU_ALL_VARIANTS build; baseline coverage) + Phase 2 (parity with stock llama-cpp coverage, only if demand): + - metal-darwin-arm64 (1.11), vulkan amd64/arm64, rocm amd64, intel sycl f16/f32 + Defer rocm/sycl/vulkan/metal unless asked - the paged + NVFP4 story is GPU/CUDA-centric + and these add CI cost without a clear consumer. 
+ +BUILD-COST ESTIMATE PER PLATFORM (with warm base-grpc-* base + ccache; the paged TUs are +~byte-identical to stock so a SHARED ccache id makes most objects free): + - CPU_ALL_VARIANTS (per arch): ~15-30 min warm / ~35-50 min cold. arm64 adds a gcc-14 + apt step. Two arches + a merge job. + - CUDA (per arch): ~25-45 min warm / ~45-75 min cold (nvcc dominates; ccache helps less + across CUDA arch flag changes). amd64 cuda12 + cuda13, arm64 cuda13 + l4t = 4 jobs. + - Metal/Darwin (if Phase 2): native macos-14 runner, ~20-35 min with the ccache cache. + - No base-images.yml change and no bootstrap dispatch (reuses existing base-grpc-* tags), + so the only new CI cost is the per-row build minutes above. PR builds read cache, don't + write; first master build per row pays the cold cost once, then warm. + +VERIFICATION (post-implementation, needs a GPU box - out of scope here): + - `make backends/llama-cpp-paged` builds + installs locally (from-source path). + - Confirm stock `make backends/llama-cpp` now builds clean (no paged-kv-manager.cpp in the + checkout) - proves the split. + - Load a published NVFP4 GGUF via the gallery entry, hit /v1/chat/completions, confirm the + server log shows LLAMA_KV_PAGED engaged (LLAMA_KV_PAGED_DEBUG trace) and the configured + max_batch_tokens/parallel took effect. + - go test ./core/gallery/importers/... green (importer drop-in case). + - node scripts/changed-backends.js dry-run: editing backend/cpp/llama-cpp/* retriggers + llama-cpp-paged (cross-trigger), editing backend/cpp/llama-cpp-paged/* triggers it too. 
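The last verification bullet, together with the suffix-ordering note in 1.9, can be reasoned about without running CI. The following is a Python sketch of the same string checks, not the JavaScript in scripts/changed-backends.js: both function names are hypothetical, and the stock branch's return value is an assumption since only the paged branch is spelled out in 1.9.

```python
from typing import List, Optional


def infer_backend_path(dockerfile: str) -> Optional[str]:
    """Specific suffix first, generic second (the ordering 1.9a asks for)."""
    if dockerfile.endswith("llama-cpp-paged"):
        return "backend/cpp/llama-cpp-paged/"
    if dockerfile.endswith("llama-cpp"):
        # Assumed return value for the stock branch.
        return "backend/cpp/llama-cpp/"
    return None


def paged_is_triggered(changed_files: List[str]) -> bool:
    """Paged rebuilds on its own changes or on shared stock-source changes."""
    own = any(f.startswith("backend/cpp/llama-cpp-paged/") for f in changed_files)
    # Cross-trigger from 1.9c: the trailing slash keeps this from matching
    # the paged directory itself.
    shared = any(f.startswith("backend/cpp/llama-cpp/") for f in changed_files)
    return own or shared
```

The trailing slash in the cross-trigger prefix is what keeps the two directories distinct: `backend/cpp/llama-cpp-paged/Makefile` does not start with `backend/cpp/llama-cpp/`.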
+ +================================================================================ +END OF PLAN +================================================================================ diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PAGED_BITEXACT_NOTE.md b/backend/cpp/llama-cpp-localai-paged/docs/PAGED_BITEXACT_NOTE.md new file mode 100644 index 000000000000..c422fcc58f29 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/docs/PAGED_BITEXACT_NOTE.md @@ -0,0 +1,75 @@ +# Paged bit-exactness gate - per path (canonical references) + +## TL;DR + +The greedy decode of the **paged** path does not byte-match the **non-paged** +path for the MoE model. This is a **benign FP-accumulation-order difference of +the paged attention reduction**, KL-validated against the f16 reference. It is +**not a bug**. The bit-exactness gate is therefore **per path**: + +| path | model | canonical md5 | +|------|-------|---------------| +| non-paged | MoE q36-35b-a3b-nvfp4 | `07db32c2bcb78d17a43ed18bc22705cd` | +| paged | MoE q36-35b-a3b-nvfp4 | `8cb0ce23777bf55f92f63d0292c756b0` | +| non-paged | dense q36-27b-nvfp4 | `5951a5b4d624ce891e22ab5fca9bc439` | +| paged | dense q36-27b-nvfp4 | `5951a5b4d624ce891e22ab5fca9bc439` (bit-exact to non-paged) | + +Gate command (chat-template / conversation path): +``` +llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" \ + -n 48 --temp 0 --seed 1 +# paged: prefix with LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 +``` +Note: use the default chat-template path (do **not** pass `-no-cnv`; raw +completion lands in a different md5 namespace). + +**Future paged-MoE regressions compare to the PAGED reference `8cb0ce23`, not to +the non-paged `07db32c2`.** Dense is bit-exact across paths, so dense uses the +single reference `5951a5b4`. + +## Why dense is bit-exact but MoE is not + +Dense paged decode reproduces the non-paged reduction order exactly, so dense +greedy md5 is identical across paths. 
The MoE path runs additional kernels (the +NVFP4 MoE GEMM + expert routing) whose multi-kernel accumulation order differs +between the paged and non-paged attention layouts. Over a long greedy decode this +flips a small number of near-tied argmaxes, changing the byte stream. The same +divergence is present on the 0028 baseline, with `LLAMA_MOE_FORCE_GRAPHS` on or +off, and with the patch-0029 block-table cache on or off - it is a property of +the paged attention path, not of any one lever. + +## KL evidence that the paged path is sound (the load-bearing check) + +`llama-perplexity --kl-divergence` on `q36-35b-a3b-nvfp4.gguf`, 16 chunks, +`-c 512 -ngl 99 --seed 1`, base logits from the f16 reference +(`darwin_36b_opus/f16.gguf`, PPL 7.3734): + +| comparison | PPL(Q) | KL divergence | Same top p | Cor | +|------------|-------:|--------------:|-----------:|----:| +| f16 reference | 7.3734 | - | - | - | +| **non-paged** vs f16 | 7.3896 | 0.136597 +/- 0.003157 | 84.314% | 97.68% | +| **paged** vs f16 | 7.4009 | 0.136000 +/- 0.003285 | 84.828% | 97.58% | +| paged vs non-paged (direct) | 7.4009 (base 7.3818) | 0.050011 +/- 0.001653 | 89.044% | 99.04% | + +Direct paged-vs-non-paged: Mean Delta-p = 0.079% (no bias), RMS Delta-p = 6.187%. + +### Verdict: BENIGN + +- **Paged does not diverge from the f16 ground truth more than non-paged does.** + KLD(paged||f16) = 0.13600 <= KLD(nonpaged||f16) = 0.13660, and PPL(paged) = + 7.4009 ~ PPL(nonpaged) = 7.3896 (difference 0.011, far inside the +/- 0.29 + error bars). A real paged-MoE correctness bug would push paged measurably + *further* from f16; it does not (it is marginally closer). +- **Paged and non-paged cluster together.** They agree with each other (KLD 0.050, + 89.0% same-top-p) more than either agrees with f16 (KLD ~0.137, ~84% same-top-p), + with essentially zero probability bias. 
That is the signature of two equivalent + FP-reorderings of the same quantized model, both equally approximating the f16 + ground truth - not a quality regression. +- The direct same-top-p of 89.0% is below a naive ">99%" heuristic, but that + heuristic is calibrated for higher-precision models. In a 4-bit (NVFP4) model + logit near-ties are abundant, so a different-but-equivalent reduction order + flips ~11% of argmaxes with no quality cost (proven by the equal KLD-to-f16 and + zero Delta-p bias). + +Therefore the canonical gate is per path, and `8cb0ce23` is the validated paged +reference for the MoE deployment path. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md new file mode 100644 index 000000000000..3e78465da444 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md @@ -0,0 +1,2777 @@ +# PARITY_HANDOFF: how to pick up the GB10 vLLM-parity work + +> 2026-07-02 forward direction: the active plan is now +> [`EXECUTION_REARCH_SCOPE.md`](EXECUTION_REARCH_SCOPE.md), which reframes the +> per-lever "hardware floor" verdict as *ggml-execution-architecture-conditional* +> (same-silicon 2-3x is software) and scopes an additive, phased (P1 bf16-native +> stream, P2 expert-major fused MoE region, P3 Marlin large-M retry on top of +> P1+P2, P4 token-budget scheduler, P5 blocked-solve GDN, P6 fp8 KV) program with +> a falsifiable P0 kill-gate per phase. The port-forensics finding is that the +> failed single-kernel/single-boundary A/Bs below failed on *integration tax* +> (dropped into a materialize-every-node executor), not because the kernels are +> GB10-hostile; the reject log below is the evidence that grounds those verdicts. +> Read the scope doc first for what to build next. +> +> 2026-06-30 update: this handoff is now historical procedure, not the active +> verdict. 
The GB10 investigation was reopened in `GB10_PARITY_REOPEN_SPEC.md` +> and `GB10_PARITY_PHASE0_RESULTS.md`, with Phase 6 serving-nsys evidence and +> the active follow-up plans under `docs/superpowers/plans/`. Use those files for +> the current state before relying on the older "closed" conclusion below. +> +> 2026-07-01 Phase112 update: keep the new default-off +> `LLAMA_W4A16_DIRECT_A=1` direct activation staging hook, especially combined +> with Phase110 `LLAMA_MOE_GPU_SORT=1`. Artifact: +> `/home/mudler/bench/phase112_w4a16_direct_a/20260701_231749_direct_a`. +> Selected gates passed `13/13` for W4A16+GPU-sort, direct-A, and +> direct-A+GPU-sort. Direct-A+GPU-sort improved the 257-token W4A16 fallback +> rows versus W4A16+GPU-sort control (`MOE_SWIGLU_DOWN 1551.08 -> 1477.74 us`, +> `MUL_MAT_ID_RAGGED_MOE 2278.50 -> 2166.22 us`) but was neutral/slightly +> slower on 128-token rows. Canonical README md5 gates are green: MoE +> `8cb0ce23777bf55f92f63d0292c756b0`, dense +> `5951a5b4d624ce891e22ab5fca9bc439`; compact supported op gates are green +> (`SSM_CONV 45/45`, `SSM_CONV_SPLIT 6/6`, `GET_ROWS 49/49`, +> `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`). +> This is still default-off structural groundwork, not parity: W4A16 fallback +> remains slower than the default grouped-MMQ path. Use the patch-series README +> md5 command as canonical; the handoff `-no-cnv -c 4096` snippet produced +> stable but non-canonical md5s for both candidate and control. +> +> 2026-07-01 Phase113 update: reject the combined direct-A GPU-tile descriptor +> attempt. Artifact: +> `/home/mudler/bench/phase113_w4a16_direct_a_gpu_tiles/20260701_233345_no_readback`. +> The candidate (`LLAMA_W4A16_GPU_TILES=1` on top of Phase112 direct-A+GPU-sort) +> avoided the `n_tiles` readback by launching over zero-initialized `max_tiles` +> and returning early on `rows <= 0`. 
Selected correctness passed `13/13`, but +> perf failed the keep gate: `MOE_SWIGLU_DOWN n=257` was flat +> (`1478.16 -> 1476.36 us`) and `MUL_MAT_ID_RAGGED_MOE n=257` regressed +> (`2148.44 -> 2214.23 us`). The source was reverted and post-revert +> Phase112 direct-A+GPU-sort selected gates passed `13/13`. Next W4A16/MoE work +> should not revisit compact GPU tile descriptors; use vLLM-style padded routing +> metadata (`sorted_token_ids`, expert ids per M block, padded row count) if +> continuing this line. +> +> 2026-07-01 Phase114 update: reject the naive padded routing implementation. +> It implemented the vLLM-style metadata contract with separate padded source +> ids and destination ids for llama.cpp, plus an expert-id W4A16 consumer mode +> and a direct scatter that skipped compact `get_rows_cuda`. Correctness passed +> (`13/13`) but perf failed: after a fix using `num_tokens_post_pad` early +> returns, `MOE_SWIGLU_DOWN n=257` regressed `1477.88 -> 1726.27 us` and +> `MUL_MAT_ID_RAGGED_MOE n=257` regressed `2163.35 -> 2650.93 us`. Artifacts: +> `/home/mudler/bench/phase114_w4a16_padded_routing/20260701_234634_padded_meta` +> and +> `/home/mudler/bench/phase114_w4a16_padded_routing/20260701_235003_padded_meta_fix1`. +> Source was reverted; post-revert Phase112 direct-A+GPU-sort selected gate +> passed `13/13`. Padded metadata is not enough by itself on GB10 because sparse +> expert occupancy makes padded activation/output traffic too expensive. +> +> 2026-07-02 Phase115 update: reject another small-M/tile-policy shortcut. +> Phase115 re-tested the existing default-off `LLAMA_MOE_SMALL_M_TILE=16/32/64` +> knob on the newer Phase108 whole-graph MoE sentinels. Artifact: +> `/home/mudler/bench/phase115_moe_small_m_sentinel/20260702_020258`. +> Control and all three tile caps passed selected correctness (`13/13` each), +> but no candidate met the promotion rule. 
The 257-token ragged down row +> regressed for every cap (`1452.30 us` control vs `1455.02`, `1458.71`, and +> `1456.88 us`). Do not add name-based down special cases or another MMQ +> tile-policy patch. The next credible target is a true fused routed-MoE kernel +> or a graph-level fusion that removes materialized activation/output traffic. +> +> 2026-07-02 Phase116 update: reject the standalone graph-level +> SwiGLU-to-MMQ-activation-quant fusion. The default-off candidate +> `LLAMA_MOE_SWIGLU_DOWN_FUSED_QUANT=1` detected the plain +> `GLU -> down MUL_MAT_ID` pattern and computed `silu(gate) * up` directly into +> the grouped-MMQ NVFP4 activation buffer. Artifact: +> `/home/mudler/bench/phase116_moe_swiglu_down_fused_quant/20260702_022611`. +> Correctness passed (`13/13`) and the fix1 route emitted the fused marker +> (`6` hits), but perf was not useful: `MOE_SWIGLU_DOWN n=257` was flat +> (`1024.90 -> 1024.69 us`), `n=128` regressed (`806.33 -> 808.79 us`), and the +> ragged sentinel drifted slower. Source was reverted and post-revert selected +> gate passed `13/13`. Do not retry this narrow fused-quant route; the next +> fused-MoE attempt must remove a larger boundary, such as route-once metadata +> shared by both expert GEMMs plus fused GEMM1/activation/GEMM2 or +> weighted-combine/scatter. +> +> 2026-07-02 Phase117 update: keep the default-off MoE boundary trace as +> diagnostic instrumentation only. Artifact: +> `/home/mudler/bench/phase117_moe_route_once_boundary/20260702_024140`. +> The trace decomposes `MOE_SWIGLU_DOWN` into route-sort, activation +> quantization, grouped-MMQ launch, GLU, and graph-pattern records under +> `LLAMA_MOE_BOUNDARY_TRACE=1`; optional timing is gated by +> `LLAMA_MOE_BOUNDARY_TIMING=1`. Inline CUDA event timing initially aborted +> under CUDA graph capture, so the guarded trace emits `us=-1` while capturing +> and only produces real event timings with `GGML_CUDA_DISABLE_GRAPHS=1`. 
+> Post-guard selected gates passed (`13/13`), trace mode passed (`7/7`), and +> canonical gates passed: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, +> `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. The timing attribution does not +> fund another local route-sort, tile, GLU, or activation-quant shortcut. The +> next MoE source phase should own a larger pipeline boundary: shared +> route-once metadata across gate_up/down and/or whole-pattern +> GEMM1->activation->GEMM2 execution. +> +> 2026-07-02 Phase118 update: reject standalone route-metadata caching. +> Artifact: +> `/home/mudler/bench/phase118_moe_route_cache/20260702_030549`. The +> default-off candidate `LLAMA_MOE_ROUTE_CACHE=1` stored ids-derived grouped-MMQ +> route metadata in context-owned buffers and reused it within a graph +> evaluation. It was correctness-clean (`13/13` default, opt-in, and +> post-reject) and the trace showed reuse (`23` hits, `3` misses on +> `MOE_SWIGLU_DOWN n=128`), but perf was too small: `MOE_SWIGLU_DOWN n=257` +> improved only `1017.711 -> 1011.915 us` (`+0.57%`) and `n=128` regressed +> `799.360 -> 803.738 us` (`-0.55%`). Runtime cache source was reverted; only a +> local `ggml_cuda_mmq_ids_meta` helper refactor remains as low-conflict +> groundwork. Do not retry metadata-cache-only work. The next attempt must own +> more of the vLLM-style pipeline: GEMM1->activation->GEMM2 and/or +> scatter/combine, not just skipping one `mm_ids_helper` launch. +> +> 2026-07-02 Phase119 update: keep the default-off whole-pattern contract trace +> after fix1. Initial artifact: +> `/home/mudler/bench/phase119_moe_whole_pattern_contract/20260702_034729`; +> fix1 artifact: +> `/home/mudler/bench/phase119_moe_whole_pattern_contract/20260702_035126_fix1`. +> The initial trace proved coverage but missed the overhead rule on +> `MOE_SWIGLU_DOWN n=257` (`1015.070 -> 1028.937 us`, `-1.35%`). 
Fix1 moved +> detector work off the default path unless `LLAMA_MOE_WHOLE_PATTERN_TRACE` or +> the existing boundary trace is enabled. Fix1 gates are green: selected +> `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` `13/13`, trace `MOE_SWIGLU_DOWN` +> `7/7`, canonical MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, +> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Trace overhead is now within +> rule (`MOE_SWIGLU_DOWN n=128` `805.400 -> 805.584 us`, `-0.02%`; +> `n=257` `1019.715 -> 1021.836 us`, `-0.21%`) and emits supported NVFP4 +> markers for both `n_tokens=128` and `257`. This is diagnostic scaffolding, +> not a runtime optimization. The next executor attempt should match at the +> earlier `gate_up MUL_MAT_ID` node and skip through `VIEW, VIEW, GLU, down +> MUL_MAT_ID`; the current `GLU -> down` hook is validation-only because GEMM1 +> has already executed. +> +> 2026-07-02 Phase120 update: keep the default-off early whole-pattern matcher +> after fix2. Initial artifact: +> `/home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040153`; +> fix2 artifact: +> `/home/mudler/bench/phase120_moe_early_whole_pattern/20260702_040725_fix2`. +> The initial/fix1 versions proved `skip_ready=4` but emitted noisy unsupported +> markers from unrelated `MUL_MAT_ID` candidates. Fix2 emits only the actual +> early pattern and is clean: selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` +> `13/13`, early trace `MOE_SWIGLU_DOWN` `7/7`, canonical MoE md5 +> `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and +> `MUL_MAT_ID 806/806`. It emits exactly six supported early markers for the +> perf sentinels, covering `n_tokens=128` and `257`, with `skip_ready=4`, +> `ids_match=1`, and `swiglu=1`. Trace overhead is within rule +> (`MOE_SWIGLU_DOWN n=128` `803.937 -> 808.978 us`, `-0.62%`; +> `n=257` `1020.412 -> 1026.073 us`, `-0.55%`). The next source phase can now +> implement a guarded executor at this early matcher. 
First prove safe +> ownership/skip accounting for the five-node sequence, then move route-plan +> reuse and activation/down execution into the helper. +> +> 2026-07-02 Phase121 update: keep the default-off executor proof after fix1. +> Initial artifact: +> `/home/mudler/bench/phase121_moe_whole_pattern_exec_proof/20260702_041543`; +> fix1 artifact: +> `/home/mudler/bench/phase121_moe_whole_pattern_exec_proof/20260702_041739_fix1`. +> The initial run passed correctness but emitted zero exec markers because the +> exec branch was accidentally nested under the early-trace env condition. +> Fix1 makes `LLAMA_MOE_WHOLE_PATTERN_EXEC=1` engage independently. Gates are +> green: selected `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE` `13/13`, exec +> `MOE_SWIGLU_DOWN` `7/7`, canonical MoE md5 `8cb0ce23`, dense md5 +> `5951a5b4`, `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Exec perf emits +> six `skip=4` markers covering `n_tokens=128` and `257`, and target perf is +> neutral (`MOE_SWIGLU_DOWN n=128` `807.772 -> 806.051 us`, `+0.21%`; +> `n=257` `1021.115 -> 1020.839 us`, `+0.03%`). This proves ownership and skip +> accounting only; it is not a fused-MoE speedup. The next source phase should +> replace one internal boundary inside this helper, preferably route-plan reuse +> or activation in route-slot order, with the same md5/op gates. +> +> 2026-07-02 Phase122 update: reject route-only metadata reuse inside the +> Phase121 executor. Artifact: +> `/home/mudler/bench/phase122_moe_shared_route_meta/20260702_043212`. +> The candidate exposed `ggml_cuda_mmq_ids_meta` as a public MMQ helper and +> used `LLAMA_MOE_WHOLE_PATTERN_SHARED_ROUTE=1` to build route metadata once +> for both `gate_up` and `down`. Correctness passed (`13/13` selected and +> `7/7` shared-route), but perf missed the keep gate: +> `MOE_SWIGLU_DOWN n=128` regressed `808.190 -> 811.836 us` and `n=257` +> regressed `1020.850 -> 1051.666 us` versus the Phase121 executor. 
Source was +> reverted, including the public metadata API and shared-route env. Post-reject +> gates on the reverted tree passed (`13/13` selected and `7/7` Phase121 exec) +> with six retained exec markers. Do not retry route-only metadata reuse. The +> next MoE executor scope should target activation/down data layout, direct +> activation-to-down input, or a larger GEMM1->activation->GEMM2 fused boundary. +> +> 2026-07-02 Phase123 update: reject standalone fused-down activation +> quantization inside the Phase121 executor. Artifact: +> `/home/mudler/bench/phase123_moe_executor_fused_down_input/20260702_025811`; +> red check: +> `/home/mudler/bench/phase123_moe_executor_fused_down_input/red_20260702_025031`. +> The candidate used `LLAMA_MOE_WHOLE_PATTERN_FUSED_DOWN=1` to run `gate_up`, +> compute `silu(gate) * up` directly into the sorted NVFP4 down MMQ activation +> buffer, and launch the existing down MMQ kernel. Correctness passed +> (`13/13` selected, `7/7` fused-down, six fused markers), but perf was flat: +> versus Phase121 exec, `MOE_SWIGLU_DOWN n=128` was +> `811.153 -> 810.618 us` and `n=257` was `1023.090 -> 1023.657 us`. +> Source was reverted, including the fused-down env, MMQ helper, and NVFP4 +> fused quant kernel. Post-reject gates passed (`13/13` selected, `7/7` +> Phase121 exec, six exec markers). Do not retry a single-boundary +> SwiGLU-to-down-quant shortcut; if continuing MoE source work, scope a full +> expert-major packed pipeline that owns `GEMM1->activation->GEMM2`, or pivot to +> another measured bottleneck. +> +> 2026-07-02 Phase124 update: current-stack graph-node serving was refreshed +> after the Phase122/123 rejections. Artifact: +> `/home/mudler/bench/phase124_current_moe_profile/20260702_031205`. +> Pre/post gates are green: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, +> `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. 
At `N=128`, prompt `128`, +> generation `64`, serving under graph-node profiling was +> `agg_tps 206.2`, `decode_agg_tps 320.3`, `prefill_tps 1536.4`, wall +> `39.738s`. Fine buckets are now `mmq_nvfp4 6074.78 ms` (`30.17%`) and +> `gdn_core 5888.31 ms` (`29.25%`), with `act_quant` only `674.88 ms` +> (`3.35%`). This explains why single-boundary activation/quant attempts were +> flat. The next source work must reduce one of the two dominant buckets: +> either a full expert-major MoE pipeline for `mmq_nvfp4`, or a default-off GDN +> decode/core experiment for `gdn_core`. Do not spend more GB10 time on +> route-only metadata reuse, fused-down quantization, or MoE tile-policy knobs +> unless a new profile makes those buckets material. +> +> 2026-07-02 Phase125 scoping update: two independent code explorers and a +> local GDN audit challenged the Phase124 fork in the road. The chosen next +> source attempt is the MoE side, but only as a first maintainable slice: +> implement a default-off MMQ sorted-output primitive behind +> `LLAMA_MOE_EXPERT_MAJOR_SORTED_OUT=1`, immediately unsort as a proof, and +> measure `MOE_SWIGLU_DOWN` before attempting the full +> `gate_up -> SWIGLU -> down` expert-major executor. Rationale: vLLM's portable +> advantage is keeping activations expert-major across both GEMMs and +> unpermuting once; Phase122/123 failed because they only touched route metadata +> or one activation boundary. Do not copy CUTLASS/FlashInfer pointer-array, TMA, +> or FP4 scale-swizzle internals. A small GDN patch is not funded by current +> evidence because previous decode/core micro-attempts already rejected the +> obvious geometry/store/broadcast/conv-state shortcuts. Plan: +> `docs/superpowers/plans/2026-07-02-moe-expert-major-sorted-output-phase125.md`. +> +> 2026-07-02 Phase125 result: reject the MMQ sorted-output plus immediate +> unsort proof. 
Artifact: +> `/home/mudler/bench/phase125_moe_expert_major_sorted_output/20260702_033931`; +> post-reject: +> `/home/mudler/bench/phase125_moe_expert_major_sorted_output/post_reject_20260702_034232`. +> The candidate was default-off and correctness-clean (`13/13` default +> selected, `7/7` opt-in `MOE_SWIGLU_DOWN`, 12 sorted markers), but perf failed +> decisively: versus Phase121 exec, `MOE_SWIGLU_DOWN n=128` regressed +> `805.16 -> 888.76 us` and `n=257` regressed `1023.83 -> 1192.05 us`. +> Source was reverted. Post-reject gates are green: selected `13/13`, Phase121 +> exec `7/7` with six markers, MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, +> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Do not retry a path that adds +> a sorted-output temporary and immediately unsorts. A future expert-major MoE +> attempt must keep sorted activations through the down GEMM and unpermute only +> once after the full FFN, or pivot to a larger GDN recurrence design. + +> 2026-07-02 Phase126 result: keep the grouped-MMQ presorted helper scaffold. +> The patch only touches `mmq.cu`/`mmq.cuh`, refactors the current MoE id path +> behind explicit `ids_src1`/`ids_dst`/`expert_bounds` metadata, and exposes a +> `src1_sorted` entry point for the future whole-MoE executor. Fixed artifact: +> `/home/mudler/bench/phase126_mmq_presorted_helper/fix1_20260702_040858`. +> Gates were green: selected `13/13`, MoE md5 +> `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 +> `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT 1146/1146`, +> `MUL_MAT_ID 806/806`. Focused perf was neutral: +> `MOE_SWIGLU_DOWN n=128 805.99 us`, `MUL_MAT_ID_RAGGED_MOE n=128 +> 1243.85 us`, `MOE_SWIGLU_DOWN n=257 1018.74 us`, +> `MUL_MAT_ID_RAGGED_MOE n=257 1452.84 us`. This is not a parity win by +> itself; it is the dependency for Phase127 to keep `gate_up -> SWIGLU -> down` +> in expert-major order and unpermute only once after the full FFN. 
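The expert-major contract that Phase125-127 keep returning to can be illustrated with a plain-Python sketch. This is not the CUDA implementation and nothing below is taken from `mmq.cu`: `moe_token_major`, `moe_expert_major`, `expert_ffn`, and `make_case` are hypothetical names, and `expert_bounds` only mirrors the metadata name used in Phase126. The sketch assumes a top-k routed FFN with row-major toy weights and shows the data flow only: sort routed slots by expert once, run `gate_up -> SWIGLU -> down` on the sorted rows, and unpermute once after the full FFN.

```python
import math
import random

def matvec(w, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def silu(v):
    return v / (1.0 + math.exp(-v))

def expert_ffn(gate_w, up_w, down_w, x):
    """One expert: down(silu(gate(x)) * up(x))."""
    g, u = matvec(gate_w, x), matvec(up_w, x)
    return matvec(down_w, [silu(a) * b for a, b in zip(g, u)])

def moe_token_major(tokens, routes, experts):
    """Reference path: visit (token, expert) pairs in token order."""
    out = [[0.0] * len(t) for t in tokens]
    for t, x in enumerate(tokens):
        for e, w in routes[t]:
            y = expert_ffn(*experts[e], x)
            out[t] = [o + w * yi for o, yi in zip(out[t], y)]
    return out

def moe_expert_major(tokens, routes, experts):
    """Sort routed slots by expert, run the whole FFN per slice, unpermute once."""
    # Flatten (expert, token, weight) slots and sort them stably by expert id.
    slots = [(e, t, w) for t, r in enumerate(routes) for e, w in r]
    order = sorted(range(len(slots)), key=lambda i: slots[i][0])
    # expert_bounds[e]..expert_bounds[e + 1] is expert e's slice of sorted slots.
    counts = [0] * len(experts)
    for e, _, _ in slots:
        counts[e] += 1
    expert_bounds = [0]
    for c in counts:
        expert_bounds.append(expert_bounds[-1] + c)
    # Activations stay in expert-major order through gate_up, SWIGLU, and down.
    sorted_out = [None] * len(slots)
    for e in range(len(experts)):
        for pos in range(expert_bounds[e], expert_bounds[e + 1]):
            _, t, _ = slots[order[pos]]
            sorted_out[pos] = expert_ffn(*experts[e], tokens[t])
    # Single unpermute plus weighted combine after the full FFN.
    out = [[0.0] * len(t) for t in tokens]
    for pos, i in enumerate(order):
        _, t, w = slots[i]
        out[t] = [o + w * yi for o, yi in zip(out[t], sorted_out[pos])]
    return out

def make_case(n_tokens=6, n_embd=4, n_ff=8, n_experts=3, top_k=2, seed=0):
    """Build a tiny random MoE problem (illustrative sizes only)."""
    rng = random.Random(seed)
    rand = lambda r, c: [[rng.uniform(-1, 1) for _ in range(c)] for _ in range(r)]
    experts = [(rand(n_ff, n_embd), rand(n_ff, n_embd), rand(n_embd, n_ff))
               for _ in range(n_experts)]
    tokens = rand(n_tokens, n_embd)
    routes = [[(e, rng.uniform(0, 1)) for e in rng.sample(range(n_experts), top_k)]
              for _ in range(n_tokens)]
    return tokens, routes, experts

tokens, routes, experts = make_case()
ref = moe_token_major(tokens, routes, experts)
got = moe_expert_major(tokens, routes, experts)
print(all(math.isclose(a, b, rel_tol=1e-12, abs_tol=1e-12)
          for ra, rb in zip(ref, got) for a, b in zip(ra, rb)))
```

Agreement on this toy only checks the permutation bookkeeping. It says nothing about the md5 gates or the perf keep rule, which still have to be run against the real kernels.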
+ +> 2026-07-02 Phase127 result: reject and revert the whole-MoE expert-major +> executor built on the Phase126 helper. Red: +> `/home/mudler/bench/phase127_moe_whole_expert_major/red_20260702_042125` +> passed by fallback with zero markers. Candidate green: +> `/home/mudler/bench/phase127_moe_whole_expert_major/green2_20260702_042916` +> passed default selected `13/13` and opt-in `MOE_SWIGLU_DOWN 7/7`, emitting +> six `LLAMA_MOE_WHOLE_EXPERT_MAJOR` markers after fixing the down-weight shape +> interpretation (`down_w` is `[n_ff, n_embd, experts]`). Perf artifact: +> `/home/mudler/bench/phase127_moe_whole_expert_major/perf_20260702_043104`. +> It failed the keep rule: `MOE_SWIGLU_DOWN n=128` regressed +> `802.57 -> 812.14 us`; `n=257` regressed `1023.25 -> 1039.36 us`; +> ragged standalone was essentially flat. Source was reverted. Post-reject: +> `/home/mudler/bench/phase127_moe_whole_expert_major/post_reject_20260702_043318` +> passed selected `13/13`, MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, +> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Do not retry the same +> fake-tensor whole-executor shape; the next MoE attempt must remove more +> temporary traffic or become a real fused grouped MMQ/SWIGLU/down path. A +> separate alternative is the previously scoped Qwen3Next BF16 GDN S-cache +> experiment, but that needs non-md5 numerical gates. + +> 2026-07-02 Phase128 result: reject/revert the Qwen3Next BF16 GDN S-cache +> selector probe for the current target. Artifact: +> `/home/mudler/bench/phase128_qwen3next_gdn_bf16_s_cache/default_20260702_043939` +> built and passed default gates (`GATED_DELTA_NET 48/48`, canonical MoE md5 +> `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT`, `MUL_MAT_ID`). Verbose smoke +> artifact: +> `/home/mudler/bench/phase128_qwen3next_gdn_bf16_s_cache/smoke3_20260702_044434` +> showed the active decision model is `qwen35moe`, not Qwen3Next, and S cache +> remained `f32` under `LLAMA_QWEN3NEXT_GDN_S_CACHE_TYPE=bf16`. 
No true +> Qwen3Next GGUF was found on DGX. The relevant Qwen35/Qwen35MoE BF16 S-cache +> lever was already Phase81/82: it cut `gdn_core` but changed MoE md5 and +> missed the full f16-reference KL acceptance band. Do not retry this exact +> lever unless the quality gate is explicitly re-scoped or a real Qwen3Next +> model artifact is available. + +> 2026-07-02 Phase129 result: reject/revert the Qwen35/Qwen35MoE grouped Q/K +> broadcast probe for fused GDN. Plan: +> `docs/superpowers/plans/2026-07-02-qwen35-gdn-qk-grouped-bcast-phase129.md`. +> The candidate added a default-off `LLAMA_QWEN35_GDN_QK_BCAST=1` branch in +> `src/models/qwen35.cpp` and `src/models/qwen35moe.cpp`, reusing the existing +> Qwen3Next `ggml_gated_delta_net_set_bcast()` path. Default gates were green: +> `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/default_20260702_065445` +> passed MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `GATED_DELTA_NET 46/46`, +> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. A standalone opt-in gate +> artifact at `optin_20260702_065604` was invalid because +> `paged-inference-gates.sh` only passes completion env through `EXTRA_ENV`. +> The valid opt-in pre-gate from +> `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/decode_optin_20260702_070149/gate_pre` +> changed MoE md5 to `b773e2f032aa0e992626d486b321808e`, so profiling was +> stopped and the source was reverted. Post-reject: +> `/home/mudler/bench/phase129_qwen35_gdn_qk_bcast/post_reject_20260702_070258` +> passed canonical MoE/dense md5, `GATED_DELTA_NET 46/46`, `MUL_MAT 1146/1146`, +> and `MUL_MAT_ID 806/806`; rebuilt `libllama.so` has zero +> `LLAMA_QWEN35_GDN_QK_BCAST` strings. Do not retry this Qwen3Next +> grouped-broadcast port for Qwen35/Qwen35MoE under the current bit-exact md5 +> rule. + +> 2026-07-02 Phase130 result: current-stack graph-node serving profile refresh, +> measurement-only. Artifact: +> `/home/mudler/bench/phase130_current_stack_profile/20260702_070949`. 
Shape: +> MoE `q36-35b-a3b-nvfp4`, `N=128`, prompt `128`, generation `64`, +> `PARALLEL=128`, `CTX=131072`. Pre/post gates passed canonical MoE md5 +> `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and +> `MUL_MAT_ID 806/806`. Serving metrics: `agg_tps 208.0`, +> `decode_agg_tps 326.9`, `prefill_tps 1519.6`, `TTFT mean 8170.6 ms`, wall +> `39.38 s`, total kernel time `20.1559 s`. The profile confirms the live +> bottleneck remains split between `mmq_nvfp4 6009.52 ms` (`29.82%`) and +> `gdn_core 5891.40 ms` (`29.23%`). FA/mask cleanup is not funded: +> `get_rows 280.62 ms` (`1.39%`) and `fa 257.38 ms` (`1.28%`). The next source +> attempt must target a larger MoE/FFN-GEMM executor/kernel or a materially +> different GDN recurrent-state/packed-decode design, not another paged-mask, +> route-only, activation-only, grouped-broadcast, BF16-cache, or launch-geometry +> shortcut. + +> 2026-07-02 Phase131 result: source-selection challenge, no source changes. +> Plan: +> `docs/superpowers/plans/2026-07-02-fused-routed-ffn-phase131.md`. Two +> read-only explorers challenged the Phase130 fork. MoE/FFN-GEMM source work is +> not funded unless it becomes a real fused routed-FFN kernel/executor; another +> route-only, activation-only, W4A16, tile-policy, sorted-output, or fake +> executor patch is expected to repeat Phases 110-127. GDN source work is not +> funded unless it materially reduces f32 recurrent-state traffic without +> BF16/quality drift; launch geometry, gather/identity, producer/store fusion, +> BF16 S-cache, and grouped Q/K broadcast have already failed or changed md5s. +> The next active line is to audit vLLM's fused MoE design and llama.cpp's +> current whole-pattern executor hook for a default-off fused routed-FFN PoC. +> If that audit does not produce a concrete low-conflict hook, require a +> standalone CUDA PoC before touching llama.cpp source. +> +> 2026-07-02 Phase132 result: keep the new default-off routed-FFN PoC scaffold. 
+> Plan: +> `docs/superpowers/plans/2026-07-02-routed-ffn-poc-phase132.md`. Artifact: +> `/home/mudler/bench/phase132_routed_ffn_poc/20260702_072725`. Source adds +> `ggml/src/ggml-cuda/moe-ffn.cu/.cuh` and a narrow hook in +> `ggml/src/ggml-cuda/ggml-cuda.cu` behind `LLAMA_MOE_ROUTED_FFN_POC=1`. +> The helper currently executes the baseline `gate_up -> SWIGLU -> down` +> sequence through the existing whole-pattern hook, so it is a scaffold, not a +> parity speedup. Initial incremental build failed until CMake was reconfigured +> to pick up the new globbed CUDA source; after `cmake -S . -B build`, build +> passed. Selected default and opt-in gates passed +> `MOE_SWIGLU_DOWN,MUL_MAT_ID_RAGGED_MOE 13/13`; opt-in emitted six exec +> markers and `libggml-cuda.so` contains one `LLAMA_MOE_ROUTED_FFN_POC` string. +> Default and opt-in canonical gates passed MoE md5 `8cb0ce23`, dense md5 +> `5951a5b4`, `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, and +> `MUL_MAT_ID 806/806`. Focused perf was neutral (`808.32 -> 804.87 us` at +> n=128, `1023.36 -> 1022.71 us` at n=257). Next phase may replace the helper +> internals with a real fused routed-FFN slice; do not claim Phase132 itself as +> a speedup. +> +> 2026-07-02 Phase133 result: keep only as a default-off structural base, not a +> speedup. Plan: +> `docs/superpowers/plans/2026-07-02-routed-ffn-sorted-down-phase133.md`. +> Artifact: +> `/home/mudler/bench/phase133_routed_ffn_sorted_down/20260702_074651`. +> Source exposes `ggml_cuda_mmq_ids_meta`, adds raw +> `ggml_cuda_mul_mat_q_moe_sorted_f32(...)`, and adds +> `LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1` on top of +> `LLAMA_MOE_ROUTED_FFN_POC=1`. The path executes baseline `gate_up` and +> `SWIGLU`, gathers the SWIGLU output into compact expert-sorted F32 rows, then +> calls raw MMQ down without fake tensors. Selected default, Phase132, and +> Phase133 gates passed `13/13`; Phase133 trace proved six +> `mmq_moe_sorted_raw` launches. 
Default and Phase133 canonical gates passed +> MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `GATED_DELTA_NET 48/48`, +> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Perf was not a win: +> default `807.37/1020.76 us`, Phase132 `808.21/1018.87 us`, Phase133 +> `808.85/1026.87 us` for `n=128/257`. Next phase must fuse SWIGLU-to-sorted +> or SWIGLU-to-quant to remove this added gather/quant boundary; do not promote +> sorted-down as-is. +> +> 2026-07-02 Phase134 result: keep only as default-off fused-SWIGLU structural +> base, not a speedup. Plan: +> `docs/superpowers/plans/2026-07-02-routed-ffn-fused-swiglu-phase134.md`. +> Artifact: +> `/home/mudler/bench/phase134_routed_ffn_fused_swiglu/20260702_075828`. +> Source adds `LLAMA_MOE_ROUTED_FFN_FUSED_SWIGLU=1` on top of +> `LLAMA_MOE_ROUTED_FFN_POC=1`, passes `gate/up` views into the routed-FFN +> helper, computes `silu(gate) * up` directly into expert-sorted F32 rows, and +> calls the raw sorted-F32 down MMQ helper. The fused flag now implies the +> sorted-down path; `LLAMA_MOE_ROUTED_FFN_SORTED_DOWN=1` is not required. +> Selected opt-in gates passed `13/13`; trace proved six `mmq_moe_sorted_raw` +> launches; canonical opt-in gates passed MoE md5 `8cb0ce23`, dense md5 +> `5951a5b4`, `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, and +> `MUL_MAT_ID 806/806`. Perf is mixed: default `804.92/1026.02 us`, Phase132 +> `808.00/1028.43 us`, Phase133 `808.07/1029.02 us`, Phase134 +> `810.61/1025.68 us` for `n=128/257`. It recovers n=257 but regresses n=128; +> next work must fuse SWIGLU directly into down-MMQ quant or remove another +> launch/buffer before this becomes a parity lever. +> +> 2026-07-02 Phase135 result: keep as current best default-off routed-FFN base, +> but not parity. Plan: +> `docs/superpowers/plans/2026-07-02-routed-ffn-fused-quant-phase135.md`. +> Focused artifact: +> `/home/mudler/bench/phase135_routed_ffn_fused_quant/20260702_081723`. 
+> Serving artifact: +> `/home/mudler/bench/phase135_routed_ffn_fused_quant_serving/20260702_082102`. +> Source adds `LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1` on top of +> `LLAMA_MOE_ROUTED_FFN_POC=1`, computes `silu(gate) * up` directly into the +> NVFP4 MMQ activation layout, and launches raw down MMQ via +> `ggml_cuda_mul_mat_q_moe_quantized(...)`. Focused selected gates passed +> `13/13`; trace proved six `mmq_moe_quantized_raw` launches and zero +> `mmq_moe_sorted_raw` launches; canonical focused gates passed MoE md5 +> `8cb0ce23`, dense md5 `5951a5b4`, `GATED_DELTA_NET 48/48`, +> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Focused perf: +> default `805.92/1031.06 us`, Phase134 `807.65/1027.51 us`, Phase135 +> `807.92/1024.97 us` for `n=128/257`. Serving at the Phase130 shape passed +> pre/post gates and improved decode aggregate t/s `326.9 -> 332.7`, while +> `mmq_nvfp4` dropped `6009.52 -> 5915.24 ms`; aggregate stayed `208.0`, prefill +> worsened `1519.6 -> 1475.1`, and total kernel time rose slightly +> `20.1559 -> 20.2498 s`. Next work should target remaining MoE overhead after +> fused quant (`mmq_fixup`, route/writeback, weighted combine), not another F32 +> intermediate. +> +> 2026-07-02 Phase136 result: reject and revert the separate post-down +> weighted-combine fuse. Plan: +> `docs/superpowers/plans/2026-07-02-routed-ffn-combine-phase136.md`. +> Focused artifact: +> `/home/mudler/bench/phase136_routed_ffn_combine/20260702_083727`. +> Serving artifact: +> `/home/mudler/bench/phase136_routed_ffn_combine_serving/20260702_085749`. +> The candidate added `LLAMA_MOE_ROUTED_FFN_COMBINE=1` on top of Phase135 and +> skipped the post-down `MUL(weights) -> VIEW* -> ADD*` tail with a separate +> F32 weighted-combine kernel. 
It was correctness-clean: expanded selected +> gates passed `20/20`, trace proved six combine markers plus six +> `mmq_moe_quantized_raw` launches and zero sorted launches, canonical gates +> passed MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `GATED_DELTA_NET 46/46`, +> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Focused full-tail perf +> improved (`MOE_SWIGLU_COMBINE n=257` `428.53 -> 401.81 us` versus Phase135), +> but serving regressed versus Phase135: aggregate/decode t/s +> `208.0/332.7 -> 206.5/323.2`. Source and the sentinel test were reverted; +> post-reject Phase135 selected gates passed `13/13`. Do not retry a standalone +> post-MMQ combine launch as the next parity lever; any combine/finalize work +> needs a larger serving-visible fused writeback/finalize design. +> +> 2026-07-02 Phase137 result: reject the GDN launch-geometry retune with no +> source changes. Plan: +> `docs/superpowers/plans/2026-07-02-gdn-geometry-sweep-phase137.md`. +> Focused artifact: +> `/home/mudler/bench/phase137_gdn_geometry_sweep/20260702_091441`. +> Serving artifact: +> `/home/mudler/bench/phase137_gdn_geometry_serving/20260702_091740`. +> The env-only sweep tested existing `GDN_NW`/`GDN_CPW` knobs. The best focused +> candidate, `GDN_NW=4 GDN_CPW=1`, improved 1-token GDN rows +> (`hc=32,hs=128,kda=0` `6.793748 -> 4.713682 us`, KDA +> `7.790557 -> 5.194275 us`, grouped broadcast `5.967364 -> 3.407998 us`), +> but real serving regressed versus Phase135 despite clean pre/post gates: +> MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and +> `MUL_MAT_ID 806/806`. Aggregate/decode t/s moved +> `208.0/332.7 -> 206.2/324.9`, total kernel time rose +> `20.2498 -> 20.7530 s`, and `gdn_core` worsened +> `5926.55 -> 6466.27 ms`. Do not promote or source-code a GDN geometry retune +> for this target. The next scoped source line is default-off MoE +> finalize/writeback inside the existing down-MMQ path, not a standalone +> post-MMQ combine launch. 
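A worked check of the Phase137 numbers, as a hedged sketch: `rel_change_pct` is a hypothetical helper, not part of the bench scripts, and the dictionary keys are labels invented for this example. The values are the before/after pairs quoted in the Phase137 paragraph above. The sketch shows why a large focused-row gain does not fund the change when the serving metrics move the wrong way.

```python
def rel_change_pct(before, after):
    """Signed percent change from before to after (negative means the value dropped)."""
    return (after - before) / before * 100.0

# Before/after pairs quoted in the Phase137 result (labels are illustrative).
phase137 = {
    "focused_gdn_row_us (lower is better)": (6.793748, 4.713682),
    "serving_agg_tps (higher is better)": (208.0, 206.2),
    "serving_decode_tps (higher is better)": (332.7, 324.9),
    "total_kernel_s (lower is better)": (20.2498, 20.7530),
    "gdn_core_ms (lower is better)": (5926.55, 6466.27),
}

for name, (before, after) in phase137.items():
    print(f"{name}: {rel_change_pct(before, after):+.2f}%")
```

The focused row drops by roughly 30%, while `gdn_core` in real serving grows by roughly 9% and both throughput numbers fall, which is the pattern the reject decision rests on.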
+> +> 2026-07-02 Phase138 attempt 1 update: keep the default-off finalize trace and +> full-tail sentinel scaffold; no runtime speedup claim yet. Plan: +> `docs/superpowers/plans/2026-07-02-moe-down-mmq-finalize-phase138.md`. +> Artifacts: +> `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_092943` +> (`MOE_SWIGLU_DOWN` trace-only), +> `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093617_full_tail` +> (new full-tail sentinel), and +> `/home/mudler/bench/phase138_moe_down_mmq_finalize_trace/20260702_093731_canonical` +> (canonical gates). The old `MOE_SWIGLU_DOWN` sentinel emitted six early +> routed-FFN records but no weighted tail. The new `MOE_SWIGLU_FINALIZE` +> sentinel passed default and Phase135-opt-in correctness (`7/7` each) and +> emitted six supported tail records with `tail_nodes=16`, `views=8`, and +> `adds=7`. Canonical patched-Phase93 gates passed MoE md5 `8cb0ce23`, dense +> md5 `5951a5b4`, `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Next work may +> implement default-off down-MMQ finalize/writeback against this sentinel first; +> keep serving promotion gated by Phase135 decode/aggregate/kernel-time +> thresholds. +> +> 2026-07-02 Phase138 attempt 2 update: keep the default-off down-MMQ +> finalize/writeback candidate as a narrow positive, but do not promote it or +> call parity. Plan: +> `docs/superpowers/plans/2026-07-02-moe-down-mmq-finalize-phase138.md`. +> Focused artifact: +> `/home/mudler/bench/phase138_moe_down_mmq_finalize/20260702_095927_focused`; +> canonical gates: +> `/home/mudler/bench/phase138_moe_down_mmq_finalize/20260702_100202_canonical`; +> serving: +> `/home/mudler/bench/phase138_moe_down_mmq_finalize_serving/20260702_100330`. +> The candidate adds `LLAMA_MOE_ROUTED_FFN_FINALIZE_POC=1` on top of Phase135, +> zeroes the final output, atomically accumulates `down_sum * router_weight` +> from the down-MMQ path, and skips the strict weighted tail only after the +> finalize helper is selected. 
Focused `MOE_SWIGLU_FINALIZE` correctness passed +> for default, Phase135, and Phase138 (`7/7` each); canonical and serving +> pre/post gates passed MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, +> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Serving versus Phase135 moved +> aggregate/decode t/s `208.0/332.7 -> 209.3/333.5`, total kernel time +> `20.2498 -> 20.0489 s`, and `mmq_nvfp4 5915.24 -> 5802.87 ms`; however +> `ew_add` remains visible at `374.09 ms`, so this is only an incremental +> default-off improvement. Next work should reduce the remaining fan-in/writeback +> path more deeply or return to the dominant `gdn_core`/`mmq_nvfp4` buckets. +> +> 2026-07-02 Phase139 result: serving noise-floor repeat rejects treating the +> Phase138 one-off serving gain as source-funding evidence. Spec: +> `docs/superpowers/specs/2026-07-02-serving-noise-floor-phase139-design.md`. +> Plan: +> `docs/superpowers/plans/2026-07-02-serving-noise-floor-phase139.md`. +> Artifact: +> `/home/mudler/bench/phase139_serving_noise_floor/20260702_081901`. +> Seven identical current-binary Phase138 serving/profile runs all passed +> pre/post gates: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, +> `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. The runtime variance was much +> larger than Phase138's one-off delta: aggregate throughput median +> `208.5 t/s`, stdev `2.8022`, CV `1.349%`, range `203.4..212.3`; wall CV +> `1.347%`; `mmq_nvfp4` CV `3.351%`. Keep Phase138 default-off as +> correctness-clean and focused-positive, but do not stack another +> finalize/MMQ micro-patch from serving evidence alone. Future serving claims +> need repeated A/B medians and must exceed `max(2.0%, 3 * same-binary stdev)`. +> The next source phase should pivot to a larger measured bucket, with GDN +> packed decode/prep now more defensible than another MoE finalize shortcut. +> +> 2026-07-02 Phase140 result: reject an immediate in-GDN Q/K +> L2-normalization patch. 
Spec: +> `docs/superpowers/specs/2026-07-02-gdn-decode-prep-trace-phase140-design.md`. +> Plan: +> `docs/superpowers/plans/2026-07-02-gdn-decode-prep-trace-phase140.md`. +> Artifact: +> `/home/mudler/bench/phase140_gdn_decode_prep_trace/20260702_085348`. +> The current Phase138 opt-in serving/profile shape passed pre/post gates: +> MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and +> `MUL_MAT_ID 806/806`. Serving/profile reported aggregate/decode throughput +> `207.3/328.9 t/s`, wall `39.501 s`, total kernel `20.2002 s`, `GDN +> 6673.66 ms`, `gdn_core 5890.44 ms`, and `gdn_l2norm 100.30 ms`. The focused +> SQLite summary had `l2_norm_f32 100.3024 ms` versus +> `gated_delta_net_cuda 5804.7074 ms`. This is above the absolute +> three-sigma floor from Phase139 (`53.433 ms`) but below the planned `3%` of +> GDN-core materiality threshold at about `1.7%`, so prep-only L2 fusion is not +> source-funded. Next GDN work should be recurrence-level, packed-state, or +> datacenter-Blackwell-specific, not another prep micro-fusion. +> +> 2026-07-02 Phase141 result: decode-only GDN source claims must normalize by +> launch count or tightly control the capture window. Spec: +> `docs/superpowers/specs/2026-07-02-gdn-decode-noise-floor-phase141-design.md`. +> Plan: +> `docs/superpowers/plans/2026-07-02-gdn-decode-noise-floor-phase141.md`. +> Artifact: +> `/home/mudler/bench/phase141_gdn_decode_noise_floor/20260702_090428`. +> Five identical current-binary decode-only captures all passed pre/post gates: +> MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, and +> `MUL_MAT_ID 806/806`. Raw `gdn_core_ms` median/stdev/CV was +> `1415.500/30.641/2.146%`, range `1410.300..1482.140 ms`, but launch counts +> drifted (`597`, `598`, `600`, `630`). Normalized `gdn_core_ms_per_launch` +> was stable: median/stdev/CV `2.359167/0.005399/0.229%`. 
Future GDN A/B
+> source claims need repeated medians and must beat either `6.49%` raw
+> `gdn_core` reduction or `2.0%` launch-normalized reduction. The small
+> default-off source follow-up now worth testing is scalar gate/beta hoisting
+> inside `gated_delta_net_cuda`; vLLM-style packed decode recurrence remains a
+> larger redesign.
+
+Audience: an agent with **zero prior context** who has been told to "continue the GB10 vLLM-parity investigation" on the `llama-cpp-localai-paged` backend.
+
+This file is the **operational how-to**. It is the companion to `VLLM_PARITY_FINAL.md`, which is the **why / authoritative record** ("never re-litigate"). If the two ever disagree on a *fact*, `VLLM_PARITY_FINAL.md` and the bench artifacts it cites win; this file wins on *procedure* (how to ssh, lock, build, bench, profile).
+
+Read order for a cold start:
+1. This file (TL;DR + hard gates + quickstart).
+2. `VLLM_PARITY_FINAL.md` (the closed record, every number cites its artifact).
+3. `.agents/vllm-parity-methodology.md` (the methodology: bit-exact gating, profile-don't-assume, both-engine ground truth).
+4. The patch-series `README.md` (~44 KB, canonical backend doc) and `PAGED_BITEXACT_NOTE.md`.
+
+---
+
+## 1. TL;DR STATE
+
+> 2026-07-01 Phase104-108 update: the current carried source line is still the
+> Phase93 Qwen3Next grouped Q/K broadcast plus the Phase101/102 default-off
+> cleanup candidates. Phase104/106 same-session serving showed the stack is
+> md5/op clean but still far from vLLM: at `N=128`, paged/vLLM was about
+> `0.66` on decode and `0.50-0.51` on aggregate; at `N=192/256`, vLLM remained
+> faster and TTFT stayed about `3x` lower. Phase105 refreshed the grouped-MMQ
+> trace and found no new host-side tile-policy lever. Phase107 proved the MoE
+> structural correctness gates exist (`MOE_SWIGLU_DOWN 7/7`,
+> `MOE_WEIGHTED_COMBINE 7/7`, `MUL_MAT_ID_RAGGED_MOE 6/6`) but also proved
+> `test-backend-ops perf` did not time those custom whole-graph cases.
Phase108 +> fixed that measurement gap in `tests/test-backend-ops.cpp`: perf mode now +> includes those MoE cases at `n_tokens=128,257`, and CSV output includes +> `time_us`, `flops`, `memory_kb`, and `n_runs`. The Phase108 artifact is +> `/home/mudler/bench/phase108_moe_perf_csv/20260701_221559`; md5s and compact +> op gates are green. Use Phase108 rows as the baseline for any fused routed-MoE +> implementation. Current ranking: `MUL_MAT_ID_RAGGED_MOE` is `1239-1446 us/run`, +> `MOE_SWIGLU_DOWN` is `802-1020 us/run`, and `MOE_WEIGHTED_COMBINE` is only +> `28-68 us/run`, so do not spend the next patch on weighted-combine fusion +> alone. +> Phase109 then tested existing env-gated routes on the Phase108 rows: +> `LLAMA_W4A16_PREFILL_M=128`, `LLAMA_FP4_PREFILL_M=128`, +> `LLAMA_MOE_DENSITY_MAX=9`, and `LLAMA_MOE_MMQ_X=64` +> (`/home/mudler/bench/phase109_existing_moe_prefill_ab/20260701_222559`). +> All selected correctness gates passed (`13/13` per env), but W4A16 and FP4 +> large-M regressed the 257-token rows badly, and density/tile retuning was +> noise-level on `MUL_MAT_ID_RAGGED_MOE` while not helping `MOE_SWIGLU_DOWN`. +> Do not spend another phase on MMQ tile-policy shortcuts. The next credible +> implementation is structural: port the vLLM-style idea of GPU-side +> token/expert routing metadata (`sorted_token_ids`, expert offsets/bounds, +> inverse permutation) into llama.cpp's `mul_mat_id` host-sync fallback/grouped +> W4A16 path, while leaving the graph-safe grouped-MMQ path untouched. +> Phase110 implemented the first slice of that structural path as default-off +> `LLAMA_MOE_GPU_SORT=1` in `ggml_cuda_mul_mat_id`, reusing the existing +> `mm_ids_helper` GPU sort for fallback metadata. The initial branch failed +> `3/13` selected opt-in rows because `mm_ids_helper` returns sorted-to-original +> `ids_dst`, while fallback `get_rows_cuda()` needs original-to-sorted +> `ids_from_sorted`; adding a tiny inverse-permutation kernel fixed correctness. 
+> Accepted artifact: +> `/home/mudler/bench/phase110_gpu_moe_sort/20260701_224446_fix1`. Gates are +> green: canonical MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 +> `5951a5b4d624ce891e22ab5fca9bc439`, and supported compact ops +> `SSM_CONV 45/45`, `SSM_CONV_SPLIT 6/6`, `GET_ROWS 49/49`, +> `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806` for both +> default and `LLAMA_W4A16_PREFILL_M=128 LLAMA_MOE_GPU_SORT=1`. Perf decision: +> keep as a default-off structural base only. It improves W4A16 fallback +> 257-token rows by `7.2%` (`MOE_SWIGLU_DOWN`) and `7.9%` +> (`MUL_MAT_ID_RAGGED_MOE`), but the opt-in fallback is still about `1.5x` +> slower than default grouped-MMQ. Phase111 must remove another fallback +> bottleneck, such as the remaining `expert_bounds` host copy / host tile +> descriptor build, before this line can matter for parity. +> Phase111 tested that narrow follow-up as default-off `LLAMA_W4A16_GPU_TILES=1`: +> W4A16 tile descriptors were built on GPU from `expert_bounds_dev` with an +> atomic tile counter. It was correctness-clean after fixing a pointer mutability +> compile error and a CUDA pool LIFO allocation bug, but clean perf was +> flat-to-negative (`MUL_MAT_ID_RAGGED_MOE n=257` regressed about `2.0%` versus +> Phase110 GPU-sort). Artifact: +> `/home/mudler/bench/phase111_w4a16_gpu_tiles/20260701_230400_fix1`. The +> Phase111 source was reverted, and post-revert W4A16+GPU-sort selected gates +> passed `13/13`. Do not reopen a standalone GPU tile descriptor cleanup; the +> next W4A16 attempt must remove a larger boundary, such as direct activation +> consumption plus GPU descriptors together, or bypass the host-sync fallback +> path entirely. +> +> 2026-07-01 active update: Phase50-59 reopened the dense and MoE serving +> scheduler question. 
+> True dense decode is much closer to vLLM (`383.66` vs `435.00` t/s, `88.2%`) +> than the Phase47 h2h aggregate suggested, while traced serving still shows +> no pure decode-only steps and high TTFT. Phase53 rejected static lower +> admission budgets; Phase54 histograms show prompt admission concentrated in a +> few large chunks (`prompt_hist=513+:12`) with mostly full-width decode +> (`decode_hist=128-255:53`). Phase55 implemented that targeted +> first-token A/B as `LLAMA_TTFT_PREFILL_FIRST=1`: on dense `n=128` it improved +> aggregate throughput `138.2 -> 142.9`, mean TTFT `23231.9 -> 21520.8 ms`, and +> wall `59.272 -> 57.323 s`, with md5/op gates green. Phase56 then showed the +> policy helps dense `n=32` but regresses MoE `n=128` mean TTFT +> `7168.1 -> 7615.3 ms` and aggregate `341.1 -> 339.9`; keep it opt-in only and +> do not default it broadly. Phase57 tried a per-step defer cap; cap32 improved +> MoE mean TTFT in one same-window sweep but still lost aggregate and wall, and +> dense caps lost aggregate. Phase58 added a prompt-backlog threshold; min32 +> improved MoE `n=128` aggregate `339.0 -> 341.9`, mean TTFT +> `7743.1 -> 7420.1 ms`, and wall `24.167 -> 23.950 s` in the same window, while +> dense `n=128` was mixed. Phase59 repeated MoE min32: aggregate stayed flat +> (`336.6 -> 336.9`), mean TTFT improved (`7798.5 -> 7167.8 ms`), and wall stayed +> flat (`24.334 -> 24.316 s`) with md5/op gates green. Matching vLLM was still +> far ahead (`601.3` aggregate, `2968.1 ms` mean TTFT), so min32 is an opt-in +> llama.cpp QoS knob, not a parity-closing lever. The trace and scheduler commits +> are local and DGX-gated but not pushed, so the LocalAI patch series has not +> been regenerated. +> +> 2026-07-01 Phase81-85 update: the next viable GDN lever is no longer launch +> shape or gather removal. 
A default-off Qwen35/Qwen35MoE BF16 persistent +> recurrent S-cache experiment (`LLAMA_QWEN35_GDN_S_CACHE_TYPE=bf16`) cut +> same-source decode-only `gdn_core` from `1399.30 ms / 599 launches` +> (`2.34 ms/launch`) to `863.57 ms / 720 launches` (`1.20 ms/launch`). Default +> F32 md5 gates and op gates stayed green, and BF16 dense md5 stayed canonical, +> but BF16 MoE md5 changed to `07db32c2bcb78d17a43ed18bc22705cd`. A quick +> MoE KL smoke vs the same-source F32 base showed KLD `0.055499 +/- 0.001705`, +> same-top-p `88.361%`, and PPL ratio `1.010356`. Phase82 then ran the full MoE +> f16-reference gate at +> `/home/mudler/bench/phase82_bf16_s_cache_f16_kl/20260701_183016`: same-source +> F32 measured KLD `0.136563 +/- 0.003242`, while BF16 S-cache measured +> `0.137162 +/- 0.003456` against the documented paged acceptance reference +> `0.136000 +/- 0.003285`. Reject promotion and do not run serving A/B for this +> candidate under the current hard KL rule. Phase83 then tested a bit-exact +> KDA `expf(g)` register-cache shortcut in the GDN CUDA core. Md5 and op gates +> stayed green, but same-window decode-only `gdn_core` moved +> `1399.46 -> 1405.62 ms`, so reject that micro-optimization too. Phase84 +> reduced in-place GDN op outputs to attention-only tensors and moved the CPU +> ids fallback scratch to workspace; md5/op gates stayed green and startup free +> CUDA memory improved `117472 -> 117855 MiB`, but same-window decode-only +> `gdn_core` moved `1399.72 -> 1407.38 ms`. Treat Phase84 as a possible +> memory-footprint cleanup only, not a speed parity lever. Phase85 added a +> graph-reuse-safe identity-contiguous recurrent-state fast path: it calls +> `ggml_gated_delta_net_inplace` on a direct state view when `s_copy_main` is +> identity, otherwise keeps the ids path. Md5/op gates stayed green, the +> `gdn_gather` fine bucket disappeared, GDN macro launches dropped +> `3600 -> 2980`, and same-window `gdn_core` moved `1412.33 -> 1400.34 ms`. 
+> Carry Phase85 only as a small cleanup candidate. Phase86 audited the producer +> fusion idea against the Phase85 node-traced profile before coding it: the +> whole `act/GDN-gate(shared)` macro is only `13.57 ms` of `3.6622 s`, beta +> sigmoid is `2.73 ms`, and CUDA already fuses `UNARY + MUL` for softplus, +> sigmoid, and SILU. Reject producer-only fusion as too small. Phase87 then +> exposed an env-gated `GDN_NW=4 GDN_CPW=8` decode geometry probe to test a +> vLLM-like `BV=32` tile shape. It was md5/op green, but same-source +> decode-only `gdn_core` regressed `1390.56 -> 1417.13 ms`, so the source line +> was reverted. Phase88 tried a first default-off `GDN_DECODE_PACK2=1` packed +> decode CTA kernel. It built and CUDA op tests stayed green, but canonical md5 +> failed for both MoE (`320b5ed...` vs `8cb0ce...`) and dense (`6a65e9...` vs +> `5951a5...`), with visible output corruption, so it was reverted without +> profiling. Phase89 tried to add that focused guardrail through +> `test_gated_delta_net_inplace_ids`, but selecting that test class directly +> already fails the pre-existing BF16 cases on CUDA, so the naive test addition +> was also reverted. Phase90 fixed the fixture root cause for identity ids by +> mirroring `state` into `state_dst` during initialization and added F32 +> `S_v=128`, `n_seqs=2` cases that return `concat(out,state_dst)`, so the +> backend comparator now checks both attention output and the side-effect state +> write. DGX CUDA selected-op gate is green (`4/4`). Use this Phase90 guardrail +> before any new packed-decode kernel, then still run canonical md5/op gates. +> Phase91 retried the default-off `GDN_DECODE_PACK2=1` CTA sequence-packing +> kernel under that guardrail. The first `n_seqs=2` guardrail passed but MoE md5 +> failed for the single-sequence completion gate, exposing an uncovered odd/single +> sequence PDL hazard. 
Moving inactive lanes past `ggml_cuda_pdl_sync()` and +> adding `n_seqs=1,3` guardrail cases made the candidate md5/op clean +> (`GATED_DELTA_NET 46/46`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`), but +> decode-only `gdn_core` regressed to `1425.44 ms`, so the runtime patch was +> reverted. Keep the expanded guardrail; do not retry CTA-level sequence packing +> unless it also reduces per-sequence GDN work. ids gather, producer overhead, +> simple geometry changes, and ungated packed kernels are not acceptable parity +> paths. Phase92 tried the next smallest scalar one-token recurrence +> micro-optimization: a default-off `GDN_SCALAR_DECODE_STORE_FUSED=1` CUDA path +> that stores final state inside the scalar update loop and skips the final +> post-token register-store loop. It passed local CPU guardrail, DGX CUDA +> guardrail, canonical md5s, `GATED_DELTA_NET 46/46`, `MUL_MAT 1146/1146`, and +> `MUL_MAT_ID 806/806`, but decode-only `gdn_core` regressed further to +> `1529.72 ms` (`/home/mudler/bench/phase92_gdn_scalar_store_fused/20260701_204718/decode_profile`), +> so the runtime patch was reverted. Do not retry store-fusing without evidence +> that the final state store loop is independently dominant. The next credible +> scoped ideas from the vLLM audit are the larger packed decode contract and the +> Qwen3Next GQA-repeat removal, each as a separate guarded phase. Phase93 +> implemented the Qwen3Next GQA-repeat removal as an explicit grouped Q/K +> broadcast mode on `GGML_OP_GATED_DELTA_NET` (`op_params[2]`), preserving the +> existing modulo/tiled broadcast for Qwen35 while allowing Qwen3Next to map +> `qk_head = value_head / (H_v / H_k)` and skip materializing repeated q/k heads +> when the GDN op path is active. 
Local CPU `GATED_DELTA_NET` passed `48/48`, +> local CPU in-place ids passed `6/6`, DGX CUDA `GATED_DELTA_NET` passed `48/48`, +> DGX CUDA in-place ids passed `6/6`, canonical md5/op gates passed +> (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`), and +> decode-only `gdn_core` improved to `1333.48 ms` +> (`/home/mudler/bench/phase93_qwen3next_gqa_bcast/20260701_211019/decode_profile`). +> Carry Phase93 as the current positive candidate. Phase94 then retested +> decode geometry on top of Phase93 with env-only `GDN_NW=8 GDN_CPW=8`. It +> stayed md5/op clean (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, +> `MUL_MAT_ID 806/806`) but decode-only `gdn_core` regressed to `1440.79 ms` +> (`/home/mudler/bench/phase94_gdn_geometry_phase93/20260701_211855/decode_profile_8x8`), +> so reject 8x8 and keep Phase93's default 16x8 geometry. Phase93 trace evidence +> also shows remaining producer-side GDN work is small (`l2_norm_f32 8.65 ms`, +> GDN gate/sigmoid about `12.75 ms`, remaining repeat `5.34 ms`), so the next +> useful lead should target recurrence work or a larger packed decode contract, +> not another small producer-only fusion. Phase95 tested a default-off +> `GDN_WARP_SCALAR_GATE=1` CUDA decode specialization on top of Phase93: lane 0 +> computed the scalar non-KDA gate and broadcast it within the warp for the +> one-token `S_v=128`, default `16x8` path. Local CPU guardrails passed +> (`GATED_DELTA_NET 48/48`, in-place ids `6/6`), DGX CUDA guardrails passed +> (`GATED_DELTA_NET 48/48`, in-place ids `6/6`), canonical md5/op gates passed +> (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`), but +> decode-only `gdn_core` regressed to `1402.40 ms` +> (`/home/mudler/bench/phase95_gdn_warp_scalar_gate/20260701_213311/decode_profile`). +> The runtime patch was reverted. Do not retry scalar-gate warp broadcast unless +> a future profile shows SFU pressure, rather than recurrent state traffic or +> reductions, dominating the GDN core. 
Phase96 then tested the narrow +> conv-state identity fast path suggested by the trace audit: when +> `s_copy_main` was identity, `build_conv_state_fused` viewed the active +> conv-cache slots directly and called `ggml_ssm_conv_update_inplace` instead of +> the ids variant. Local CPU `SSM_CONV` passed `45/45`; DGX CUDA `SSM_CONV` +> passed `45/45`; canonical gates passed (`SSM_CONV 45/45`, +> `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`, md5s +> canonical). Decode-only profile regressed to total kernel `3.6723 s`, +> `gdn_core 1406.57 ms`, and `gdn_conv 70.42 ms` +> (`/home/mudler/bench/phase96_conv_identity_fastpath/20260701_214141/decode_profile`). +> The runtime model-graph patch was reverted. Do not retry the conv identity +> branch as a speed lever unless a same-window trace proves the ids variant is +> independently dominant. Phase97 then measured the carried Phase93 stack in an +> end-to-end `n=128`, `PTOK=128`, `GEN=64`, `PARALLEL=128` serving snapshot +> against vLLM. Pre/post canonical gates stayed green. Paged Phase93 measured +> `agg_tps 329.6`, `decode_agg_tps 669.8`, `prefill_tps 1734.5`, +> `ttft_mean_ms 7415.4`, `wall_s 24.851`; vLLM measured `agg_tps 664.8`, +> `decode_agg_tps 1029.4`, `prefill_tps 5271.8`, `ttft_mean_ms 2519.5`, +> `wall_s 11.929` +> (`/home/mudler/bench/phase97_phase93_serving_snapshot/20260701_214648`). +> Phase93 therefore remains a decode-profile positive candidate, but it does not +> close serving parity (`paged_decode_over_vllm=0.6507`). The next useful phase +> needs a larger serving-impact lever; isolated GDN/conv micro-optimizations +> have now repeatedly failed to move live serving enough. Phase98 profiled that +> carried Phase93 serving window with graph-node CUDA tracing. Pre/post gates +> stayed green. 
Total kernel time was `20.0411 s`; macro buckets were GDN +> `6679.96 ms` (`33.33%`), MoE/FFN-GEMM `6034.52 ms` (`30.11%`), +> bf16/fp8-proj `2766.06 ms` (`13.80%`), and layout-copy `1257.60 ms` +> (`6.28%`). Fine buckets were led by `gdn_core 5892.99 ms` (`29.40%`) and +> `mmq_nvfp4 5809.55 ms` (`28.99%`), followed by `convert_dtype 663.45 ms`, +> `gdn_conv 457.11 ms`, and `concat_layout 430.25 ms` +> (`/home/mudler/bench/phase98_phase93_serving_profile/20260701_215715`). +> This re-ranks the next work: do not spend more time on scalar GDN, conv +> identity, or gather-only shortcuts. Either attribute and remove a proven +> material layout-copy node, or pursue a larger GDN-core/MMQ serving lever with a +> standalone PoC gate. Phase99 then used the existing default-off +> `LLAMA_LAYOUT_TRACE` hook on the same Phase93 serving profile shape +> (`N=128`, `PTOK=128`, `GEN=64`, `PARALLEL=128`). Trace-enabled gates stayed +> green (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`, +> canonical MoE/dense md5s). Serving remained comparable (`total kernel +> 20.2408 s`, `layout-copy 1269.35 ms`). The trace attributed +> `concat_layout 440.01 ms` almost entirely to +> `conv_input-* = concat(conv_states_reshaped-*, qkv_mixed_transposed-*)` before +> `SSM_CONV`; `copy_layout 119.16 ms` includes `conv_state_update-*` writeback. +> The larger `convert_dtype 662.34 ms` bucket is mostly unnamed F32-to-F16 `CPY` +> rows and needs stronger attribution before coding. Decision: Phase99 is +> measurement-only; do not retry the Phase96-style conv-state identity branch. +> The only conv-side patch worth funding is a larger two-source `SSM_CONV` +> contract that reads `(conv_states, qkv_mixed)` as a logical concat, or else +> extend trace attribution for the unnamed `convert_dtype` bucket first +> (`/home/mudler/bench/phase99_layout_trace/20260701_200835/serving_profile`). +> Phase100 extended that trace with `dst_view`, `src0_view`, and `src1_view` +> names. 
The trace-only patch built locally and on DGX, and trace-enabled gates +> stayed green (`GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, +> `MUL_MAT_ID 806/806`, canonical MoE/dense md5s). Serving stayed comparable +> (`total kernel 20.3464 s`, `convert_dtype 661.73 ms`, `concat_layout +> 438.15 ms`). The new fields identify a concrete `convert_dtype` source: +> `GET_ROWS` reads F16 `cache_k_l*` / `cache_v_l*` into F32 `node_*`, then +> `CPY` downcasts views such as `src0_view=node_358` / `node_365` to F16 +> attention-shaped tensors. This repeats across attention layers +> (`cache_k_l3/v_l3`, `cache_k_l7/v_l7`, `cache_k_l11/v_l11`, ...). Some F32->F16 +> rows remain unnamed, so the next runtime phase should be a narrow K/V cache +> get_rows dtype A/B, not a broad layout rewrite +> (`/home/mudler/bench/phase100_layout_view_trace/20260701_201800/serving_profile`). +> Phase101 implemented that narrow A/B as default-off +> `LLAMA_PAGED_KV_GET_ROWS_F16=1`: add `ggml_get_rows_type`, support CPU F16 +> source -> F16 destination row copy, and use typed F16 `GET_ROWS` only for +> paged K/V gather when the cache tensor is F16. Local and DGX builds completed; +> CUDA `GET_ROWS` passed `49/49` including the new F16-output cases; default and +> opt-in md5/op gates stayed green (`GET_ROWS 49/49`, `GATED_DELTA_NET 48/48`, +> `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`, canonical MoE/dense md5s). +> Serving profile under opt-in measured `total kernel 20.1989 s`, `agg_tps +> 206.4`, `decode_agg_tps 328.0`, and `ttft_mean_ms 8211.1`. It reduced +> `copy_layout 116.25 -> 80.32 ms` and macro `layout-copy 1262.58 -> 1220.30 ms` +> versus Phase100, but `convert_dtype` stayed flat (`661.73 -> 661.35 ms`) and +> serving throughput did not improve. Carry Phase101 only as a small default-off +> cleanup candidate pending repeat A/B; do not promote it as a parity lever +> (`/home/mudler/bench/phase101_kv_get_rows_f16/20260701_203930/serving_profile`). 
+> Phase102 then implemented the funded two-source `SSM_CONV` contract as +> default-off `LLAMA_SSM_CONV_SPLIT=1`: `ggml_ssm_conv_split(ctx, conv_states, +> x_cur, conv_kernel)` reuses `GGML_OP_SSM_CONV`, reads +> `[K-1,channels,n_seqs]` cached taps plus native `[channels,n_tokens,n_seqs]` +> qkv tokens as a logical concat, and is wired into Qwen3Next/Qwen35/Qwen35MoE +> only for multi-token, non-rollback batches with `n_seq_tokens >= K-1`. The +> initial semantic test exposed a harness issue (`split-base` has an exactly +> zero CPU reference, so normalized MSE reported `ERR=inf`); direct split +> CUDA-vs-CPU passed `6/6`, and the final test keeps `split-base` with absolute +> max error. Local and DGX builds passed; default, standalone opt-in, and +> serving pre/post gates stayed green (`SSM_CONV 45/45`, `SSM_CONV_SPLIT 6/6`, +> `GET_ROWS 49/49`, `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, +> `MUL_MAT_ID 806/806`, canonical MoE/dense md5s). Opt-in serving measured +> `total kernel 19.5482 s`, `agg_tps 206.1`, `decode_agg_tps 320.0`, +> `prefill_tps 1538.0`, and `ttft_mean_ms 7928.4`. It removed the traced concat +> materialization (`concat_layout 433.13 -> 4.59 ms` versus Phase101 and +> `layout-copy 1220.30 -> 826.87 ms`), but live serving throughput still did not +> improve. Carry Phase102 as a default-off cleanup/follow-up base only; do not +> promote it as parity-closing without a repeat A/B or an additional state-update +> fusion. The remaining high-value targets are still `gdn_core`, `mmq_nvfp4`, or +> a larger serving scheduler/packed-decode contract +> (`/home/mudler/bench/phase102_ssm_conv_split/20260701_210907/serving_profile`). +> Phase103 measured Phase101+Phase102 together, with no new source changes: +> `LLAMA_SSM_CONV_SPLIT=1 LLAMA_PAGED_KV_GET_ROWS_F16=1`. 
Standalone and +> serving pre/post gates stayed green (`SSM_CONV 45/45`, `SSM_CONV_SPLIT 6/6`, +> `GET_ROWS 49/49`, `GATED_DELTA_NET 48/48`, `MUL_MAT 1146/1146`, +> `MUL_MAT_ID 806/806`, canonical MoE/dense md5s). Combined serving improved +> over Phase102 (`agg_tps 206.1 -> 212.3`, `decode_agg_tps 320.0 -> 331.5`, +> `prefill_tps 1538.0 -> 1569.1`, `wall_s 39.743 -> 38.575`) and reduced +> `layout-copy 826.87 -> 798.52 ms`; it also preserved most of the split +> SSM_CONV concat removal and recovered the F16 K/V `copy_layout` reduction +> (`copy_layout 112.53 -> 78.22 ms`). This proves the two cleanup candidates are +> compatible, but not parity-closing: `gdn_core 5930.47 ms` and `mmq_nvfp4 +> 6001.77 ms` still dominate. Carry the combined env as the cleanup comparison +> baseline; do not rerun isolated layout cleanup unless it changes a larger +> serving contract +> (`/home/mudler/bench/phase103_combined_layout_cleanups/20260701_211821/serving_profile`). +> Phase104 then measured that combined cleanup stack in the normal same-session +> serving harness against vLLM at `N=128`, `PTOK=128`, `GEN=64`, +> `PARALLEL=128`. Pre/post gates stayed green with the same expanded op set and +> canonical md5s. Paged combined measured `agg_tps 338.6`, +> `decode_agg_tps 675.8`, `prefill_tps 1813.0`, `ttft_mean_ms 7121.6`, and +> `wall_s 24.196`; vLLM measured `agg_tps 661.1`, `decode_agg_tps 1028.0`, +> `prefill_tps 5208.7`, `ttft_mean_ms 2572.3`, and `wall_s 11.980`. This is a +> small serving improvement over Phase97 (`agg_tps +2.73%`, `prefill_tps +> +4.53%`, `TTFT -3.96%`), but still not parity: `paged_decode_over_vllm=0.6574` +> and `paged_agg_over_vllm=0.5122`. Carry the combined cleanup stack as the best +> current comparison baseline. The next useful phase must attack a larger +> serving-impact contract or the dominant GDN/MMQ buckets, not more isolated +> layout-copy cleanup +> (`/home/mudler/bench/phase104_combined_serving_snapshot/20260701_212551`). 
+> Phase105 refreshed grouped-MMQ evidence on that current stack without source +> changes. `MUL_MAT_ID_RAGGED_MOE` stayed green both default and trace-enabled +> (`6/6`), full `MUL_MAT_ID` stayed green (`806/806`), and the live serving +> retry returned a non-empty response while recording `120` shape and launch +> lines. The live sample was prefill-like (`ncols_max=317`, density `10`, +> `mmq_x_best=112`, `stream_k=1`) with no small-M lines; all launches had +> `fixup=0`, `stream_k_blocks == ntiles_dst`, and efficiency `100`. This +> confirms the current cleanup stack did not open a new cheap MMQ shortcut. +> Do not add another host-side MMQ tile policy; only revisit MMQ for a +> genuinely structural kernel or serving-contract change +> (`/home/mudler/bench/phase105_mmq_current_shape/20260701_214129_serving_retry`). +> Phase106 tested the remaining low-conflict C1 operating-point hypothesis on +> the current stack: same-session `N=128/192/256` with `PARALLEL=256`, +> `VLLM_MAX_NUM_SEQS=256`, and the combined cleanup env. Pre/post gates stayed +> green with canonical md5s and the expanded op set. vLLM completed all legs and +> stayed ahead: at `N=256`, paged measured `agg_tps 338.4`, +> `decode_agg_tps 824.6`, `ttft_mean_ms 14933.5`, while vLLM measured +> `agg_tps 723.8`, `decode_agg_tps 1320.4`, `ttft_mean_ms 4999.0`. Reject C1 +> for the current GB10 stack. The next source phase should be structural +> persistent-batch/fused-MoE/GDN work, not another scheduler shortcut +> (`/home/mudler/bench/phase106_max_concurrency_current_stack/20260701_214907`). +> Phase107 established the fused-MoE structural guardrail surface before coding: +> `MOE_SWIGLU_DOWN 7/7`, `MOE_WEIGHTED_COMBINE 7/7`, and +> `MUL_MAT_ID_RAGGED_MOE 6/6` passed on CUDA0. However, +> `test-backend-ops perf` did not provide usable timing rows for these custom +> whole-graph cases; the broad `MUL_MAT_ID` perf CSV reported support metadata +> only. 
The next source patch should be measurement-only: add a narrow MoE
+> fusion timing harness with explicit GPU synchronization and CSV timing before
+> funding any fused routed-MoE kernel
+> (`/home/mudler/bench/phase107_moe_fusion_guardrail/20260701_220227`).
+
+- Historical verdict: the older investigation marked GB10 parity **CLOSED** and
+  unreachable. Treat that as superseded where Phase50-54 provide newer dense
+  serving evidence.
+- **Prefill** is a genuine floor at **~36% (MoE) / ~43% (dense)** of vLLM. Prefill is **not** CUDA-graph-replayed, so these numbers are real, not measurement artifacts.
+- **Decode** is **near-parity: ~86% of vLLM's TRUE GPU-steady decode** (924 vs 1078 t/s). The long-standing **~56% headline was a CUDA-graph measurement artifact** (nsys without `--cuda-graph-trace=node` collapses each graph replay into one opaque launch). Decode is also **ahead of vLLM at low concurrency** (dense 116.7% at N=8) and uses **1.5-3x less memory**, bit-exact per-path.
+- The lever search was **exhaustive**: every attempt (prefill GEMM, GDN chunked scan, decode fusions, serving/scheduler) is recorded with its verdict and number so it is **not re-run**.
+- **The path to parity is different hardware: datacenter Blackwell** (B200, HBM, native tcgen05 / CUTLASS FP4). Do NOT reopen GB10 kernels. Re-run the methodology on the new silicon, where vLLM's GB10-losing FLA/Marlin kernels invert.
+
+---
+
+## 2. THE HARD GATES YOU MUST NOT VIOLATE
+
+These are non-negotiable. Violating any of them invalidates the result or the contribution.
+
+### 2.1 The per-path greedy-md5 bit-exact gate (sacred)
+The gate is **per-path**: paged vs non-paged attention legitimately produce different (equivalent) FP-reduction orders. Each path is gated against **its own** reference, validated benign by KL-divergence to the f16 reference.
Canonical greedy md5s:
+
+| Path | Model | Canonical md5 |
+|---|---|---|
+| non-paged | MoE q36-35b-a3b-nvfp4 | `07db32c2bcb78d17a43ed18bc22705cd` |
+| **paged** | MoE q36-35b-a3b-nvfp4 | `8cb0ce23777bf55f92f63d0292c756b0` |
+| non-paged | dense q36-27b-nvfp4 | `5951a5b4d624ce891e22ab5fca9bc439` |
+| paged | dense q36-27b-nvfp4 | `5951a5b4d624ce891e22ab5fca9bc439` (bit-exact to non-paged) |
+
+- **Compare paged-to-paged only.** Future paged-MoE regressions compare to `8cb0ce23`, NOT `07db32c2`.
+- **Why paged-MoE differs (benign, KL-validated):** `llama-perplexity --kl-divergence` on the MoE GGUF (16 chunks, f16 base PPL 7.3734) shows non-paged-vs-f16 KLD 0.136597 and paged-vs-f16 KLD 0.136000, i.e. paged does NOT diverge from f16 ground truth more than non-paged does. Paged and non-paged are two equivalent FP-reorderings of the same 4-bit model. This holds on the 0028 baseline and with `LLAMA_MOE_FORCE_GRAPHS`/0029 on or off, so it is a property of the paged path, not any one lever.
+- **Every bit-exact patch is gated two ways:** greedy md5 (per path) AND `test-backend-ops` vs the CPU oracle for every touched op.
+
+### 2.2 The KL-gate for opt-in lossy paths
+Any path that is NOT byte-identical (e.g. 0033 dequant-bf16, the 0034/0035 large-M FP paths, FP8-KV) ships **default-off** and is gated by a **KL-divergence band**: it requires `KLD(new||f16) <= KLD(FP4-MMQ||f16)` and PPL within the established band. Lossy levers never ship default-on.
+
+### 2.3 In-backend A/B is the only proof (hard methodology rule)
+A lever compiled into the binary is **NOT** isolated by a runtime flag alone. It needs a **separately-built in-backend A/B**. Precedents that burned this in: 0031 chunking math was correct yet -22% in-backend; 0034 had a standalone PoC win that did not hold in-backend.
+
+### 2.4 Contribution / commit gates (LocalAI policy)
+- **DCO sign-off is human-only:** do not add an AI `Signed-off-by` trailer.
+- **AI attribution via `Assisted-by:` trailer:** `Assisted-by: Codex:gpt-5`.
+- **NEVER add `Co-Authored-By:` (AI) trailers** and never add an AI `Signed-off-by`.
+- **No em-dashes** anywhere in output (use `-`, `:`, parentheses, or rephrase).
+- **Ask before every `git push`.** Prior approval does not carry over.
+
+### 2.5 Fork-first is MANDATORY (the fork is canonical)
+- The **canonical source of truth is the fork branch `mudler/llama.cpp:localai-paged`** (= pin commit + paged patch commits in order). It is canonical for ALL paged-backend kernel/patch work. The shipped `patches/paged/*.patch` series is a **derivative**: the fork is the source.
+- **Always update the fork FIRST, in this exact order:** (1) commit the change on the `localai-paged` branch and **push it**, then (2) regenerate the LocalAI series (`backend/cpp/llama-cpp-localai-paged/patches/paged/`) from the fork via `git format-patch` (one patch per fork commit, source-only, never touching a `*.md`/dev-doc), so the series stays a **1:1, drift-free mirror** of the branch. No hand-export.
+- **NEVER edit the LocalAI `patches/paged/*.patch` files directly**, and **NEVER add a patch to the series with no corresponding fork-branch commit.** They are generated output, not source.
+- The fork branch is also **where the build and the per-path bit-exact md5 gate actually run**, so it is the **only** place a change is truly validated. A patch that lives only in the LocalAI series has never been built or gated.
+- **Mirror invariant (verify by tree hash):** applying the full on-disk series on the pin must reproduce the fork branch tree byte-for-byte. The series has **intentional gaps** (missing 0005, 0026, 0027, 0032, 0036-0039, 0045), so the patch count is not the max number; what must hold is the tree-hash equality, not the count.
Current verified state: fork HEAD `2d590d770` is mirrored by worktree patch `0063-feat-cuda-trace-cublas-tensor-names.patch`; applying all `54` patch files on `0ed235ea2c17a19fc8238668653946721ed136fd` produces tree `dedb1182910eafe9f6875588dc8285bfb544cce5`, exactly matching the fork.
+
+### 2.6 Bench hygiene gates
+- **NEVER set `LLAMA_MAX_BATCH_TOKENS` in benches** (the harness explicitly logs "NO LLAMA_MAX_BATCH_TOKENS").
+- Do **not** set `GDN_TC`, `GDN_CHUNK_MIN`, or `LLAMA_PAGED_DECODE_STABLE` in parity benches. Production defaults are compiled in: **GDN M5 on (`GDN_TC=5`, `GDN_CHUNK_MIN=64`), S1 decode-graph on, S3 off.**
+- **Decode profiling MUST use `nsys --cuda-graph-trace=node`** (see section 3.4). This is a gate, not a suggestion.
+
+---
+
+## 3. OPERATIONAL QUICKSTART (copy-pasteable)
+
+### 3.0 Host
+```
+ssh dgx.casa  # resolves to hostname promaxgb10-4ad8; GPU = NVIDIA GB10 (unified LPDDR5x, ~273 GB/s, the bandwidth floor)
+```
+`nvidia-smi` reports memory as `[N/A]` (unified memory). CUDA 13 / sm_121.
+
+### 3.1 GPU lock protocol (`~/gpu_bench_lock`) - TWO conventions, reconcile carefully
+There are two conventions in flight:
+- **Old harnesses** (`combined_definitive.sh`, `fuse_validate.sh`, `fuse_profile.sh`) treat it as an **empty mutex dir**: `mkdir ~/gpu_bench_lock` to acquire, `rmdir` to release.
+- **Newer harnesses** (`fp4norm_profile.sh`) use an **owner-file convention**: `mkdir -p ~/gpu_bench_lock` then `echo "$ME $(date +%s)" > ~/gpu_bench_lock/owner`. They poll until `nvidia-smi --query-compute-apps=pid` count is 0 AND `owner` is `FREE*`/absent for 2 consecutive checks, and clear a stale `~/gpu_bench_lock/release` file. Release **writes** `FREE released-by-... $(date +%s)` to `owner` (it does NOT remove the dir).
+
+Because the dir now permanently contains an `owner` file, **release with `rm -rf ~/gpu_bench_lock`, NOT `rmdir`** (rmdir fails on the non-empty dir). Recommended procedure for a future agent:
+1.
Read `~/gpu_bench_lock/owner`. `FREE*`/absent + 0 compute-apps means free.
+2. Acquire via `mkdir -p ~/gpu_bench_lock` + write `owner`.
+3. Release by writing `FREE ...` to `owner` (or `rm -rf ~/gpu_bench_lock`).
+
+A separate 0-byte `~/bench/gpu.lock` is legacy/unrelated - ignore.
+
+**Always gate on ALL THREE** before benching or building on DGX: `nvidia-smi --query-compute-apps=pid` count == 0, `owner` FREE, and `docker ps` shows no running containers. In particular, do not start work while a `local-ai-worker` container is running. Concurrent jobs share this GPU: an offline-repack Marlin workflow, an `~/.cache/autoresearch-quant/` quant pipeline (this is the `llama-imatrix` class of job), finetune trees, and LocalAI worker containers. The canonical harnesses poll for GPU-idle up to 2h.
+
+### 3.2 Build (long; run detached + poll)
+- **Mainline / canonical grpc-server + binaries: CUDA arch `121`** (`-DCMAKE_CUDA_ARCHITECTURES=121`). Runtime banner shows `ARCHS = 1210 | BLACKWELL_NATIVE_FP4 = 1`.
+- **FP4-MMA / tensor-core experimental kernels: the accelerated `121a` gencode** (`arch=compute_121a,code=[compute_121a,sm_121a]`). The `a` suffix unlocks the architecture-specific native FP4-MMA intrinsics (GB10 itself has no tcgen05/TMEM; see section 4.1). `121a` lives ONLY in the DGX experimental build scripts (`~/gdn_cc.sh` standalone nvcc, `~/gdn_bv_build.sh` `-DCMAKE_CUDA_ARCHITECTURES=121a`, `~/paged-build.sh` `--build-arg CUDA_DOCKER_ARCH=121a`), not in the worktree build files. Supply it at build time via `CMAKE_CUDA_ARCHITECTURES` / `CUDA_DOCKER_ARCH`.
+- **Long builds: run detached and poll for a marker.** Pattern: `nohup ... > build.log 2>&1 &` then poll for a `.DONE`/`.done` file. Do NOT block a foreground shell.
+
+Built binaries live at `dgx:~/llama-paged-dev/build-cuda/bin/` (`llama-server`, `llama-batched-bench`, `llama-completion`; thin ~70 KB dynamic wrappers).
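
The detached-build pattern from section 3.2 can be sketched as two small shell helpers. This is a minimal sketch under stated assumptions, not code from the DGX harnesses: the function names `run_detached` and `wait_for_marker`, the marker path, and the timeout values are hypothetical; only the `nohup ... > build.log 2>&1 &` shape and the `.DONE` marker convention come from the text above.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (names are illustrative, not taken from the DGX scripts).

# run_detached LOG MARKER CMD [ARGS...]
# Starts CMD in the background under nohup, sends stdout+stderr to LOG,
# and creates MARKER only if CMD exits successfully.
run_detached() {
  local log=$1 marker=$2
  shift 2
  # Inside the inner shell: $0 is the marker path, "$@" is the command.
  nohup sh -c '"$@" && touch "$0"' "$marker" "$@" > "$log" 2>&1 &
}

# wait_for_marker MARKER TIMEOUT_S [INTERVAL_S]
# Polls until MARKER exists. Returns 0 when it appears, 1 on timeout.
wait_for_marker() {
  local marker=$1 timeout_s=$2 interval_s=${3:-30} waited=0
  while [ ! -e "$marker" ]; do
    if [ "$waited" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
  return 0
}
```

A call such as `run_detached build.log build.DONE ./my-build.sh` followed by `wait_for_marker build.DONE 7200 60` would keep the foreground shell free while the build runs (`./my-build.sh` stands in for whichever build script is being used). Because the marker is written only on success, a timeout means "failed or still running", so the log should be inspected before retrying.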
+
+### 3.3 The standard bench env + commands
+```
+cd /home/mudler/llama-paged-dev/build-cuda/bin
+L="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1"  # GGML_NO_BACKTRACE is log-hygiene, not a lever
+MOE=/home/mudler/bench/q36-35b-a3b-nvfp4.gguf    # arch qwen35moe, ~22.2 GiB
+DENSE=/home/mudler/bench/q36-27b-nvfp4.gguf      # arch qwen35, ~17.5 GiB
+
+# (1) Bit-exact / coherence gate. stdin MUST be /dev/null or it hangs in conv mode.
+env $L ./llama-completion -m "$MOE" -ngl 99 -fa on -c 4096 --temp 0 --seed 1 -n 48 -no-cnv \
+  -p "The capital of France is" < /dev/null
+
+# Start the paged server in the background, logging to paged_server.log
+# (the launch flags are not recorded here):
+#   ... > /home/mudler/bench/paged_server.log 2>&1 &
+# poll http://127.0.0.1:8090/health for '"ok"', then:
+python3 /home/mudler/bench/h2h_cli3.py   # OpenAI /v1/completions, ignore_eos, fresh-nonce, ptok128 gen128, NPL sweep 8/32/128/256
+```
+**vLLM side** (for both-engine parity): `~/vllm-bench/bin/vllm` (version **0.23.0**), served `gpu-util 0.85 max-model-len 4096 max-num-seqs 256 tp1`, models `~/bench/q36-35b-a3b-nvfp4-vllm/` and `~/bench/q36-27b-nvfp4-vllm/`.
+
+**Current-stack serving snapshots use `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`.** It targets the clean `~/llama-phase6-source` mirror, checks docker/`local-ai-worker`/GPU-idle state, uses the owner-file lock, runs pre/post inference gates, then compares paged and vLLM with the same h2h client. The older `dgx:~/bench/combined_definitive.sh` is historical: do not reuse it without first porting away from stale `~/llama-paged-dev` paths and old lock assumptions.
+The harness also writes `hardware.txt` before any server starts, including under
+`DRY_RUN=1`, so every new snapshot records the GPU model, driver, compute
+capability when exposed by `nvidia-smi`, and a conservative hardware class.
+Full runs also write `gate_summary.tsv` after the post gate, summarizing pre/post
+MoE md5, dense md5, and backend-op checks; use
+`paged-current-serving-snapshot.sh --summarize-gates ART` to backfill or audit an
+existing snapshot without starting servers.
+
+### 3.4 THE DECODE-PROFILING RULE (this trap caused 4 wrong analyses)
+Decode runs as a **replayed CUDA graph**. `nsys` **without** `--cuda-graph-trace=node` collapses each graph replay into ONE opaque launch, so every per-kernel attribution becomes an artifact. This is exactly what made the old "paged 159 us/tok, GPU ~16% busy, host-bound, 5.4x more GPU-efficient" story wrong, and produced the wrong ~56% headline.
+
+Mandatory method for any decode profile:
+- Use **`nsys --cuda-graph-trace=node`**.
+- Decompose with the **difference method**: decode cost = (ntg=64 profile) - (ntg=16 profile), which cancels prefill and fixed setup and leaves the 48 extra decode steps; divide by 48 for a per-step figure.
+
+Under the correct method, paged decode at npl=256 is **99% GPU-busy (1.4% idle), NOT host-bound** - the opposite of the collapsed-graph reading. The clean graph-node-traced profiles are at `~/highN_prof2/*.nsys-rep` (paged, npl=256) and `~/highN_vllm/*.nsys-rep` (vLLM), captured 2026-06-30. They **supersede every earlier decode decomposition.**
+
+### 3.5 Models + artifacts (all on DGX)
+GGUF (paged): `~/bench/q36-35b-a3b-nvfp4.gguf` (MoE, qwen35moe), `~/bench/q36-27b-nvfp4.gguf` (dense, qwen35). vLLM safetensors: `~/bench/q36-35b-a3b-nvfp4-vllm/` (has `hf_quant_config.json` confirming MIXED_PRECISION / FP8-proj), `~/bench/q36-27b-nvfp4-vllm/`.
+Authoritative run: `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, per-engine `COMBINED_*_server.log`). A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`. NOTE: the `*_RESULTS*`/`*_MAP*` docs live only in the worktree `docs/`, not on the DGX.
+
+---
+
+## 4. THE COMPLETE LEVER MAP (do NOT re-run the rejected ones)
+
+Verdicts and numbers are from `VLLM_PARITY_FINAL.md` + the cited artifacts.
"BE" = greedy-md5 bit-exact; "KL-benign" = lossy path inside the KL band.
+
+### 4.1 Prefill weight-GEMM track - WHOLE TRACK REJECTED (FP4-MMQ is optimal on GB10)
+Decisive surprise: on sm_121 **vLLM itself does NOT run native FP4** - it runs **Marlin W4A16** (FP4 dequant->bf16 in-register + bf16 GEMM) for experts and FP8 projections, capped at ~half FP4 peak, because native CUTLASS NVFP4 grouped-GEMM is broken on consumer Blackwell (TMA-WS init failure, CUTLASS #3096; no tcgen05/TMEM). So MMQ's native FP4 is already structurally competitive here.
+
+| Lever | What | Verdict | Key number |
+|---|---|---|---|
+| 0033 dequant->bf16 cuBLAS | route large-M NVFP4 dense GEMM to dequant->bf16 cuBLAS | REJECTED, ships default-off | dense S_PP -49%/-42%/-29% at M=512/1024/2048; BE + KL-better |
+| dense-cuBLAS reroute (full) | same across dense+MoE prefill | REJECTED | -31% to -62% band |
+| 0034 native FP4-MMA W4A4 | Blackwell `mxf4nvf4` OMMA large-M | REJECTED in-backend | PoC 103 TFLOP/s (57.7% FP4 peak, NMSE=0) but win did not hold in-backend |
+| 0035 W4A16-Marlin grouped MoE | FP4->bf16 in-register + bf16 mma, zero act-quant tax | REJECTED (perf) | correct + KL-benign-and-better but **-39%** S_PP vs MMQ |
+| 0045/0046 offline-repack / vLLM-verbatim Marlin | repack to Marlin layout; port vLLM kernel verbatim | REJECTED | verbatim correct but -39%; offline-repack same bf16-peak ceiling, no win |
+
+Why it loses: bf16 TC peak on GB10 is ~half FP4 peak, so any dequant->bf16 kernel caps at ~half FP4-MMQ; the dequant write is an un-amortized weight-sized memory pass (~8x the FP4-read traffic). **The GEMM bucket is not winnable on GB10 with available kernels.**
+
+### 4.2 Prefill GDN chunked-scan track - M5 tf32 C=16 is the SHIPPED winner
+GDN is the #1 prefill-gap contributor (+59.2 us/tok, ~30%). vLLM's FLA `chunk_gated_delta_rule` runs the same math at 36.5 vs paged 95.7 us/tok = 2.62x via tensor-core intra-chunk Gram products.
+ +| Lever | What | Verdict | Key number | +|---|---|---|---| +| 0031 scalar-serial chunked scan | FLA-style scalar/serial (`GDN_TC=0`) | superseded | correct but ~22% slower at forced C=16 | +| **0047 / M5 tf32 tensor-core scan** | tf32 `m16n8k8` mma form-T solve, f32-only | **SHIPPED default-on under paged** | MoE prefill +3.5% @npp512, +17.7% @npp2048; decode unchanged; BE-benign | +| bf16 CONFIG-C (M8) | bf16 Kc/Qc + 2 C*C scratch, C->64 | REJECTED (not in f32 series) | confirmed geometry then dropped | +| bf16-C16 | bf16 Gram at C=16 | REJECTED | no win; bf16 mantissa unsafe on state-coupled products | +| BV block-occupancy A/B (tf32) | raise blocks/SM | REJECTED (occupancy NOT the bound) | 1844 vs 1814 S_PP (-1.04%, within noise) | +| bf16-C64 | bf16 Gram at C=64 | REJECTED | -18.75%; O(C^2) intra-chunk + serial recurrence dominates | +| Phase 10 C32 slab M5 | C=32, two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | REJECTED | md5-clean after tail-row zeroing, but slower: MoE 2048 2430.32 -> 2054.86; dense 2048 1019.25 -> 903.73 | +| Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | REJECTED | md5-clean, but slightly slower: MoE 2048 2441.54 -> 2420.26; dense 2048 1021.06 -> 1015.77 | +| Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | GO to one default-off prototype | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic | +| Phase 13 Global-Ai32 | precompute f32 Ai once, consume from two C32 `dv_tile=64` slabs | REJECTED | md5-clean, but slower: MoE 2048 2425.10 -> 2097.76; dense 2048 1016.14 -> 918.19 | + +Why not occupancy/dtype: the cost is the **O(C^2) intra-chunk triangular A-inverse solve + the strictly-serial inter-chunk recurrence**, with C forced to **16** by GB10's 99 KB dynamic-smem cap (the 128x128 f32 state alone is 64 KB). 
M5 captures the tractable TC part; it does not fully close 2.62x because vLLM's FLA blocked-solve is a more complete TC implementation. + +Phase 13 closes the caveat: the default-off `GDN_GLOBAL_AI32=1` prototype was +correctness-clean but slower. Stop GDN kernel work on GB10 instead of iterating +into f16 Ai or more local reorders. + +### 4.3 Decode / fusion levers - all REJECTED (near-parity already at ~86% true GPU-steady) +| Lever | What | Verdict | Key number | +|---|---|---|---| +| act-quant folded into ggml MMQ | inline y-quant in MoE expert MMQ | REJECTED | **-79.4%**; ggml MMQ re-quantizes y per weight-row-tile x stream-k split, no TC for inline quant | +| norm+quant+silu fusion | one launch (vLLM Triton kernel) | REJECTED (infeasible) | `ggml_cuda_can_fuse` cannot express it: FP4 quant is a mul_mat-internal prologue, silu separated from norm by 2 GEMMs + router | +| Q8_0 / FP8 projection | quantize bf16 GDN/attn projections | REJECTED (regime error) | vLLM DOES use FP8 proj, but at N>=128 proj is only ~12% of stream, closes <=6% | +| NVFP4 the projections | drop proj to NVFP4 | REJECTED | KL-fail, ~+6% PPL; vLLM keeps SAME bf16/FP8 proj, never NVFP4 | +| W4A16-Marlin MoE decode | Marlin grouped expert GEMM at decode | REJECTED | BW-floored wash, ~5% slower | +| bf16-tau per-head SSM (0026) | per-head bf16 tau on SSM decode | DROPPED | flat 780.6 vs 780.0 t/s; earlier "+12%" subsumed by 0028/0029 | +| D3 FA-split / D4 GDN-width-adaptive | older off-critical-path levers | SUPERSEDED reasoning | were rejected via the debunked "5.4x/host-bound" reading; under HNP the GDN scan IS critical path (51%) but is the shared BW floor where paged leads (83% vs 79%), so still not a win | + +Dense decode is **AHEAD at low N (116.7% @ N=8)** - the one operating point where paged is unambiguously faster. 
+ +### 4.4 Serving / engine levers - host loop and scheduler CLOSED +| Lever | What | Verdict | Key number | +|---|---|---|---| +| **0040 / S1** paged decode-graph reuse | `can_reuse` keyed on bucketed block-table dims | SHIPPED default-on | serving reuse 0% -> 72.2% (with S3); static 0% -> 95.5% | +| **0041 / S3** decode-shape-stable scheduling (`LLAMA_PAGED_DECODE_STABLE`) | keep prefill out of decode steps | SHIPPED **default-OFF** (opt-in) | recovers the ~17 pt graph-reuse overhead at a TTFT cost; default-on regressed real serving (2.5x worse TTFT, 20-29% lower e2e throughput) | +| **0043 / D1** full-step MoE decode CUDA graph | graph whole decode step incl. grouped-MMQ MoE dispatch | SHIPPED default-on | +2.6% (npl128) to +5-13% (npl32); D1 premise "host-sync on MoE readback" REFUTED (sync count identical 1457 on/off) | +| S2 double-buffer set_inputs | overlap host input build with GPU | DROPPED | `set_inputs` ~0.05 ms/step, nothing to recover | +| whole-step graph / host loop | host loop as serving residual | CLOSED (~0-1%) | reuse 0% (757.6) == S1+S3 72% (763.3); hostproc only ~4-8% of step wall | +| padded / fixed-slot decode | pad decode width to `--parallel` for ~100% reuse | **REJECTED (built, GPU-tested, commit b028c81e)** | inert (BE) but regresses everywhere; N=8 burst 28.16->6.05 tok/s/seq; serving decode is GPU-compute-bound, dummy-row compute > reuse recovered | +| speculative decode (MTP) | draft + verify | **REJECTED for current GB10 serving** | Phase 14 safety passed, but Phase 15 serving A/B regressed hard: n128 decode agg 662.4 -> 138.5 tok/s; likely graph/batch-shape disruption (`graphs reused` 361 -> 1) | + +### 4.5 SHIPPED WINS (all BE / KL-benign) - keep these, do not regress +- **FP4-MMQ MoE/dense GEMM** (native Blackwell FP4-MMA at the FP4 weight-BW floor; this is the reason the rejected 4.1 GEMM levers stay default-off). +- **M5 tf32 tensor-core chunked GDN prefill (patch 0047)**, default-on under `LLAMA_KV_PAGED` (`GDN_TC=5` + `GDN_CHUNK_MIN=64`).
+- **0042 fused residual-add + RMSNorm + weight-mul** (dense S_PP +0.5%, BE). +- **0044 fused GatedRMSNorm + SiLU gate-mul** (672 -> 336 launches @npp512; dense +1.1%, MoE +0.9%, test-backend-ops 12979/12979). +- **0046 GDN-prefill geometry gate** (gates 0022's decode retune by scan length; recovers +7.2% dense prefill, keeps the decode win, BE). +- **SSM decode-fusion stack (0018-0022, 0028)**: in-place state (+23.5%/+18.9%), fused gather (+37.8%/+35.3%), o_proj reshape (+31.7%/+23.3%), conv in-place (+3.2%/+3.5%), occupancy retune (+11.1%/+8.3%) = the **2.26x / 2.46x over stock** decode multiplier. +- **Serving host loop closed (0040 S1, 0043 D1).** +- **The memory advantage** (1.5-3x lower VRAM, NVFP4-resident, no persistent bf16 dequant copies). +- **Low-N decode lead** (dense 116.7% @ N=8). **Bit-exact output per-path** through the whole series. + +### 4.6 REMAINING / unattempted levers + EV +- **Multi-week persistent-Marlin decode kernel** (vLLM's fused-Marlin MoE persistent-tiling + Triton elementwise): the only path to the residual ~14 pt GPU-steady decode gap. **Low-EV**: decode-only ~4-14%, our own ggml Marlin port already lost -19.6%, needs mature tiling + multi-stream overlap (hard inside a single-stream CUDA graph), GB10-uncertain, and **cannot lift the prefill floor**. Not a free bit-exact lever. +- **Datacenter-Blackwell pivot** (B200, ~8 TB/s HBM, native tcgen05/CUTLASS FP4, TMEM): lifts the LPDDR5x GDN bandwidth floor ~30x and restores exactly the vLLM advantages that lose on GB10. **This is the documented path to parity.** Re-run the methodology on the new silicon, do not reopen GB10 levers. + +The `VLLM_PARITY_LEVER_MAP.md` "pursue list" (A1-A7/B1-B7/C1: graph-safe ragged grouped FP4-MMA MoE kernel, FP8 paged KV, MTP spec-decode, etc.) is the **earlier working brainstorm written before the final profiling**. 
`VLLM_PARITY_FINAL.md` is the authoritative supersession; treat those buckets as rejected / infeasible / different-hardware unless re-validated on new silicon. + +Phase 14 re-validated the MTP bucket as safe, then Phase 15 rejected it as a +current GB10 serving-throughput lever. Do not enable it by default and do not +keep tuning draft length blindly. The only plausible follow-up is a graph-reuse +and speculative verification batch-shape profile with +`nsys --cuda-graph-trace=node`. Phase 16 ran that profile and supported the +root cause: small-shape baseline reused graphs (`graphs reused = 62`) while MTP +did not (`graphs reused = 1`) and did ~2.3x more GPU kernel work. The fixed +safety gates stayed green before and after the failed serving A/B: MoE md5 +`8cb0ce23777bf55f92f63d0292c756b0`, dense md5 +`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. + +Phase 17 source inspection found no tiny additive graph-reuse fix. MTP +verification rows are real target decode/output rows (`K + 1` per speculative +slot), so fake padding would touch KV, positions, logits, MTP nextn state, and +rollback semantics. If reopened, start with a server-only shape counter around +`server_slot::handle_last_sampled_token()`. Only then consider an opt-in +group/defer-by-draft-length scheduler experiment, with TTFT/throughput and +md5/op gates as kill criteria. + +Phase 18 added the server-only shape trace as patch 0055. Set +`LLAMA_SPEC_SHAPE_TRACE=1` to log `kind=decode` rows and MTP `kind=verify` +`K + 1` row/output shapes from `server_slot::handle_last_sampled_token()`. +This is default-off instrumentation only. DGX green check after the patch saw +MTP verify shapes vary (`rows=4`, then `rows=3`) on a tiny request, while the +env-unset run emitted no `spec shape:` lines. Canonical post-patch gates passed: +MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. 
+Artifacts: +`/home/mudler/bench/phase18_mtp_shape_trace_green` and +`/home/mudler/bench/phase18_mtp_shape_trace_green/gate_after`. + +Next MTP step, if any: trace real serving shape entropy first. Do not implement +a scheduler change until the trace shows repeatable draft-length buckets worth +grouping. Any scheduler experiment must be opt-in/default-off and killed by +TTFT/throughput regression, graph-reuse failure, md5/op drift, or MTP +rollback/prefix gate failure. + +Phase 19 ran that trace-only serving measurement and rejected the scheduler +shortcut. Artifact: +`/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534`. Pre/post gates +passed with canonical MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 +`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. + +Serving result: + +| n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms | +|---|---------------------|----------------|----------------|------------------|-------------| +| 8 | 245.0 | 95.7 | 39.1% | 1147.2 | 1633.4 | +| 32 | 409.2 | 110.0 | 26.9% | 2710.0 | 4471.5 | +| 128 | 697.2 | 154.0 | 22.1% | 7601.5 | 20310.4 | + +Shape result: `draft=3` already accounts for 96.2-96.9% of verify slots, so +group/defer-by-draft has little to recover. Full in-flight steps already mostly +use all-`draft=3` vectors; the remaining churn is active-slot/tail churn plus +the real `K + 1` verification-row expansion. Do not build a Phase 20 scheduler +experiment on this evidence. Future MTP work would need a deeper target-verify +graph/state design, not another small server scheduling shortcut. + +Phase 62 later ran a gated MTP verify-cost recheck. Artifact: +`/home/mudler/bench/phase62_mtp_verify_cost/20260701_134125`. Pre/post gates +passed with canonical MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 +`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
MTP acceptance +was high (`7372/9340 = 0.789`, mean acceptance length `3.33`), but throughput +remained far below the keep threshold: `0.420x`, `0.274x`, and `0.213x` +baseline decode at n8/n32/n128. Shape trace again showed `draft=3` / `rows=4` +dominance (`95.6%`), with `graphs reused = 1`. Keep current MTP rejected unless +a later target-verify/output-row graph-cost design exists; do not tune +`spec-draft-n-max` blindly. + +Phase 20 refreshed the current-stack MoE serving snapshot against vLLM using the +clean `~/llama-phase6-source` mirror (`f2521ab12`) rather than the stale +`llama-paged-dev` benchmark tree. Artifact: +`/home/mudler/bench/phase20_current_snapshot/20260701_050621`. Pre/post gates +passed with canonical MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 +`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. + +Current MoE serving snapshot (`PTOK=128`, `GEN=64`): + +| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg | +|---|------------------|-----------------|-------------------|-----------|----------|----------------| +| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% | +| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% | +| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% | + +TTFT remains the clearest user-visible gap: paged is 2.88x/3.36x/3.11x slower +than vLLM at n8/n32/n128, and paged prefill_tps is roughly one-third of vLLM. +This keeps the GB10 shortcut closure intact: do not reopen MTP or small +scheduler work. The credible next parity path is a datacenter-Blackwell rerun or +a larger fused-kernel project outside this low-conflict patch stack. + +Phase 21 added a reusable current-stack serving harness: +`backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`. 
+It defaults to `~/llama-phase6-source`, validates docker/`local-ai-worker`/GPU +idle state, uses the owner-file lock, runs pre/post inference gates, compares +paged and vLLM with h2h, and writes ratio summaries. DGX dry run passed at +`/home/mudler/bench/phase21_harness_dryrun/20260701_051757`. + +Use this harness for future current-stack GB10 snapshots. Do not reuse +`~/bench/combined_definitive.sh` unless it is first ported away from stale +`~/llama-paged-dev` paths and old lock assumptions. + +Phase 31 re-verified the patch-series mirror invariant after patch `0057`: +applying every LocalAI `patches/paged/0*.patch` with strict `git apply` on top of +Makefile pin `0ed235ea2c17a19fc8238668653946721ed136fd` produced tree +`4eae628e4ba6f2defa14a19d19f7e4abef9a2647`, exactly matching fork branch +`localai-paged` HEAD `c78e537b5 feat(cuda): trace moe mmq launch shapes`. + +Phase 24 extended `paged-current-serving-snapshot.sh` to write the snapshot +hardware report. DGX dry run passed at +`/home/mudler/bench/phase24_hardware_report_dryrun/20260701_052741`; it recorded +`GPU 0: NVIDIA GB10`, driver `580.159.03`, compute capability `12.1`, and +`hardware_class=gb10_or_workstation_blackwell`. This makes future parity +artifacts self-describing: GB10/workstation Blackwell results must not be used +as datacenter-Blackwell parity evidence. + +Phase 25 extended the same harness to write `gate_summary.tsv`. The summary was +backfilled on the Phase 20 artifact at +`/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv`; +it records pre/post MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 +`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806` as `ok`. + +Phase 26 ran the full audited current-stack snapshot with `hardware.txt`, +pre/post gates, same-session paged and vLLM serving runs, `summary.tsv`, and +`gate_summary.tsv`. Artifact: +`/home/mudler/bench/phase26_audited_snapshot/20260701_053650`. 
Hardware was +recorded as `hardware_class=gb10_or_workstation_blackwell`, GPU `NVIDIA GB10`, +driver `580.159.03`, compute capability `12.1`. Every compact gate row was +`ok`: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 +`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`, both before and +after the serving run. + +Audited current MoE serving snapshot (`PTOK=128`, `GEN=64`): + +| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg | +|---|------------------|-----------------|-------------------|-----------|----------|----------------| +| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% | +| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% | +| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% | + +Use Phase 26 as the current audit-grade GB10 snapshot. It keeps the Phase 20 +verdict intact, but the artifact is more useful for future regressions because +it carries hardware classification and compact pre/post inference gates. + +Phase 27 re-profiled the current clean llama.cpp n128 serving path with +`nsys --cuda-graph-trace=node`. Artifact: +`/home/mudler/bench/phase27_graph_node_serving/20260701_055519`. The run matched +Phase 26 throughput closely (`675.5` vs `673.4` decode_agg_tps) and kept gates +green before and after the profile (post retry): MoE md5 +`8cb0ce23777bf55f92f63d0292c756b0`, dense md5 +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`. The node-traced +buckets still put the work in `gdn_core` (`29.59%`) and `mmq_nvfp4` (`28.44%`); +helper dispatch remains too small (`mm_ids` `0.61%`, `gather_mmq` `0.37%`, +`argsort_topk` `0.40%`). Do not reopen metadata/helper-only MoE dispatch work on +GB10. + +Phase 28 tested the remaining low-conflict NVFP4 grouped-MMQ occupancy knobs. +Artifact: `/home/mudler/bench/phase28_mmq_occupancy/20260701_040450`. 
+`GGML_CUDA_FP4_MINBLOCKS=2` passed md5/op gates before and after serving +(MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`) but regressed +n128 same-session decode serving (`705.1 -> 689.9` decode_agg_tps, `0.9784x`). +`GGML_CUDA_FP4_MMQ_Y=64` failed to compile because the NVFP4 writeback +specialization asserts `nwarps*tile_C::I == mmq_y`. Do not promote either knob; +future grouped-MMQ work must be structural kernel work. + +Phase 29 added the default-off grouped-MMQ shape trace as patch `0056`. +Artifact: `/home/mudler/bench/phase29_mmq_shape_trace/20260701_042428`. +Fork commit: `20a99518a feat(cuda): trace moe mmq batch shapes`. The helper was +added test-first (`test-cuda-mmq-shape-trace`) and built under CUDA on DGX. +Default-off and `LLAMA_MOE_MMQ_SHAPE_TRACE=4` gates both passed: MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`. The trace-enabled +gate emitted exactly four `[LLAMA_MOE_MMQ_SHAPE]` lines. This is evidence-only +instrumentation; it does not close the speed gap. + +Phase 30 used patch `0056` for a live n128 serving shape trace. Artifact: +`/home/mudler/bench/phase30_mmq_shape_serving/20260701_043300`. The first 4096 +grouped-MMQ calls split into 1200 decode-like calls (`ncols_max <= 128`) and +2896 prefill-like calls. Decode-like calls had density `1-4` and selected +`mmq_x_best` only in `{32,40,48,64}`; prefill-like calls were mostly density +`16` and selected `mmq_x_best=128`. All traced calls had `stream_k=1`. Post-run +gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`. + +Phase 31 added patch `0057` for default-off grouped-MMQ launch tracing. +Artifact: `/home/mudler/bench/phase31_mmq_launch_trace/20260701_064424`. +Fork commit: `c78e537b5 feat(cuda): trace moe mmq launch shapes`; DGX mirror +commit: `8b75905e9`. 
The trace adds `[LLAMA_MOE_MMQ_LAUNCH]` lines under +`LLAMA_MOE_MMQ_SHAPE_TRACE=`, recording `ntiles_dst`, `stream_k_blocks`, +tile efficiency, `fixup`, `ntx/nty/ntzw`, and compiled `mmq_x/mmq_y`. Default +off, trace-enabled, and post-serving gates stayed green: MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`. The n128 serving +trace showed decode-like `4800/4800` and prefill-like `4920/4920` launch lines +with `fixup=0` and `stream_k_blocks == ntiles_dst`. Do not pursue a +no-fixup/no-stream-k shortcut for this workload; the remaining grouped-MMQ work +is structural small-M kernel work. + +Phase 32 added patch `0058` for default-off small-M grouped-MMQ candidate +tracing. Artifact: `/home/mudler/bench/phase32_small_m_classifier/20260701_070127`. +Fork commit: `2a9964d29 feat(cuda): trace moe small-m mmq candidates`; DGX +mirror commit: `024f494d0`. The trace adds `[LLAMA_MOE_MMQ_SMALL_M]` lines +under `LLAMA_MOE_MMQ_SMALL_M_TRACE=` for decode-like low-density grouped-MMQ +MoE calls (`ncols_max <= 128`, density `<=4`, `mmq_x_best <=64`). Default-off, +trace-enabled, and post-serving gates stayed green: MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`. The n128 serving +trace found 4096 candidate calls, mostly `mmq_x_best=64` (1800) and `48` +(1096). Phase 33 should A/B a default-off small-M tile policy starting at +`mmq_x=16`. + +Phase 33 added patch `0059`, default-off `LLAMA_MOE_SMALL_M_TILE=`, and +rejected the simple smaller-tile policy. Artifact: +`/home/mudler/bench/phase33_small_m_tile_policy/20260701_071136`. Fork commit: +`fbed2abaa feat(cuda): gate moe small-m mmq tile policy`; DGX mirror commit: +`dfd1eaea8`. Default-off, tile16, tile8, and post-serving gates stayed green: +MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`. 
Same-session n128 +serving rejected both caps: baseline `672.1` decode_agg_tps, tile16 `640.3` +(`0.953x`), tile8 `583.2` (`0.868x`). Do not promote smaller `mmq_x` caps. + +Phase 34 added patch `0060`, default-off `LLAMA_MOE_MMID_ROUTE_TRACE=`, to +classify the live `MUL_MAT_ID` dispatch route without changing route behavior. +Artifact: `/home/mudler/bench/phase34_mmid_route_trace/20260701_072737`. Fork +commit: `6c332094c feat(cuda): trace moe mmid routes`; DGX mirror commit: +`34a256d14`. Default-off, trace-enabled, and post-serving gates stayed green: +MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID` `806/806`. Live n128 serving +with trace cap 4096 found `mmq=2776`, `mmvq=1320`, and `host_sync=0/4096`. +Treat the old current-stack host-sync-fallback concern as refuted for this +workload; the remaining MoE work is grouped-MMQ small-M efficiency or another +measured bucket. + +Phase 35 added patch `0061`, default-off `LLAMA_MUL_MAT_ROUTE_TRACE=`, to +classify regular `MUL_MAT` routes for the projection-heavy serving bucket. +Artifact: `/home/mudler/bench/phase35_mul_mat_route_trace/20260701_074359`. +Fork commit: `486c28c63 feat(cuda): trace mul mat routes`; DGX mirror commit: +`18f7ad005`. Default-off, trace-enabled, and post-serving gates stayed green: +MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` +`806/806`. Live n128 serving with trace cap 8192 found `mat_f=2888`, +`op_cublas=2292`, `mmq=1328`, `vec_q=1214`, `vec_f=470`; BF16 (`type=30`) +was split `mat_f=2485`, `op_cublas=1330`. Next projection work should target +BF16 `mat_f`/`op_cublas` subroute evidence or route policy, not batched cuBLAS. + +Phase 36 added patch `0062`, default-off `LLAMA_CUBLAS_ROUTE_TRACE=`, to +classify the generic cuBLAS `MUL_MAT` subroute without changing branch behavior. +Artifact: `/home/mudler/bench/phase36_cublas_route_trace/20260701_081228`. 
+Fork commit: `38c4ef2e4 feat(cuda): trace cublas routes`; DGX mirror commit: +`e0224393a`. Default-off, trace-enabled, and post-serving gates stayed green: +MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` +`806/806`. Live n128 serving with trace cap 8192 found `bf16_tc=5681` and +`sgemm=2511`. The next projection phase should explain whether the F32 SGEMM +shapes are expected glue tensors or a missed BF16 route; do not chase NVFP4 +cuBLAS or batched cuBLAS for this measured bucket. + +Phase 37 added patch `0063`, extending `LLAMA_CUBLAS_ROUTE_TRACE=` with +`src0`, `src1`, and `dst` tensor names. Artifact: +`/home/mudler/bench/phase37_cublas_name_trace/20260701_083227`. Fork commit: +`2d590d770 feat(cuda): trace cublas tensor names`; DGX mirror commit: +`2cbb61969`. Default-off, trace-enabled, and post-serving gates stayed green: +MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` +`806/806`. Live n128 trace found `bf16_tc=2884`, `sgemm=1212`. The `sgemm` +bucket is `blk.N.ffn_gate_inp.weight -> ffn_moe_logits-N` and +`blk.N.ffn_gate_inp_shexp.weight -> shared_expert_gate-N`; do not force BF16 +without first inspecting model-load tensor types and running KL validation. + +Phase 38 is the current gate-projection policy checkpoint. Artifact: +`/home/mudler/bench/phase38_gate_baseline/20260701_084410`. Preflight showed +docker `0`, `local-ai-worker` `0`, compute apps `0`, and GB10 driver +`580.159.03`. Fresh baseline gates against the Phase37 build passed: MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` +`806/806`. Source comparison found llama.cpp and vLLM both keep router and +shared-expert gate weights unquantized; vLLM's relevant idea is fused F32 gate +weight concatenation, not BF16/NVFP4 routing. 
Future fused-gate work must be +default-off, preserve F32 semantics, and pass md5/op gates before benchmarking; +if md5 changes, run KL first. + +Phase 39 closes the naive fused-gate shortcut. Artifact: +`/home/mudler/bench/phase39_gate_sgemm_profile/phase27_reanalysis`. Re-analysis +of the Phase27 graph-node serving profile showed total kernel time `20.0372s`, +`concat_layout=459.84ms` (`2.29%`, `2250` instances), `cublas_bf16_gemm=1892.81ms` +(`9.45%`), and `cutlass_bf16_gemm=684.01ms` (`3.41%`). Do not implement +graph-time `ggml_concat()` of `ffn_gate_inp.weight` plus +`ffn_gate_inp_shexp.weight`; it risks increasing an existing layout-copy bucket. +The only future fused-gate design worth scoping is a persistent/load-time F32 +combined gate weight with output views, default-off until MoE/dense md5, +`MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates pass. + +Phase 40 closes the tested GB10 max-concurrency C1 shortcut. Artifact: +`/home/mudler/bench/phase40_max_concurrency/20260701_090012`. The snapshot ran +with `PARALLEL=256`, `CTX=262144`, `PTOK=128`, `GEN=64`, `NPL="128 192 256"`, +and `OPS=MUL_MAT,MUL_MAT_ID`. Pre/post gates stayed green: MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` +`806/806`. Paged safely served `n=256`, but vLLM also fit and remained faster: +`paged_decode_over_vllm=0.6354`, `paged_agg_over_vllm=0.4721`, +`paged_ttft_over_vllm=2.9401`. Do not claim GB10 parity from higher max +concurrency at this prompt/gen length and `n<=256`; a future C1 retry must push +beyond this tested point and keep the same md5/op gates. + +Phase 41 records the low-concurrency counterpart to the Phase40 high-concurrency +check. Artifact: +`/home/mudler/bench/phase41_low_concurrency/20260701_091437`. The snapshot ran +with `PARALLEL=32`, `CTX=32768`, `PTOK=128`, `GEN=64`, `NPL="1 8 32"`, and +`OPS=MUL_MAT,MUL_MAT_ID`. 
Pre/post gates stayed green: MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` +`806/806`. Paged is about `0.75x` vLLM decode at `n=1/8` and `0.665x` at +`n=32`; TTFT is `1.38x`, `3.14x`, and `3.40x` vLLM respectively. Do not reopen +D1 from this result: `0043` already ships grouped-MMQ full-step graph capture +default-on, Phase34 found `host_sync=0/4096`, and S3 is default-off because it +regressed TTFT/end-to-end throughput. + +Phase 42 reconciles the target list after parallel read-only review. D1 is +closed on the current GB10 path; GDN low-conflict work is exhausted after +`0046`/`0047` plus the rejected C32/QS-early/Global-Ai32 follow-ups; W4A16/GEMM +micro-tweaks are exhausted after `0033`-`0035` and `0048`-`0050`. It nominated +the Phase38/39 persistent/load-time F32 combined gate projection as the last +small GB10 source candidate. + +Phase 43 rejects that gate-fusion candidate as a small shortcut after source +inspection. `ffn_gate_inp.weight` and `ffn_gate_inp_shexp.weight` are separate +GGUF tensors; the Qwen35MoE graph consumes them in separate matmuls; the loader +can create tensors from GGUF metadata or views of existing tensors, but not a +new persistent derived concatenated weight. A correct implementation would need +a general derived-weight allocation/materialization path across mmap, offload, +split buffers, and MTP blocks. Do not implement a Qwen-only loader hack, and do +not fall back to graph-time `ggml_concat()`. After Phase43 there is no remaining +low-conflict GB10 shortcut justified by current evidence; future work is either +a larger kernel/loader design or a hardware-pivot benchmark, still gated by +MoE/dense md5 plus `MUL_MAT`/`MUL_MAT_ID` and KL if md5 changes. + +Phase 44 makes the current-stack serving snapshot harness ready for hardware +pivots by parameterizing the vLLM side instead of hardcoding the GB10 defaults. 
+`paged-current-serving-snapshot.sh` now accepts `VLLM_GPU_MEMORY_UTILIZATION`, +`VLLM_MAX_MODEL_LEN`, `VLLM_MAX_NUM_SEQS`, `VLLM_TENSOR_PARALLEL_SIZE`, and +whitespace-split `VLLM_EXTRA_ARGS`, and prints the resolved values during +`DRY_RUN=1`. This is not a new benchmark and does not change inference code or +gate behavior. Use it when the next parity run targets datacenter Blackwell or +another non-GB10 vLLM serving shape, while keeping `hardware.txt`, pre/post +MoE/dense md5, `MUL_MAT`/`MUL_MAT_ID`, and KL-if-md5-changes as mandatory gates. + +Phase 45 records the immediate inference-safety guard after Phase44. Artifact: +`/home/mudler/bench/phase45_inference_gate_guard/20260701_094320`. The DGX +phase36 build passed MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID` +`806/806`. Docker, `local-ai-worker`, and GPU compute preflight were all zero +before and after the run. + +Phase 46 removes the last hardcoded `q36` served-model name from the audited +serving snapshot harness. Set `SERVED_MODEL_NAME` to drive vLLM +`--served-model-name`, the vLLM readiness check, and h2h `--model` on both +engines. DGX dry run: +`/home/mudler/bench/phase46_served_model_name_dryrun/20260701_094849`, with +`SERVED_MODEL_NAME=dense-q36` printed during `DRY_RUN=1`. This is harness-only +hardware-pivot readiness, not a throughput result. + +Phase 47 attempted the first dense serving snapshot using the Phase46 override. +Dry-run artifact: +`/home/mudler/bench/phase47_dense_serving_dryrun/20260701_095141`; incomplete +full artifact: `/home/mudler/bench/phase47_dense_serving/20260701_095151`. +Pre-gates were green and the paged dense arm completed through `n=128`, but the +artifact is not a dense parity result because vLLM produced no result JSONs. +Root cause: dense vLLM startup exceeded the old fixed readiness budget, and the +cleanup path could wait indefinitely on the server PID after `SIGTERM`. 
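
The cleanup hang described above (waiting indefinitely on a server PID after `SIGTERM`) is usually avoided by bounding the wait and escalating to `SIGKILL`. The following is a minimal Python sketch of that pattern, not the harness's actual shell code; `stop_server`, `STUBBORN_CHILD`, and the grace period are hypothetical names and values chosen for illustration, and the escalation semantics assume a POSIX host.

```python
import subprocess
import sys

# Hypothetical stand-in for a server that ignores SIGTERM (POSIX only).
STUBBORN_CHILD = (
    "import signal, time\n"
    "signal.signal(signal.SIGTERM, signal.SIG_IGN)\n"
    "print('ready', flush=True)\n"
    "time.sleep(60)\n"
)


def stop_server(proc, grace_s=10.0):
    """Stop a child process with a bounded wait, escalating to SIGKILL."""
    if proc.poll() is not None:
        return "already-exited"
    proc.terminate()                  # sends SIGTERM on POSIX
    try:
        proc.wait(timeout=grace_s)    # bounded wait instead of blocking forever
        return "terminated"
    except subprocess.TimeoutExpired:
        proc.kill()                   # sends SIGKILL on POSIX
        proc.wait(timeout=grace_s)    # reap the child so no zombie is left
        return "killed"


if __name__ == "__main__":
    child = subprocess.Popen(
        [sys.executable, "-c", STUBBORN_CHILD],
        stdout=subprocess.PIPE,
        text=True,
    )
    child.stdout.readline()           # wait until the child ignores SIGTERM
    print(stop_server(child, grace_s=1.0))
    child.stdout.close()
```

The design point is that every wait carries a timeout, so a slow or stuck server can delay a snapshot run by at most a known amount instead of stalling it.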
+ +Phase 48 hardens the serving snapshot harness for that failure mode. It adds +`LLAMA_READY_ATTEMPTS` and `VLLM_READY_ATTEMPTS`, bounds HTTP readiness probes +with `curl --max-time 2`, and uses bounded server cleanup that escalates from +`SIGTERM` to `SIGKILL`. Dry-run artifact: +`/home/mudler/bench/phase48_readiness_harness_dryrun/20260701_100533`, with +`VLLM_READY_ATTEMPTS=700` printed and clean DGX preflight. + +Phase 47 retry completed after Phase48. Artifact: +`/home/mudler/bench/phase47_dense_serving_retry/20260701_100811`. Pre/post +gates were green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` +`806/806`. Dense paged decode beats vLLM at low concurrency (`1.3434x` at `n=1`, +`1.1560x` at `n=8`) but falls behind at `n=32/128` (`0.9036x`, `0.7912x`), and +TTFT remains `1.87x` to `4.05x` vLLM. This does not change the GB10 conclusion. + +Phase 49 removes vLLM log noise from harness-owned environment variables. The +`vllm serve` child now unsets `VLLM_MODEL`, `VLLM_BIN`, +`VLLM_READY_ATTEMPTS`, `VLLM_GPU_MEMORY_UTILIZATION`, `VLLM_MAX_MODEL_LEN`, +`VLLM_MAX_NUM_SEQS`, `VLLM_TENSOR_PARALLEL_SIZE`, and `VLLM_EXTRA_ARGS` while +preserving intentional vLLM runtime variables such as `VLLM_LOGGING_LEVEL`. Dry +run: `/home/mudler/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138`. + +Phase 50 resolves the dense high-N decode-accounting question with a graph-node +difference-method profile. Artifact: +`/home/mudler/bench/phase50_dense_true_decode/20260701_103120`. Pre/post +inference gates on the profiled `build-cuda` binary stayed green: MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID` +`806/806`. Dense `npl=128`, `npp=128` true decode is `383.66 t/s` for paged and +`435.00 t/s` for vLLM, ratio `0.8820`. 
This means Phase47's `0.7912` h2h +decode ratio and `0.5071` aggregate ratio include scheduler/admission and +prefill-overlap/accounting effects beyond the real GPU-steady decode gap. Next +GB10 code work should instrument batch composition/admission in +`server_context::pre_decode()` before attempting another kernel shortcut. + +Phase 51 implements that admission trace in the llama.cpp fork. Local fork +commit: `c6cb8460e feat(server): trace serving admission batches`. The trace is +default-off behind `LLAMA_SERVING_TRACE=1`, adds a small unit-tested accumulator, +and records aggregate `pre_decode()` scheduler shape: decode tokens, prompt +tokens admitted, waiting prompt slots, started/continued prompt slots, +decode-only steps, `n_batch`, `n_ubatch`, `prefill_budget_step`, and +`prefill_cap_per_slot`. DGX artifact: +`/home/mudler/bench/phase51_serving_admission_trace/20260701_110130`. The +patched `build-cuda` CTest passed and inference gates stayed green: MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` +`806/806`. Push and LocalAI patch-series regeneration are still pending because +push requires explicit approval. + +Phase 52 uses the Phase51 trace on DGX for dense `n=128`, `ptok=128`, `gen=64`. +Artifact: `/home/mudler/bench/phase52_dense_admission_trace/20260701_111017`. +Pre/post md5 and op gates stayed green. The clean traced h2h row was +`decode_agg_tps=360.5`, `prefill_tps=629.5`, `ttft_mean_ms=23171.5`, wall +`58.921s`. The admission trace reported `steps=76`, `decode_only_steps=0`, +`decode_tokens=8064`, `prompt_tokens=22785`, `max_waiting_prompt_slots=35`, +`started_prompt_slots=128`, `continued_prompt_slots=139`, +`prefill_budget_step=0`, and `prefill_cap_per_slot=0`. The prompt token count +matches h2h exactly, so this is the target request. The next GB10 lever should +be a default-off scheduler/admission A/B or a per-step histogram trace, not an +immediate GDN/GEMM rewrite. 
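
To make the Phase 52 counters easier to compare across runs, they can be reduced to per-step averages. The sketch below is plain arithmetic on the aggregate totals reported above, not an additional measurement; the function name and output keys are hypothetical.

```python
def summarize_admission(steps, decode_only_steps, decode_tokens, prompt_tokens):
    """Reduce aggregate admission-trace counters to per-step averages."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    return {
        "decode_tokens_per_step": decode_tokens / steps,
        "prompt_tokens_per_step": prompt_tokens / steps,
        "decode_only_share": decode_only_steps / steps,
    }


# Totals copied from the Phase 52 dense n=128 trace.
phase52 = summarize_admission(
    steps=76, decode_only_steps=0, decode_tokens=8064, prompt_tokens=22785
)
for key, value in phase52.items():
    print(f"{key}: {value:.1f}")
```

On these totals the averages come out to roughly 106.1 decode tokens and 299.8 prompt tokens per step, with a decode-only share of zero. That is consistent with prompt admission being present throughout the run, which is why a per-step histogram is the natural next trace.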
+ +Phase 53 tested the existing runtime admission-budget knobs instead of adding +new code. Artifact: +`/home/mudler/bench/phase53_dense_admission_budget_sweep/20260701_111915`. +Pre/post gates stayed green. Dense `n=128` results: default Phase52 `agg=139.0`, +`decode_agg=360.5`, `prefill=629.5`, `TTFT=23171.5ms`, wall `58.921s`; +`T=1536 cap=512` `agg=134.4`, `decode_agg=376.7`, `prefill=607.0`, +`TTFT=22263.7ms`, wall `60.968s`; `T=1024 cap=512` `agg=130.0`, +`decode_agg=392.4`, `prefill=565.2`, `TTFT=23234.3ms`, wall `63.003s`. +Decision: simple budget shrinkage is rejected. It raises h2h decode-agg while +lowering aggregate/prefill throughput, and it does not materially solve TTFT. +Next scheduler work should be per-step histograms or a targeted first-token +admission policy. + +Phase 54 through Phase 59 tested that targeted scheduler path. The fork commits +are still local-only and default-off: + +- `c6cb8460e feat(server): trace serving admission batches` +- `bd7b2e952 feat(server): add admission trace histograms` +- `8a97629a4 feat(server): add TTFT prefill-first scheduler mode` +- `3b6ab5fa8 feat(server): cap TTFT prefill-first decode deferral` +- `8759213e3 feat(server): gate TTFT defer by prompt backlog` + +Phase59 is the current verdict. Artifact: +`/home/mudler/bench/phase59_moe_min32_repeat_vllm/20260701_123147`. Pre/post +llama gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` +`806/806`. MoE `n=128`, `ptok=128`, `gen=64` repeated the Phase58 min32 signal: +llama default `agg=336.6`, `TTFT=7798.5ms`, wall `24.334s`; llama min32 +`agg=336.9`, `TTFT=7167.8ms`, wall `24.316s`. Matching vLLM was still +`agg=601.3`, `TTFT=2968.1ms`, wall `13.563s`. + +Decision: keep `LLAMA_TTFT_PREFILL_FIRST=1` and +`LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` as an opt-in llama.cpp latency/QoS +knob. It does not prove vLLM parity progress by itself. 
Do not default it until +more workload coverage exists, and do not regenerate LocalAI patches until the +fork commits are pushed with explicit approval. + +Phase 60 re-profiled the current W4A16 grouped MoE prefill path to check whether +there was still a low-conflict W4A16 shortcut after Phase1-5. Artifact: +`/home/mudler/bench/phase60_w4a16_current_profile/20260701_104915`. Pre/post +gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` +`806/806`. Default FP4-MMQ S_PP was `2327.69` at `npp=512` and `2423.20` at +`npp=2048`; forced W4A16 was `1451.00` and `1482.76`, only `0.623x` and +`0.612x` of default. The `npp=512` profile showed W4A16 still dominated by +`w4a16_grouped_kernel` (`4.142s`, `42.5%`) plus sorted activation gathers +(`1.094s`, `11.2%`), while the cast kernel was only `0.517s` (`5.3%`). + +Decision: do not add another small W4A16 metadata/body/cast patch. Future W4A16 +work needs a larger redesign that improves the grouped kernel body and removes +or fuses sorted activation movement. Near-term GB10 parity work should return to +broader prefill/GDN/MoE design or hardware-pivot benchmarking. + +Phase61 is scoped as that larger W4A16 kill-gate, not as a committed code +change: `docs/superpowers/plans/2026-07-01-w4a16-direct-activation-phase61.md`. +It proposes a default-off `LLAMA_W4A16_DIRECT_A=1` experiment that consumes the +original activation tensor plus the existing `ids_to_sorted` map directly, +removing Phase60's sorted activation gather and separate cast kernels before any +grouped-kernel body rewrite. Keep it only if it improves forced W4A16 S_PP by at +least `+12%` and reaches at least `0.75x` default FP4-MMQ; otherwise reject and +do not continue W4A16 body tuning. + +Phase61 result: rejected. 
The direct-A kernel passed correctness after matching +`get_rows_cuda` flat-row addressing (`MUL_MAT_ID` `806/806`; forced/direct-A +MoE transcript md5 both `07db32c2bcb78d17a43ed18bc22705cd`) and default gates +remained green (`8cb0ce23`, `5951a5b4`, `MUL_MAT` `1146/1146`, `MUL_MAT_ID` +`806/806`). But direct-A only improved forced W4A16 S_PP `1471.05 -> 1566.30` +at `npp=512` and `1502.46 -> 1605.82` at `npp=2048` (`+6.5%` / `+6.9%`), still +just `0.67x` / `0.66x` of default FP4-MMQ. The direct kernel diff was not +committed; only the safe policy/routing stub remains in the fork. Do not pursue +more W4A16 body tuning on GB10 as the next parity lever. + +--- + +## 5. METHODOLOGY LESSONS (so you do not repeat the mistakes) + +1. **Profile, don't assume. The analysts were wrong 4 times.** Every one was caught only by an in-backend A/B or a corrected profile: + - **GDN-scalar grep** (assumed the scan was scalar/serial from reading source) - wrong, retired by the tensor-core port. + - **dense-cuBLAS reroute** (assumed dequant->bf16 would win) - wrong, -31% to -62%. + - **occupancy** (assumed blocks/SM was the GDN bound) - wrong, 1844 vs 1814 within noise. + - **projection-regime** (assumed FP8/NVFP4 projections were a big lever) - wrong, projections are ~12% of the decode stream at high N. + **In-backend A/B is the only truth.** A standalone PoC win (0034) is not a result. +2. **Per-kernel us/tok overstates end-to-end S_PP/S_TG.** A kernel that is X% faster in isolation does not move throughput X%; always confirm against the end-to-end batched-bench / serving number. +3. **The CUDA-graph-trace decode artifact (the big one).** Decode is a replayed graph; nsys without `--cuda-graph-trace=node` collapses it and lies. This single trap produced the wrong "host-bound / 159 us/tok / 56%" story across multiple analyses. Always graph-node-trace + difference method (section 3.4). +4. 
**Beware GPU contention skewing absolutes.** The box runs concurrent quant/repack/finetune jobs. Gate on idle GPU + free lock; prefer the same-session both-engine harness so both numbers move together. +5. **The vLLM server number is inflated ~8 pt vs its true GPU-steady.** vLLM's chunked-prefill-overlap inflates its own server-measured decode window (1177 server vs 1078 true GPU-steady). Compare GPU-steady to GPU-steady, or you will chase a phantom gap. The reconciliation chain that must sum: vLLM server 1177 (100%) -> vLLM true GPU-steady 1078 (92%) -> llama GPU-steady 924 (78.5% of 1177, = 86% of 1078) -> llama server 718 (60.7%, the S3-recoverable serving overhead). + +--- + +## 6. THE THREE FORWARD DIRECTIONS + +### (a) Close / ship the record (lowest effort, do this first) +The investigation is closed for GB10 shortcuts, and the closeout chores below +are now done: + +- patch `0044` is tracked in the LocalAI series; +- the Makefile pin `0ed235ea2c17a19fc8238668653946721ed136fd` is the + authoritative paged pin; +- Phase 20 re-ran the current-stack serving snapshot on the clean mirror; +- Phase 22 re-verified the patch-series mirror invariant after `0055`. + +For future release checks, run `paged-inference-gates.sh` and +`paged-current-serving-snapshot.sh` from the LocalAI backend tree. The inference +gate now defaults to both `MUL_MAT` and `MUL_MAT_ID`; set `OPS=` only for a +focused diagnostic run. + +### (b) Datacenter-Blackwell pivot (THE real parity path) +The thesis: every vLLM advantage that wins on GB10 is a kernel that is **broken or capped on consumer Blackwell** and **inverts on datacenter Blackwell** (B200): FLA blocked-solve GDN, Marlin/CUTLASS grouped FP4, HBM-tuned full-cudagraph decode, native tcgen05/TMEM. ~8 TB/s HBM lifts the LPDDR5x GDN bandwidth floor ~30x. Concrete first steps: +1. Acquire a B200 (or equivalent HBM tcgen05 part). 
Reproduce the **both-engine same-session harness** there (`combined_definitive.sh` discipline): build the stock and paged binaries, build vLLM 0.23.0+, run MoE + dense prefill + serving for both engines. +2. Re-measure the FP4 path: on B200, native CUTLASS NVFP4 grouped-GEMM should work (the CUTLASS #3096 / TMA-WS failure is consumer-Blackwell-specific). Confirm whether vLLM now runs **native FP4** instead of Marlin W4A16. If so, the 4.1 GEMM track must be re-evaluated from scratch (it was rejected on a GB10-specific ceiling). +3. Re-take the decode profile with `--cuda-graph-trace=node`; the GDN scan that floors at 273 GB/s on GB10 should no longer dominate at HBM bandwidth - re-derive the per-token decomposition before choosing any lever. + +### (c) Multi-week persistent-Marlin decode kernel (decode-only, low-EV, CANNOT reach parity) +Only pursue if (a)+(b) are not options and someone explicitly wants the residual decode gap closed on GB10. It targets the ~14 pt GPU-steady decode gap (vLLM's fused-Marlin MoE persistent-tiling + single Triton elementwise). Concrete first steps: +1. Re-confirm the ceiling first: our own ggml Marlin port already lost -19.6% at decode (4.3), so the bar is "beat that and beat FP4-MMQ at the decode BW floor". +2. Prototype the persistent-tiling grouped-FP4 MoE kernel **standalone**, then prove it **in-backend** (a PoC win is not a result, per 0034). It must live inside a single-stream CUDA graph or bring its own multi-stream overlap. +3. Bound the upside honestly: this is **decode-only ~4-14%** and **does nothing for the prefill floor (36-43%)**, so it does not reach parity. Record the verdict either way. + +--- + +## 7. KEY FILE / ARTIFACT INDEX + +### Fork (canonical source of truth) +- Local canonical fork: `/home/mudler/_git/llama.cpp`, branch **`localai-paged`**, HEAD `2d590d770` ("trace cublas tensor names", patch `0063`). 
+- DGX current clean mirror/build tree: `dgx:~/llama-phase6-source`, HEAD `2cbb61969` with the Phase 37 cuBLAS tensor-name trace patch applied and committed; Phase 20/26/27 artifacts still record their historical source hashes. +- Historical DGX dev tree: `dgx:~/llama-paged-dev`, branch **`paged`**, HEAD `a7d439e8ce6990eb09721223c975da4e49d8d136` ("GDN CONFIG C (M8) - bf16 Kc/Qc"). It is an old experimental tree and must not be treated as canonical. + +### LocalAI worktree +- Path: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention`, branch `worktree-feat+paged-attention` (currently 246 ahead, 31 behind `origin/master`; recompute before reporting). +- Backend dir: `backend/cpp/llama-cpp-localai-paged/` (`Makefile` thin wrapper, `package.sh`, `run.sh`, `README.md` ~44 KB canonical, `docs/`, `patches/paged/`). +- `docs/`: `VLLM_PARITY_FINAL.md` (authoritative record), `VLLM_PARITY_LEVER_MAP.md` (working brainstorm, profile-validated section), `DECODE_SERVING_SCOPE.md`, `PREFILL_GEMM_SCOPE.md`, `PREFILL_GEMM_RESULTS.md`, `TENSORCORE_GDN_SCOPE.md`, `TENSORCORE_GDN_BUILD_PLAN.md`, `ACCELERATOR_PORTING_SCOPE.md`, `UPSTREAM_LAYER2_SCOPE.md`, `LOCALAI_LLAMACPP_BACKEND_PLAN.md`, `PAGED_BITEXACT_NOTE.md`, `PATCH_MAINTENANCE.md`, `final_benchmark.csv`, `paged-burst-bench.cpp`, `paged-reclaim-unit.cpp`, 3 PNGs, and this `PARITY_HANDOFF.md`. +- `patches/paged/`: **54** `.patch` files spanning 0001-0063 with intentional gaps (missing 0005, 0026 [dropped ssm_bf16_tau], 0027, 0032, 0036-0039, 0045). 
Core paged-KV 0001-0012; decode-first scheduler 0013/0016; serving graph reuse 0040/0041; prefill fusions 0042/0044; SSM/GDN decode 0018-0022/0028; MoE NVFP4 quant 0023/0025/0043; FP4-MMA/Marlin scaffolds 0033/0034/0035 (default-off); GDN tensor-core prefill 0031 -> 0046 (geometry gate) -> 0047 (f32-only M5, default-on under paged KV); W4A16 packed metadata/shape/padding is 0048-0050; MoE safety tests are 0051-0053; MTP backend-sampling safety is 0054; speculative shape trace is 0055; MoE MMQ selector/launch/candidate/tile-policy/route instrumentation is 0056-0060; regular MUL_MAT route instrumentation is 0061; cuBLAS route instrumentation is 0062-0063. + +### Bench artifacts (DGX) +- `~/bench/COMBINED_DEFINITIVE.txt` (+ `.log`, `.done`, `combined_definitive.sh`, `combined_definitive.out`) - historical same-session both-engine run. +- `~/bench/phase20_current_snapshot/20260701_050621` - current clean-stack paged-vs-vLLM MoE serving snapshot. +- `~/bench/phase21_harness_dryrun/20260701_051757` - current snapshot harness dry-run artifact. +- `~/bench/phase24_hardware_report_dryrun/20260701_052741` - current snapshot harness dry run proving `hardware.txt` captures the DGX as `hardware_class=gb10_or_workstation_blackwell`. +- `~/bench/phase25_gate_summary_dryrun/20260701_053353` - dry run after adding `gate_summary.tsv` support; normal dry-run still writes `hardware.txt` and does not emit a gate summary before gates exist. +- `~/bench/phase26_audited_snapshot/20260701_053650` - current audit-grade full paged-vs-vLLM MoE serving snapshot with `hardware.txt`, pre/post gates, `summary.tsv`, and `gate_summary.tsv`. +- `~/bench/phase27_graph_node_serving/20260701_055519` - current clean llama.cpp n128 serving profile captured with `--cuda-graph-trace=node`, pre/post retry gates green. +- `~/bench/phase28_mmq_occupancy/20260701_040450` - NVFP4 MMQ occupancy build-knob A/B; `MINBLOCKS=2` gate-safe but serving-regressed, `MMQ_Y=64` compile-rejected. 
+- `~/bench/phase29_mmq_shape_trace/20260701_042428` - default-off MoE MMQ shape trace patch `0056`; CUDA build plus default/trace md5 gates green. +- `~/bench/phase30_mmq_shape_serving/20260701_043300` - live n128 serving MMQ shape distribution from patch `0056`; post-run md5/op gates green. +- `~/bench/phase31_mmq_launch_trace/20260701_064424` - default-off MoE MMQ launch trace patch `0057`; default/trace/post-serving md5 gates green; n128 launch trace rejects stream-k/fixup shortcut (`fixup=0`, `stream_k_blocks == ntiles_dst`). +- `~/bench/phase32_small_m_classifier/20260701_070127` - default-off MoE MMQ small-M classifier patch `0058`; default/trace/post-serving md5 gates green; n128 trace found 4096 candidate calls. +- `~/bench/phase33_small_m_tile_policy/20260701_071136` - default-off MoE MMQ small-M tile policy patch `0059`; tile16/tile8 md5/op safe but both slower in n128 serving. +- `~/bench/phase34_mmid_route_trace/20260701_072737` - default-off MoE MMID route trace patch `0060`; default/trace/post-serving md5 gates green; n128 route trace found `mmq=2776`, `mmvq=1320`, `host_sync=0/4096`. +- `~/bench/phase35_mul_mat_route_trace/20260701_074359` - default-off regular MUL_MAT route trace patch `0061`; default/trace/post-serving md5 gates green; n128 route trace found BF16 `mat_f=2485`, `op_cublas=1330`. +- `~/bench/phase36_cublas_route_trace/20260701_081228` - default-off cuBLAS subroute trace patch `0062`; default/trace/post-serving md5 and op gates green; n128 route trace found `bf16_tc=5681`, `sgemm=2511`. +- `~/bench/phase37_cublas_name_trace/20260701_083227` - cuBLAS tensor-name trace patch `0063`; default/trace/post-serving md5 and op gates green; n128 trace identified `sgemm` as MoE gate logits and shared-expert gate projections. 
+- `~/bench/phase38_gate_baseline/20260701_084410` - current Phase37 build baseline before gate-projection policy work; docker/local-ai-worker/GPU idle preflight green; MoE/dense md5 green; `MUL_MAT` `1146/1146`; `MUL_MAT_ID` `806/806`. +- `~/bench/phase39_gate_sgemm_profile/20260701_085211` - short completion profile, diagnostic only because `-n 32` is not a canonical md5 gate; useful for confirming graph-time concat is a real kernel path. +- `~/bench/phase39_gate_sgemm_profile/phase27_reanalysis` - Phase27 serving profile re-analysis used to reject graph-time fused gate weight concat; `concat_layout=459.84ms` (`2.29%`) in the serving kernel window. +- `~/bench/phase40_max_concurrency/20260701_090012` - max-concurrency C1 check at `NPL=128/192/256`, `PTOK=128`, `GEN=64`, `PARALLEL=256`, `CTX=262144`; pre/post MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates green, but vLLM also fit at `n=256` and stayed ahead (`paged_decode_over_vllm=0.6354`, `paged_agg_over_vllm=0.4721`). +- `~/bench/phase41_low_concurrency/20260701_091437` - low-concurrency serving check at `NPL=1/8/32`, `PTOK=128`, `GEN=64`, `PARALLEL=32`, `CTX=32768`; pre/post MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates green; paged is `0.7493`, `0.7518`, and `0.6649` of vLLM decode at `n=1/8/32`, with TTFT still much worse by `n=8/32`; does not reopen D1. +- `~/bench/phase44_hardware_pivot_harness_dryrun/20260701_094038` - harness-only dry-run artifact proving the vLLM serving config overrides are printed and preflighted before any server starts. +- `~/bench/phase45_inference_gate_guard/20260701_094320` - post-Phase44 inference guard; MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` backend-op gates green. +- `~/bench/phase46_served_model_name_dryrun/20260701_094849` - harness-only dry-run artifact proving `SERVED_MODEL_NAME` is printed and preflighted before any server starts. +- `~/bench/phase47_dense_serving_dryrun/20260701_095141` - dense serving dry-run with `SERVED_MODEL_NAME=dense-q36`. 
+- `~/bench/phase47_dense_serving/20260701_095151` - incomplete dense serving attempt; pre-gates and paged arm completed, vLLM did not produce result JSONs under the old readiness budget. +- `~/bench/phase48_readiness_harness_dryrun/20260701_100533` - harness dry-run proving configurable readiness budgets and clean preflight before retrying dense serving. +- `~/bench/phase47_dense_serving_retry/20260701_100811` - completed dense serving snapshot after Phase48; pre/post md5 and op gates green; paged low-N decode ahead, high-N aggregate and TTFT behind. +- `~/bench/phase49_vllm_env_hygiene_dryrun/20260701_102138` - harness dry-run after scrubbing harness-owned `VLLM_*` variables from the `vllm serve` child environment. +- `~/bench/phase50_dense_true_decode/20260701_103120` - dense graph-node difference-method profile at `npl=128`, `npp=128`; `build-cuda` pre/post md5 and op gates green; true decode paged `383.66 t/s`, vLLM `435.00 t/s`, ratio `0.8820`, pointing next at serving admission/scheduler tracing. +- `~/bench/phase51_serving_admission_trace/20260701_110130` - default-off `LLAMA_SERVING_TRACE=1` fork commit `c6cb8460e`; DGX patched `build-cuda` CTest and md5/op gates green; push and LocalAI patch-series mirror pending approval. +- `~/bench/phase52_dense_admission_trace/20260701_111017` - clean dense `n=128` admission trace; pre/post gates green; `decode_only_steps=0`, `prompt_tokens=22785`, `max_waiting_prompt_slots=35`; next lever is scheduler/admission A/B or per-step histogram trace. +- `~/bench/phase53_dense_admission_budget_sweep/20260701_111915` - runtime sweep of `LLAMA_MAX_BATCH_TOKENS=1536/1024` with `LLAMA_PREFILL_CAP=512`; pre/post gates green; simple budget shrinkage rejected because aggregate/prefill throughput regressed and TTFT did not materially improve. +- Per-engine logs `~/bench/COMBINED_{paged,vllm}_{MOE,DENSE}_server.log`; `~/bench/BENCHMARK_PROGRESS.md`. 
+- Graph-node-traced high-N profiles: `~/highN_prof2/*.nsys-rep` (paged npl=256), `~/highN_vllm/*.nsys-rep` (vLLM), 2026-06-30. +- A/B dirs: `~/bench/marlin_gate/`, `~/bench/gdn_p1_ab/`. + +### Recent context commits +- `6edbb56b0` "docs(paged): definitive vLLM-parity final-state record (GB10, CLOSED)" - adds `VLLM_PARITY_FINAL.md`. +- `baf102524` "docs(paged): correct decode-serving record to ~86% GPU-steady parity (graph-node-traced)" - the ~56% -> ~86% correction. +- `bd100dd20` "fix(paged): repair the patch series, sync to the fork branch" - dropped dev-tree 0044/0045, added f32-only M5 as 0047. +- `b028c81ed` "docs(paged): record padded/fixed-slot decode shape as tested-and-rejected". + +### Discrepancies to flag / resolve (carried verbatim from the gather, including UNVERIFIED labels) +1. **Pin prose reconciled in this worktree.** Makefile line 52 `LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd` is authoritative and matches the local fork merge-base. Hard rule: the paged pin must equal the stock `llama-cpp` pin (shared `grpc-server.cpp`); a bump to `c299a92c` once broke the grpc-server link despite being bit-exact and was reverted. Trust the Makefile when building. +2. **Current fork/mirror are clean and verified.** Local fork HEAD is `2d590d770`, DGX clean mirror HEAD is `2cbb61969`, and Phase 37 should be treated as the current patch-series tip. The old `llama-paged-dev` tree is historical only. +3. **Worktree patch series is tracked through 0063.** The only expected unrelated untracked path in this worktree is `.claude/`. +4. **`sm_121a` is not in the worktree build files** - it lives only in the DGX experimental build scripts (`gdn_cc.sh`, `gdn_bv_build.sh`, `paged-build.sh`); mainline uses arch `121`. **UNVERIFIED** whether the shipped CI Dockerfile build path injects `121a` for the FP4-MMA kernels (`Dockerfile.llama-cpp-localai-paged` does not hardcode a CUDA arch). +5. 
**The `0921716...` paged-MoE md5 open item.** `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=0921716cd0582b5d15af8c362b811d00` for MoE, but a full doc/patch/`git log -S` grep of the worktree found **no** occurrence of `0921716...` in any committed source; the committed canonical paged-MoE gate is `8cb0ce23`. Treat this as **unreconciled**: the documented, KL-validated paged-MoE gate remains `8cb0ce23`, and any paged-MoE divergence (including `0921716`) must be KL-validated against the f16 reference before being accepted as benign, never on assertion alone. The `0921716` value is **UNVERIFIED** as a sanctioned gate; do not adopt it as canonical without re-running the KL gate. The **dense** run is symmetric: `COMBINED_DEFINITIVE.txt` records `PAGED_GATE_MD5=ecfe924dee6c5622c149f419ff2a6481` for dense, which likewise differs from the canonical dense gate `5951a5b4`. Both CDEF `PAGED_GATE_MD5` values come from the `combined_definitive.sh` harness's own gate command, NOT the canonical bit-exact gate command in section 3.3, which is why they diverge from the committed `8cb0ce23` / `5951a5b4`; neither is a sanctioned gate and both must be KL-validated before being treated as benign. + +--- + +## 8. PHASE63 RESULT: PREFILL BUCKET ATTRIBUTION + +Phase63 is complete as a measurement-only no-go. The plan is +`docs/superpowers/plans/2026-07-01-prefill-bucket-attribution-phase63.md`; the +DGX artifact is `/home/mudler/bench/phase63_prefill_bucket/20260701_140127`. + +Pre/post gates stayed green: + +- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`; +- dense md5 `5951a5b4d624ce891e22ab5fca9bc439`; +- `MUL_MAT` `1146/1146`; +- `MUL_MAT_ID` `806/806`. + +The candidate paged FlashAttention mask/block-table cleanup is rejected for now: +llama.cpp FA is only `0.71%` at `npp=512` and `1.18%` at `npp=2048`; the +`npp=2048` cross-engine FA delta is about `1.7 us/tok`, not the `15 us/tok` +needed to fund source work. No llama.cpp source files were modified. + +*Status: Phase63 closed. 
`VLLM_PARITY_FINAL.md` remains the GB10 shortcut record; +the remaining measured buckets are still MoE/FFN GEMM, GDN, bf16 projections, +layout copies, and activation quantization.* + +## 9. PHASE64 RESULT: LAYOUT TRACE + +Phase64 added default-off layout attribution in the llama.cpp fork: +`fa944bb5f feat(cuda): trace layout tensor names`. The env gate is +`LLAMA_LAYOUT_TRACE=`. It traces CUDA `GET_ROWS`, `CPY`, `CONT`, `DUP`, and +`CONCAT` runtime dispatch with tensor names, types, shapes, and contiguity flags. + +DGX artifact: `/home/mudler/bench/phase64_layout_trace/20260701_142519`. +Patched build gates stayed green: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, +`MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. + +Trace result at MoE `npp=512`, `ntg=4`, `npl=32`: + +- `get_rows`: `7268` +- `cpy`: `2008` +- `cont`: `1734` +- `concat`: `990` + +The named layout sources are GDN conv-state gather/concat/update +(`cache_r_lN`, `conv_states_reshaped-N`, `qkv_mixed_transposed-N`, +`conv_input-N`, `conv_state_update-N`), MoE top-k fan-in gathers +(`ffn_moe_probs-N`, `ffn_moe_topk-N`, `ffn_moe_weights-N`), and paged-attention +mask/KV reshape/copy paths. This does not fund a clean layout optimization yet; +it gives Phase65 the exact names needed to either remove one repeated chain or +reject it with evidence. + +## 10. PHASE65 RESULT: QUANT TRACE + +Phase65 added default-off activation-quant route attribution in the llama.cpp +fork: `afc2c7030 feat(cuda): trace activation quant routes`. The env gate is +`LLAMA_QUANT_TRACE=`. DGX mirror commit: `7863194bd`. + +DGX artifact: `/home/mudler/bench/phase65_quant_trace/20260701_143729`. +Patched build gates stayed green: MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, +`MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. 
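
The short forms `8cb0ce23` and `5951a5b4` are prefixes of the full canonical md5s recorded in the earlier phases. A minimal sketch of a "gates green" check that accepts either form is shown below. It does not reproduce `paged-inference-gates.sh`; the function name and input shapes are hypothetical, and the canonical values are copied from this document.

```python
CANONICAL_MD5 = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}
CANONICAL_OPS = {"MUL_MAT": (1146, 1146), "MUL_MAT_ID": (806, 806)}


def gate_failures(md5s, ops):
    """Return a list of gate failures; an empty list means all gates are green."""
    failures = []
    for model, expected in CANONICAL_MD5.items():
        got = md5s.get(model, "").lower()
        # Accept a full hash or an abbreviation of at least 8 hex characters.
        if len(got) < 8 or not expected.startswith(got):
            failures.append(f"{model} md5 mismatch: {got or 'missing'}")
    for op, expected in CANONICAL_OPS.items():
        if ops.get(op) != expected:
            failures.append(f"{op} expected {expected}, got {ops.get(op)}")
    return failures


observed_md5 = {"moe": "8cb0ce23", "dense": "5951a5b4"}
observed_ops = {"MUL_MAT": (1146, 1146), "MUL_MAT_ID": (806, 806)}
print(gate_failures(observed_md5, observed_ops) or "gates green")
```

A mismatch here only flags the run. Per the rules elsewhere in this document, a changed md5 still has to go through the KL gate before it can be treated as benign.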
+ +Trace result at MoE `npp=512`, `ntg=4`, `npl=32`: + +- `mmq_dense`: `4444` +- `mmq_moe_dedup_unique`: `2960` +- `mmq_moe_gather`: `2960` +- `mmq_moe_flat`: `1480` + +The dominant default-path shapes are MoE gate/up expert activation quant +deduplication (`K=2048`, `rows=512`) followed by gather to expert-token rows +(`rows=4096`), shared-expert dense gate/up quantization (`K=2048`, `rows=512`), +MoE down expert flat quantization (`K=512`, `rows=4096`), and shared-expert down +quantization (`K=512`, `rows=512`). This confirms the activation-quant bucket is +concentrated in named MoE/shared-expert FFN paths, but it does not prove whether +`gather_mmq_fp4` is material or just a cheap cost of the existing dedup win. +Phase66 should time `quantize_mmq_nvfp4` versus `gather_mmq_fp4` with nsys/NVTX +before funding any behavior-changing source patch. + +## 11. PHASE66 RESULT: QUANT KERNEL TIMING + +Phase66 timed the Phase65 candidate kernels directly with Nsight Systems. +Artifact: `/home/mudler/bench/phase66_quant_kernel_timing/20260701_144256`. +Profile: `quant_npp512.nsys-rep`; summary: +`quant_npp512_kern_sum_cuda_gpu_kern_sum.csv`. + +Shape: MoE `npp=512`, `ntg=4`, `npl=32`. Total GPU kernel time: +`7108388986 ns`. + +| kernel | time | instances | share | +|--------|-----:|----------:|------:| +| `quantize_mmq_nvfp4` | `317205504 ns` | `8884` | `4.46%` | +| `gather_mmq_fp4` | `45374880 ns` | `2960` | `0.64%` | +| combined | `362580384 ns` | - | `5.10%` | + +Decision: reject a Phase66 gather/quant source patch. The gather is too small +to target, and quantize plus gather is below the `8%` source-funding threshold. +Do not reopen W4A16/no-activation-quant from this evidence; that larger rewrite +was already rejected in earlier phases. + +## 12. PHASE67 RESULT: BF16 CUBLAS F32 OUTPUT + +Phase67 added a default-off BF16 projection shortcut in the llama.cpp fork: +`ea0875d14 feat(cuda): gate BF16 cuBLAS F32 output`. The env gate is +`LLAMA_BF16_CUBLAS_F32_OUT=1`. 
DGX mirror commit: `14fd69f1e`. + +DGX artifact: `/home/mudler/bench/phase67_bf16_f32_out/20260701_144909`. +Default and opt-in gates stayed green: MoE md5 `8cb0ce23`, dense md5 +`5951a5b4`, `MUL_MAT 1146/1146`. + +Same-window MoE prefill A/B: + +| npp | default S_PP | opt-in S_PP | change | +|-----|-------------:|------------:|-------:| +| `512` | `2347.41` | `2402.34` | `+2.34%` | +| `2048` | `2440.18` | `2456.54` | `+0.67%` | + +The opt-in `npp=512` profile removed the BF16-to-F32 conversion row: +`convert_unary<__nv_bfloat16, float>` became `0 ns`, `0` instances. Keep this +as default-off for now. It is correctness-clean and measurable, but the win is +small and needs dense plus serving A/B before any default-on decision. + +## 13. PHASE68 RESULT: BF16 F32 OUTPUT DENSE + SERVING A/B + +Phase68 reused Phase67 source unchanged. Plan: +`docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md`. +DGX artifact: `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710`; +serving A/B artifact: +`/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710/serving_ab_20260701_150249`. + +Correctness basis for the exact source commit remains Phase67: default and +`LLAMA_BF16_CUBLAS_F32_OUT=1` both produced MoE md5 `8cb0ce23`, dense md5 +`5951a5b4`, and `MUL_MAT 1146/1146`. 
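
The `change` column in these A/B tables is the relative difference of the opt-in arm against the default arm. A small sketch, using the Phase67 MoE prefill rows as input; the function name is hypothetical.

```python
def pct_change(default, opt_in):
    """Relative change of opt_in versus default, in percent."""
    if default == 0:
        raise ValueError("default must be non-zero")
    return (opt_in / default - 1.0) * 100.0


# (default S_PP, opt-in S_PP) per npp, copied from the Phase67 table.
phase67_rows = {512: (2347.41, 2402.34), 2048: (2440.18, 2456.54)}
for npp, (default, opt_in) in phase67_rows.items():
    print(f"npp={npp}: {pct_change(default, opt_in):+.2f}%")
```

The sign convention is the same for every metric, so the reading depends on the metric: a positive change is better for throughput rows, while a negative change is better for latency rows such as `ttft_mean_ms` and `wall_s`.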
+ +Dense prefill stayed positive but tiny: + +| npp | default S_PP | opt-in S_PP | change | +|-----|-------------:|------------:|-------:| +| `512` | `973.13` | `975.52` | `+0.25%` | +| `2048` | `1019.88` | `1021.39` | `+0.15%` | + +MoE serving A/B at `N=128`, prompt `128`, generation `128`, `--parallel 128`: + +| metric | default | opt-in | change | +|--------|--------:|-------:|-------:| +| `agg_tps` | `409.8` | `415.0` | `+1.27%` | +| `decode_agg_tps` | `615.3` | `627.2` | `+1.93%` | +| `prefill_tps` | `1630.2` | `1648.0` | `+1.09%` | +| `ttft_mean_ms` | `8574.7` | `8085.9` | `-5.70%` | +| `wall_s` | `39.978` | `39.480` | `-1.25%` | + +Decision: carry the shortcut as a default-off opt-in candidate. It is no longer +just a prefill-only win, but Phase68 is not enough to default it on. Any future +default-on proposal must mirror the fork commit into the LocalAI patch series +and rerun a broader current serving snapshot with pre/post md5 and op gates. + +## 14. PHASE69 RESULT: PATCH SERIES MIRROR READINESS + +Phase69 checked the patch-series state without pushing and without editing +generated patch files. Plan: +`docs/superpowers/plans/2026-07-01-patch-series-mirror-readiness-phase69.md`. + +Current committed LocalAI patches still match the Phase37 fork tip: + +```text +base=0ed235ea2c17a19fc8238668653946721ed136fd +applied_tree=dedb1182910eafe9f6875588dc8285bfb544cce5 +patch_tip_tree=dedb1182910eafe9f6875588dc8285bfb544cce5 +fork_head_tree=fcf5720b659c5e1e2b487ccf3c8f7289bb12b9c4 +match_patch_tip=yes +match_fork_head=no +patch_count=54 +``` + +Dry-run export from `2d590d770..ea0875d14` produced ten additive source-only +patches, projected as `0064..0073`. 
Applying current `0001..0063` plus temp +`0064..0073` onto the pin exactly reconstructed current fork HEAD: + +```text +applied_plus_missing_tree=fcf5720b659c5e1e2b487ccf3c8f7289bb12b9c4 +fork_head_tree=fcf5720b659c5e1e2b487ccf3c8f7289bb12b9c4 +match_fork_head=yes +current_patch_count=54 +missing_patch_count=10 +projected_patch_count=64 +``` + +Projected patch tail: + +- `0064` serving admission trace (`c6cb8460e`) +- `0065` admission histograms (`bd7b2e952`) +- `0066..0068` TTFT prefill-first scheduler knobs (`8a97629a4`, + `3b6ab5fa8`, `8759213e3`) +- `0069..0070` W4A16 direct-activation policy/stub (`41be3da5b`, + `7967ad47f`) +- `0071` layout trace (`fa944bb5f`) +- `0072` quant trace (`afc2c7030`) +- `0073` BF16 cuBLAS F32 output (`ea0875d14`) + +Decision: mirror regeneration is technically ready but not executed. The local +fork is `26` commits ahead of `fork/localai-paged`, and the fork-first policy +requires pushing before regenerating the LocalAI series. Do not push without +explicit approval. After approval, push the fork, regenerate `0064..0073`, rerun +the same tree-hash check, and then run the broader serving gates before any +default-on BF16 policy change. + +## 15. PHASE70 RESULT: BF16 F32 OUTPUT BROADER SERVING + +Phase70 broadened the Phase68 serving evidence without source changes. Plan: +`docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md`. +Benchmark ledger: +`backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md`. +DGX artifact: +`/home/mudler/bench/phase70_bf16_broader_serving/20260701_151500`. + +Gates stayed green. Default pre/post gates matched MoE md5 `8cb0ce23`, dense +md5 `5951a5b4`, `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Opt-in pre/post +gates matched MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, and `MUL_MAT +1146/1146`. + +Serving shape: MoE `NPL=8 32 128`, prompt `128`, generation `64`, +`PARALLEL=128`. 
+ +| n | opt/default agg | opt/default decode | opt/default TTFT | default decode/vLLM | opt decode/vLLM | +|---:|----------------:|-------------------:|-----------------:|--------------------:|----------------:| +| `8` | `0.8896` | `0.8998` | `1.1247` | `0.8100` | `0.7289` | +| `32` | `0.9912` | `0.9974` | `1.0320` | `0.6882` | `0.6864` | +| `128` | `1.0071` | `0.9882` | `0.9852` | `0.6921` | `0.6839` | + +Decision: reject default-on for `LLAMA_BF16_CUBLAS_F32_OUT=1`. The shortcut is +correctness-clean, but it materially regressed low-concurrency serving and +slightly widened the vLLM decode gap at `n=32` and `n=128`. Keep it +default-off only and move the next parity effort to a different lever. + +## 16. PHASE71 RESULT: GDN TENSOR-CORE REVALIDATION + +Phase71 challenged the stale GDN planning docs before starting more source work. +Plan: +`docs/superpowers/plans/2026-07-01-gdn-tc-revalidation-phase71.md`. +Benchmark ledger: +`backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md`. +DGX artifact: +`/home/mudler/bench/phase71_gdn_tc_revalidation/20260701_153425`. + +Source under test stayed at DGX mirror commit +`14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`. No llama.cpp source was +changed. + +Canonical gates matched for all four GDN modes: MoE md5 `8cb0ce23`, dense md5 +`5951a5b4`, and `GATED_DELTA_NET 46/46`. Default also passed `MUL_MAT +1146/1146` and `MUL_MAT_ID 806/806`. + +MoE prefill, `PP=512,2048`, `TG=4`, `B=32`, `CTX=131072`: + +| arm | npp512 S_PP | npp2048 S_PP | +|-----|------------:|-------------:| +| default | `2313.57` | `2422.88` | +| sequential-disabled (`GDN_CHUNK_MIN=2147483647`) | `2198.28` | `2361.22` | +| serial-chunked (`GDN_TC=0 GDN_CHUNK_MIN=64`) | `1787.49` | `1699.77` | +| forced M5 (`GDN_TC=4 GDN_CHUNK_MIN=64`) | `2323.18` | `2420.52` | + +Decision: keep shipped GDN M5 default behavior. 
It still beats +sequential-disabled by `+5.24%`/`+2.61%`, beats serial-chunked by +`+29.43%`/`+42.54%`, and forced M5 is within noise of the current default. Do +not reopen smaller GDN C32/QS/global-Ai32/kernel-reorder work on GB10. + +Post-Phase71 do-not-reopen list for GB10: + +- Smaller W4A16/MoE GEMM body, metadata, direct-activation, or quant/gather + shortcuts. +- GDN C32 slab, QS-early, Global-Ai32, or another low-conflict M5 reorder. +- BF16 cuBLAS F32 output as a default-on policy. + +The only GDN work that should be reconsidered is a larger FLA/CuteDSL-class +blocked-solve implementation or a hardware pivot where the GB10 constraints no +longer apply. + +## 17. PHASE72 RESULT: TTFT MIN32 BROADER SERVING + +Phase72 broadened the Phase59 min32 scheduler result to the same serving shape +used by Phase70. Plan: +`docs/superpowers/plans/2026-07-01-ttft-min32-serving-phase72.md`. +Benchmark ledger: +`backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md`. +DGX artifact: +`/home/mudler/bench/phase72_ttft_min32_serving/20260701_160730`. + +Source under test stayed at DGX mirror commit +`14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`. No llama.cpp source was +changed. + +Gates stayed green. Pre default matched MoE md5 `8cb0ce23`, dense md5 +`5951a5b4`, `MUL_MAT 1146/1146`, and `MUL_MAT_ID 806/806`. Pre/post min32 and +post default md5 gates also matched MoE `8cb0ce23` and dense `5951a5b4`. + +Serving shape: MoE `NPL=8 32 128`, prompt `128`, generation `64`, +`PARALLEL=128`. 
+ +| n | min32/default agg | min32/default decode | min32/default TTFT | default decode/vLLM | min32 decode/vLLM | +|---:|------------------:|---------------------:|-------------------:|--------------------:|------------------:| +| `8` | `0.9302` | `0.9442` | `1.0379` | `0.7561` | `0.7140` | +| `32` | `0.9414` | `0.9570` | `1.0977` | `0.7158` | `0.6850` | +| `128` | `0.9699` | `0.9775` | `1.0300` | `0.6935` | `0.6779` | + +Decision: keep `LLAMA_TTFT_PREFILL_FIRST=1` plus +`LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` opt-in only. It regressed aggregate, +decode, TTFT, and wall time at every tested concurrency in the broader shape, +and widened the vLLM decode gap. Do not default this scheduler policy on GB10. + +## 18. PHASE73 RESULT: DATACENTER BLACKWELL RERUN READINESS + +Phase73 is a no-new-benchmark decision/spec phase. Plan: +`docs/superpowers/plans/2026-07-01-datacenter-blackwell-rerun-readiness-phase73.md`. +Benchmark ledger: +`backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md`. + +No GPU benchmark was run and no llama.cpp source was changed. Source baseline +remains DGX mirror commit `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output`. + +Decision: + +- Do not start more GB10 grouped-MMQ/W4A16 source work. Phase61 direct-A was + the last structurally distinct W4A16 shortcut and failed its keep gate; Phase66 + quantize plus gather was only `5.10%`, below the source-funding threshold. +- Do not start GDN backend source work until a standalone C=64 blocked-solve PoC + proves timing and numerical viability. Phase71 kept M5 as shipped; the + remaining GDN gap is a larger FLA/CuteDSL-class blocked-solve/register-state + implementation, not another C32/QS/global-Ai/local reorder. +- The next parity evidence should come from datacenter Blackwell hardware with + the existing same-session serving harness plus graph-node decode profiles. + +B200 rerun checklist: + +1. 
Build and verify the llama.cpp paged binary on B200 or equivalent + datacenter Blackwell hardware with the correct CUDA architecture/settings. +2. Install and verify vLLM `0.23.0+` with the intended Blackwell backend stack. +3. Confirm both model forms exist: `q36-35b-a3b-nvfp4.gguf` and + `q36-35b-a3b-nvfp4-vllm`. +4. Run `paged-current-serving-snapshot.sh` with `NPL="8 32 128"`, `PTOK=128`, + `GEN=64`, `PARALLEL=128`, `CTX=131072`, and B200-specific + `VLLM_GPU_MEMORY_UTILIZATION`, `VLLM_MAX_NUM_SEQS`, and + `VLLM_TENSOR_PARALLEL_SIZE`. +5. Before interpreting the artifact, require `hardware.txt` to say + `hardware_class=datacenter_blackwell`, `gate_summary.tsv` to be green, + pre/post MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT` and + `MUL_MAT_ID` op gates green, and `summary.tsv` rows for both paged and vLLM. +6. Run decode/profile reruns with `nsys --cuda-graph-trace=node` and inspect + whether vLLM is using native FP4/CUTLASS/FlashInfer rather than the GB10 + Marlin fallback. + +Phase74 standalone GDN source-work gate result: + +```sh +nvcc -O3 -arch=sm_121a \ + ~/scratch_tc_gdn_poc/gdn_blocked_solve_bench.cu \ + -o ~/scratch_tc_gdn_poc/gdn_blocked_solve_bench + +~/scratch_tc_gdn_poc/gdn_blocked_solve_bench \ + --c 64 --dk 128 --dv 128 \ + --iters 1000 \ + --precision tf32,offdiag3x,apply3x \ + --oracle f64 \ + --dump-json ~/bench/phase74_gdn_blocked_solve_poc/20260701_143711/phase74_gdn_blocked_solve_poc.json +``` + +Artifact: +`/home/mudler/bench/phase74_gdn_blocked_solve_poc/20260701_143711`. 
+ +The standalone C=64 shared-memory explicit inverse-plus-apply scaffold did not +fund backend source work: + +- weak decay: direct solve/apply `3.263936 ms`; inverse-plus-apply + `5.493515 ms`; inverse/direct speed `0.5941x`; inverse NMSE `2.755e-15`; +- mixed decay: direct solve/apply `3.275959 ms`; inverse-plus-apply + `5.527584 ms`; inverse/direct speed `0.5927x`; inverse NMSE `7.541e-16`; +- shared memory was already near the GB10 cap: direct `81920` bytes, + inverse-plus-apply `98304` bytes, with `99 KB` opt-in available. + +Decision: do not touch `ggml/src/ggml-cuda/gated_delta_net.cu` for this C=64 +inverse scaffold on GB10. A future GDN source-work gate must be a substantially +different tensor-core blocked-solve/register-state design that shows a material +timing win before backend changes. + +Phase75 follow-up audit: + +- llama.cpp already ships the M5 tensor-core GDN path default-on under paged KV: + `KK/QK`, `KS/QS`, `P*U`, explicit `T=A^-1`, `U=T*RHS`, and + `Kc^T*DU` state carry are covered in the current `C=16` GB10 path. +- vLLM has a distinct one-token recurrent decode path that updates state + directly and a packed decode path that avoids Q/K/V materialization copies, + but this is not source-funded in llama.cpp without a fresh profile: prior + parity evidence showed llama.cpp GDN decode already faster than vLLM and + decode serving dominated by host/MoE synchronization. +- vLLM's CuTeDSL GDN prefill path is useful reference material for datacenter + Blackwell, but depends on SM10x/CUDA-13 features such as TMA/tcgen05/CUTLASS + DSL and should not be treated as a portable GB10 patch base until the local + toolchain proves support. + +Phase76 current-stack GB10 graph-node profile: + +- Artifact: + `/home/mudler/bench/phase76_current_moe_profile/20260701_145116`. +- Shape: MoE `q36-35b-a3b-nvfp4`, `n=128`, `PTOK=128`, `GEN=64`, + `PARALLEL=128`, `CTX=131072`, production defaults. 
+- Pre/post gates were green: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense + md5 `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT 1146/1146`, + `MUL_MAT_ID 806/806`. +- Serving under graph-node profiling: aggregate `204.1 t/s`, decode aggregate + `320.7 t/s`, prefill `1490.1 t/s`, TTFT mean `8365.1 ms`, wall `40.146 s`. +- Bucket result: GDN was the largest macro bucket, `6669.16 ms` (`32.88%`), + ahead of MoE/FFN-GEMM `6264.88 ms` (`30.88%`) and BF16 projections + `2772.38 ms` (`13.67%`). `gdn_core` alone was `5876.94 ms` (`28.97%`). + +This supersedes the Phase75 "datacenter only unless fresh profile" wording: +Phase76 is that fresh profile. It does **not** justify an immediate backend +patch because it is llama-only and graph-node tracing depresses absolute +throughput, but it does fund one narrow GB10 follow-up before waiting for B200: +prove whether vLLM's direct recurrent/packed decode idea can reduce the current +`gdn_core` bucket. + +Current next gate: + +1. Keep the B200/B100/GB200 Phase72 same-session rerun as the hardware-pivot + gate when datacenter Blackwell is available. +2. In parallel on GB10, run a Phase77 GDN decode proof with pre/post md5 and op + gates. Accept only if it materially reduces the Phase76 `gdn_core` bucket and + does not regress serving throughput or canonical output md5. +3. Do not merge or default-on any `gated_delta_net.cu` change from this evidence + alone; Phase76 is a profile gate, not a source patch gate. + +Phase77 decode-only profile result: + +- Artifact: + `/home/mudler/bench/phase77_moe_decode_only_profile/20260701_150134`. +- Shape: MoE `q36-35b-a3b-nvfp4`, `N=128`, long-running `/completion` + requests, `N_PREDICT=2048`, capture after active decode. +- Capture window: active slots `128`; median decoded depth `67` at start and + `89` mid-capture. +- Pre/post gates were green: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense + md5 `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT 1146/1146`, + `MUL_MAT_ID 806/806`. 
+- Bucket result: GDN `1489.71 ms` (`41.20%`) and MoE/FFN-GEMM `1400.77 ms` + (`38.74%`). Fine bucket `gdn_core` was `1408.33 ms` (`38.95%`), slightly + larger than `mmq_nvfp4` at `1383.50 ms` (`38.26%`). + +Phase77 supersedes the Phase75 "no GB10 GDN source work" stance for decode +only. Do **not** reopen the failed C=64 prefill inverse scaffold. The funded +GB10 source path is now a narrow, default-off GDN decode A/B or standalone PoC +based on vLLM's direct recurrent/packed decode structure. The next patch must +prove a material reduction in the Phase77 `gdn_core` bucket, keep canonical md5 +and op gates green, and avoid serving/decode throughput regression under the +same decode-only capture shape before it can be considered for merge or default. + +Phase78 launch-shape sweep: + +- Baseline: Phase77 default launch shape (`GDN_NW=16 GDN_CPW=8`) had + `gdn_core 1408.33 ms` (`38.95%`) in the decode-only window. +- `GDN_NW=8 GDN_CPW=8` artifact: + `/home/mudler/bench/phase78_gdn_launch_sweep/nw8_cpw8_20260701_150654`. + Gates were green, but `gdn_core` worsened to `1443.55 ms` (`39.68%`). +- `GDN_NW=16 GDN_CPW=4` artifact: + `/home/mudler/bench/phase78_gdn_launch_sweep/nw16_cpw4_20260701_150954`. + Rejected before profiling: `MUL_MAT_ID` failed `805/806`. + +Decision: keep default `GDN_NW=16 GDN_CPW=8`. Do not retry existing +`GDN_NW`/`GDN_CPW` launch-shape retunes unless a new profile gives a specific +reason. The next GB10 source-funded work must be structural, default-off, and +measured against the Phase77 decode-only `gdn_core` bucket. + +Phase79 BV32 decode source A/B: + +- Artifact root: + `/home/mudler/bench/phase79_gdn_decode_bv32_ab/20260701_152530`. +- Candidate source tree: + `/home/mudler/llama-phase79-gdn-source`. +- Candidate patch: one-file default-off CUDA decode-only kernel in + `ggml/src/ggml-cuda/gated_delta_net.cu`, enabled by `GDN_DECODE_BV32=1`. + Scope was `S_v=128`, one-token decode, scalar gate, final-state write-back. 
+- Candidate build completed for `llama-completion`, `llama-batched-bench`, + `test-backend-ops`, and `llama-server`. +- Safety gates were green for the candidate default and opt-in paths: + MoE md5 `8cb0ce23`, dense md5 `5951a5b4`, `MUL_MAT 1146/1146`, + `MUL_MAT_ID 806/806`. +- A/B baseline post-gate initially failed `MUL_MAT 1145/1146` on a q4_1 case + after profiling, but immediate retry in `A_baseline/gate_post_retry` was + green; treat the first failure as a gate hiccup, not as accepted evidence. + +Result: + +| arm | env | GDN ms | `gdn_core` ms | `gdn_core` launches | `gdn_core`/launch | +|-----|-----|-------:|--------------:|--------------------:|------------------:| +| baseline | none | `1493.14` | `1411.46` | `600` | `2.352 ms` | +| BV32 | `GDN_DECODE_BV32=1` | `1502.89` | `1426.17` | `570` | `2.502 ms` | + +Decision: reject the BV32 decode topology. It passed md5/op gates but worsened +normalized `gdn_core` by about `6.4%` per launch and increased the GDN macro +bucket. Do not carry this source patch forward. The next GDN source hypothesis +should target recurrent-state precision/traffic or another structural delta +from vLLM; reduced-precision recurrent state is promising but invasive and needs +a separate scope. + +Phase80 identity-ids shortcut source A/B: + +- Artifact root: + `/home/mudler/bench/phase80_gdn_identity_ids_ab/20260701_153927`. +- Candidate source tree: + `/home/mudler/llama-phase80-gdn-identity-source`. +- Candidate patch: one-file default-off shortcut in + `ggml/src/ggml-cuda/gated_delta_net.cu`, enabled by + `GDN_ASSUME_IDENTITY_IDS=1`. It skips the GDN scratch gather for one-token + final-state decode by assuming `ids[s] == rs_head+s` and reading from + `state_dst` directly. +- Candidate build completed for `llama-completion`, `llama-batched-bench`, + `test-backend-ops`, and `llama-server`. +- Baseline and candidate pre/post gates were green: MoE md5 `8cb0ce23`, dense + md5 `5951a5b4`, `MUL_MAT 1146/1146`, `MUL_MAT_ID 806/806`. 
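
The identity assumption behind `GDN_ASSUME_IDENTITY_IDS=1` can be stated as a small host-side predicate. The sketch below is illustrative only: `ids_are_identity` is a hypothetical helper, not a function in the candidate patch, and it models `ids` as a plain Python list. It shows why the shortcut is narrow: any gap or permutation in the slot ids would still require the normal scratch gather.

```python
def ids_are_identity(ids, rs_head):
    """Return True when ids[s] == rs_head + s for every sequence index s.

    Hypothetical helper for illustration; `ids` and `rs_head` mirror the
    names used in the Phase80 note, not an actual llama.cpp API.
    """
    return all(slot == rs_head + s for s, slot in enumerate(ids))


# Contiguous slots starting at rs_head: the gather could be skipped.
contiguous = ids_are_identity([4, 5, 6, 7], rs_head=4)

# A permuted slot order breaks the assumption, so the gather is still needed.
permuted = ids_are_identity([4, 6, 5, 7], rs_head=4)
```
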
+ +Result: + +| arm | env | GDN ms | `gdn_core` ms | `gdn_gather` ms | GDN macro launches | +|-----|-----|-------:|--------------:|----------------:|------------------:| +| baseline | none | `1493.57` | `1411.65` | `0.79` | `3600` | +| identity shortcut | `GDN_ASSUME_IDENTITY_IDS=1` | `1489.96` | `1409.28` | not present | `3000` | + +Decision: reject carry-forward/default. The shortcut safely removes the tiny +`gdn_gather` bucket in this shape, but `gdn_core` is unchanged and the identity +assumption is too narrow for a sub-millisecond capture-level win. Do not spend +more parity time on gather-only GDN shortcuts unless a future profile makes +gather material. The next serious GDN scope remains recurrent-state +precision/traffic. + +## Series trim (phases 110-140 review, 2026-07-02) + +The campaign's on-disk patches `0048-0063` were added without matching fork +commits (a fork-first policy violation). After a keep/drop review of the +phase 110-140 work, the series was trimmed to a single kept line plus the +gate harness, and re-mirrored to the fork: + +- KEEP - test sentinels (the MoE gate harness): `MOE_SWIGLU_DOWN`, + `MOE_SWIGLU_COMBINE`, `MUL_MAT_ID_RAGGED_MOE` (old `0051-0053`). +- KEEP - the MTP-draft correctness fix (old `0054`): forces target-side + sampler acceptance for MTP drafts (backend draft sampling can request + multiple output rows per sequence); the backend ships `-mtp` gallery models. +- KEEP - the Phase135 routed-FFN fused-quant line: whole-pattern MoE matcher + + routed-FFN executor hook (Phase120/121), the routed-FFN PoC scaffold + `moe-ffn.{cu,cuh}` (Phase132), and the fused SwiGLU-to-NVFP4-quant + raw down + MMQ (`ggml_cuda_mul_mat_q_moe_quantized` + local `ggml_cuda_mmq_ids_meta` + refactor, Phase135). All default-off, md5-clean opt-in, six + `mmq_moe_quantized_raw` markers with zero sorted launches on the sentinel. + +- DROP - W4A16 grouped-tile pack/tune/pad (old `0048-0050`): dead line, W4A16 + is ~1.5x slower than grouped-MMQ. 
+- DROP - speculative/trace/cublas-route/mmid-route/mul-mat-route traces + the + rejected small-M tile-policy knob (old `0055-0063`). +- DROP - all other campaign keep-markers not needed by Phase135: GPU-sort + (Phase110), W4A16-direct-A (Phase112), boundary trace/timing (Phase117), + Phase133 sorted-F32 down, Phase134 fused-SWIGLU-only, Phase138 + finalize/weighted-combine. The final fork tree carries zero of these markers. + +Fork branch `mudler/llama.cpp:localai-paged` re-mirrored on top of +`51168c5ee` (LocalAI series `0001-0047`): + +- `fd920cf8a` test(paged): cover MoE swiglu down chain +- `a85c1e098` test(paged): cover MoE weighted combine chain +- `2fed6aacf` test(paged): cover ragged MoE dispatch +- `f1d976f06` fix(speculative): disable backend sampling for MTP drafts +- `1edddc8fe` feat(paged): whole-pattern MoE matcher + routed-FFN fused + NVFP4-quant down MMQ + +New fork HEAD `1edddc8fe`, tree `097c862c`. The rejected/neutral levers of +the 110-140 campaign are recorded above and in the per-phase bench artifacts. + +## P1 bf16-native execution pass - LANDED (2026-07-02) + +First phase of the `EXECUTION_REARCH_SCOPE.md` additive program to land. +`LLAMA_BF16_STREAM` (default-off) runs a bf16-resident residual-segment +executor for the q36 MoE decision model's projection boundaries, deleting the +per-op `f32->bf16` convert the stock cuBLAS-bf16 path pays at the projection +`src1`. See the "P1 RESULT" subsection in `EXECUTION_REARCH_SCOPE.md` for the +full record; summary and provenance: + +- **Verdict: GO / SHIP.** P0 kill-gate GO, P1 build-out and independent verify + all correctness gates green, prefill positive-and-reproducible, KL-improving. +- **Key reframe:** q36 GDN/attention projections (attn_qkv/gate, + ssm_alpha/beta/out) are **BF16 weights, not NVFP4** - only the MoE experts + (`ffn_*_exps`) are NVFP4. 
The convert tax lives at the BF16 cuBLAS projection + boundary (`op_mul_mat_cublas` src0==BF16), so bf16-stream is a **MoE-model + lever**; the dense model quantizes those projections to NVFP4 and engages + nothing (stays bit-identical). +- **Engagement:** P0 = 960 gate_norm->ssm_out segments/prefill; full build-out = + 2240 (960 single-consumer 0044 ssm_out + 1280 multi-consumer plain-rms_norm -> + {attn q/k/v, GDN in_proj}). +- **Prefill (MoE @512 B=32):** +1.99% (2361.67 vs 2315.52 t/s, all 5 bf16 > all 5 + ctrl; reproduced +1.89%); @2048 +0.95%; dense no-op (-0.09%). Recovered ~8.44 + us/tok @512. At the noise floor -> classified neutral but reproducible; no + regression. +- **KL (MoE):** bf16 KLD 0.136042 vs control 0.136563 => delta -0.00052 (bf16 + slightly better via the `LLAMA_BF16_CUBLAS_F32_OUT` plank keeping the full f32 + GEMM result); same-top-p 84.461% vs 83.725% (>= 84% baseline). Dense: 0 + engagements => bit-identical. +- **Correctness:** default md5 canonical both models (MoE `8cb0ce23`, dense + `5951a5b4`) present-but-off and env-on (small-M bails); `test-backend-ops` + MUL_MAT 1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46, MOE_SWIGLU_DOWN + 7/7, MUL_MAT_ID_RAGGED_MOE 6/6, BF16_STREAM_SEGMENT 4/4. +- **Honest scope:** targets prefill bucket 3 (the ~4.8%-of-wall convert/glue + tax) only, and owns the projection-boundary portion of it (~40% end-to-end) - + not the GDN-scan (bucket 1, P5) or GEMM-tiling (bucket 2, P2/P3) buckets. Well + below the scope's optimistic ~45 us/tok target by construction. Next increment + = own the bf16->f32 dst direction + the remaining attn_norm-fed projection + src1 converts. +- **Deferred (blocked by an external imatrix job contending the GB10, NOT a + failed gate):** the nsys graph-node bucket table, decode S_TG @npl128, and the + Phase130 serving A/B need a clean idle-GPU re-run. 
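
The correctness gates quoted throughout these notes reduce to two checks: each output md5 must start with its canonical prefix, and each `test-backend-ops` count must read `passed/total` with both numbers equal. A minimal sketch, assuming the counts arrive as strings such as `1146/1146`; the function names and the dictionary layout are hypothetical and do not describe the actual gate scripts.

```python
# Canonical short md5 prefixes as quoted in this document.
CANONICAL_MD5 = {"moe": "8cb0ce23", "dense": "5951a5b4"}


def md5_gate_ok(observed):
    """True when every observed md5 starts with its canonical prefix."""
    return all(
        observed.get(model, "").startswith(prefix)
        for model, prefix in CANONICAL_MD5.items()
    )


def op_gate_ok(count):
    """True when a 'passed/total' string reports every case passing."""
    passed, total = (int(part) for part in count.split("/"))
    return total > 0 and passed == total


def gates_green(observed_md5, op_counts):
    """Combine the md5 gate with all op-count gates."""
    return md5_gate_ok(observed_md5) and all(
        op_gate_ok(count) for count in op_counts.values()
    )


green = gates_green(
    {
        "moe": "8cb0ce23777bf55f92f63d0292c756b0",
        "dense": "5951a5b4d624ce891e22ab5fca9bc439",
    },
    {"MUL_MAT": "1146/1146", "MUL_MAT_ID": "806/806", "GATED_DELTA_NET": "46/46"},
)

# A single failing case, as in the Phase78 `GDN_NW=16 GDN_CPW=4` arm,
# turns the whole gate red.
red = gates_green(
    {"moe": "8cb0ce23", "dense": "5951a5b4"},
    {"MUL_MAT": "1146/1146", "MUL_MAT_ID": "805/806"},
)
```
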
+ +Fork branch `mudler/llama.cpp:localai-paged` fast-forwarded on top of +`1edddc8fe` (LocalAI series `0001-0052`) with three P1 commits: + +- `1271488fc` feat(paged): P1 bf16-stream residual-segment executor + + norm-bf16 kernels (+ the re-introduced `LLAMA_BF16_CUBLAS_F32_OUT` plank) +- `91373e1b9` feat(paged): P1 bf16-stream bf16 residual-add + rope op-variants +- `653bb2f3d` test(paged): P1 bf16-stream BF16_STREAM_SEGMENT sentinel + +New fork HEAD `653bb2f3d`, tree `6cf1523047`. LocalAI series regenerated +additively as `0053-0055` (46 patches total, `0001-0052` untouched); kill-gate +at pin `0ed235ea` applied all patches and staged tree `6cf1523047` byte-for-byte +== fork HEAD tree. Nothing pushed. Artifacts: +`~/bench/p1_bf16_stream/killgate_20260702_135544` and `.../verify_20260702_161229` +on the DGX; fork topic branch `p1-bf16-stream` retained for forensics. + +## P2 expert-major fused MoE region - NO-GO (recorded 2026-07-02) + +Second phase of the `EXECUTION_REARCH_SCOPE.md` additive program. The P0 +kill-gate for `LLAMA_MOE_REGION_EXECUTOR` (default-off) returned **NO-GO on two +independent signals**, so per the phased contract nothing was built beyond P0 and +nothing landed. See the "P2 RESULT" subsection in `EXECUTION_REARCH_SCOPE.md` for +the full record; summary and provenance: + +- **Verdict: NO-GO / DO-NOT-SHIP.** The expected-recovery line (~40 of the +56.5 + bucket-2 prefill tax + ~11 ms decode residual) was **not** delivered - the + layout-only expert-major region is flat on its own sentinel and engages 0x on the + decision model. +- **(1) Primary GO metric flat.** Kill-gate needed the n=257 batched-large-M + `MOE_SWIGLU_DOWN` rows to beat the grouped-MMQ control by > 5%. Measured (5x + medians): control 1021.61 us, region 1022.15 us => **-0.05%** (marginally + slower); n=128 -0.34%; `MUL_MAT_ID_RAGGED_MOE` (region never engages) n=257 + +0.48% / n=128 +0.28% (noise). All four inside the 5-sample spread. 
This + reproduces the six prior one-boundary transplants (phases 113/114/122/123/125/ + 127) - the null hypothesis P2 had to beat. A compact expert-major *layout* + a + single sort, with both GEMMs still ragged grouped-MMQ, does not change the + ragged-tile tiling that owns the +56.5 tax; that needs P3's Marlin + persistent-CTA, not a P2 layout swap. (Sentinel caveat: `eval_perf` duplicates + only the down node ~n_runs times, so the region invocation is ~1/n_runs of the + signal => under-sensitive; reported as the requested metric, corroborated by + signal 2.) +- **(2) Decisive structural blocker (prerequisite gap).** `q36-35b-a3b-nvfp4.gguf` + ships **separate** `ffn_gate_exps` + `ffn_up_exps` (+ per-tensor + `.scale`/`.input_scale`), NOT a merged `ffn_gate_up_exps` (GGUF tensor-name scan). + `llama-graph.cpp` `build_moe_ffn` takes the separate-gate/up + `ggml_swiglu_split` + branch, so the whole-pattern matcher's merged + `gate_up(MUL_MAT_ID)->VIEW->VIEW->SWIGLU->down` shape is **absent**. The matcher, + the region executor, AND the pre-existing POC/fused-quant all engage **0x** on + q36 in prefill and decode. The region only engages on the synthetic merged-shape + test sentinel. Even a positive sentinel could not translate to q36 without first + rebuilding the seam for the separate/scaled/swiglu-split shape. +- **KL: vacuously identical.** control and region KLD both 0.136563, same-top-p + both 83.725% => delta 0.000000 (byte-identical only because the region engages 0x + on q36; not an executor KL-neutrality claim). +- **S_PP @512 (5x):** control 2320.62 vs region 2316.70 t/s = -0.17% (flat, + region == control at 0 engagement; stdev 0.24% => capture-stable, no re-capture + thrash). +- **Correctness GREEN, both arms** (default AND env-on): MUL_MAT 1146/1146, + MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46, MOE_SWIGLU_DOWN 8/8, + MUL_MAT_ID_RAGGED_MOE 6/6, BF16_STREAM_SEGMENT 4/4. 
Default md5 canonical both + models (MoE `8cb0ce23`, dense `5951a5b4`); env-on canonical (small-M bails). +- **Prerequisite handoff (gates P2 AND P3).** Before any MoE-region lever can + engage on q36, re-scope and rebuild the seam (whole-pattern matcher + + POC/fused-quant + region executor) for q36's separate `ffn_gate_exps`/ + `ffn_up_exps` + per-tensor `.scale` + `ggml_swiglu_split` FFN shape. Then + re-evaluate a *fused two-GEMM* region (not a layout swap), per the scope's null + hypothesis that the win exists only as the complete fused kernel that never + materialises the intermediates. + +Implementation (correct, committed, NOT pushed, ~407 LOC / 6 files): +`moe-ffn.cu` `ggml_cuda_moe_region_executor` (one route-sort ids_meta; gate_up +grouped NVFP4 MMQ writes a compact expert-major buffer via iota ids_dst, token-order +intermediate never materialised; `moe_swiglu_nvfp4_quant_compact_kernel` reads by +route-slot; down MMQ unpermutes) + strict all-consumers guard +`ggml_cuda_moe_region_consumers_ok` + `LLAMA_MOE_REGION_TRACE`. + +Fork `localai-paged` HEAD **untouched at `653bb2f3d`**; LocalAI series stays at 46 +patches (`0001-0055`). Topic branch `mudler/llama.cpp:p2-moe-region` retained for +forensics at `2d87564ddfa26f6c275dad0e1f0e3d8d5413e337` (base `653bb2f3d`, NOT +pushed). Artifacts on the DGX: `~/bench/p2_moe_region/focused_20260702_172644/` +(sentinels 5x, correctness OFF+ON, md5, S_PP@512 5x, KL) + `RESULTS.txt`, +`.../killgate_20260702_171826/` (engagement proof, 0x on both models), +`.../build_20260702_145928/` (build logs). + +## P3 W4A16 direct-A Marlin GEMM (forensics retry) - NO-GO at the perf kill-gate; the GEMM-tiling prefill bucket is a CONFIRMED FP4-MMQ-OPTIMAL FLOOR (recorded 2026-07-02) + +Third phase of the `EXECUTION_REARCH_SCOPE.md` additive program, and the **last big prefill +lever**. 
The trimmed direct-A W4A16 prototype (`7967ad47f`) was **re-created per the +section-3 contract** on top of the in-tree grouped 0035 Marlin kernel, engaged behind +`LLAMA_W4A16_DIRECT_A=1` + `LLAMA_W4A16_PREFILL_M>0` (default-off), and A/B'd against the +FP4-MMQ default at the P0 kill-gate. It lost by a wide margin, so `go=false` was the +kill-gate default, nothing was built beyond P0, and nothing landed. See the "P3 RESULT" +subsection in `EXECUTION_REARCH_SCOPE.md` for the full record; this closes the last +speculative prefill lever and completes the program's prefill verdict. Summary and +provenance: + +- **Verdict: NO-GO / DO-NOT-SHIP at the perf gate; the forensics retry is REFUTED.** The + retry hypothesis (section 2d) was that the prior 0035 -39% / -19.6% loss was ggml + *integration tax* (f32->bf16 converts + a separate act-quant/sort pass), not + kernel-intrinsic. P3 **genuinely removed that tax** (act-quant 18.92 -> ~0 us/tok on the + expert path; host expert-sort + src1-gather + separate cast pass eliminated) and direct-A + **still lost**. Removing the tax did not close the gap => the loss is **kernel-intrinsic + on GB10**, and **bucket 2 (GEMM tiling, +56.5 us/tok) is a confirmed FP4-MMQ-optimal + floor**, joining bucket 1 (GDN scan, P5-confirmed). +- **PERF GO GATE FAILED DECISIVELY.** GO required `S_PP(direct-A) >= FP4-MMQ + 5%` at + `M >= 1024` AND `KLD <= 0.137`. Measured (MoE `q36-35b-a3b-nvfp4`, killgate 3-iter + medians, `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 -ngl99 -fa on -ntg4 -npl32 -c73728`): + **npp512 1176.8 vs 2215.3 = -46.88%; npp1024 1201.1 vs 2309.7 = -48.00%; npp2048 1222.0 + vs 2400.2 = -49.09%** (direct-A stdev 0.07-0.56%, clears `max(2%, 3sigma)`). The + calibrated **grouped**-W4A16 (0035) null baseline reproduced **-43.96/-43.58/-44.72%**, + matching/exceeding the historical -39% / -19.6% - so the harness is unchanged and direct-A + is even slower than grouped (the in-kernel f32 A-gather pessimizes the kernel). 
Dense null + control **+0.05%** (MoE-only `mul_mat_id` hook; dense projections are plain `mul_mat`). +- **ROOT CAUSE (nsys `--cuda-graph-trace=node`, npp2048 graph-node buckets).** The mature + bf16 grouped-W4A16 expert GEMM = **323.90 us/tok = 1.97x** the FP4-MMQ int8 expert GEMM + (**164.6**) = exactly the **bf16 = half int8/FP4 tensor-core peak on sm_121**. Consumer + Blackwell has no bf16-peak headroom over FP4/int8. **Novel sub-finding:** fusing the + A-gather in-kernel (direct-A) is a **NET PESSIMIZATION** vs a cheap separate bf16 pre-cast + - it drove the kernel **323.90 -> 451.86 us/tok (+127.96)** while removing only ~63 of tax, + a GB10-specific **inversion of P5's no-round-trips heuristic** (an in-kernel f32 gather + doubles A traffic and halves occupancy; a full-occupancy bf16 pre-cast is cheaper on + LPDDR5x). Residual +30 us/tok dst-unsort `get_rows` the host-loop keeps (FP4-MMQ fuses + on-device) is real but ~1/10 of the ~2x kernel gap. +- **KL BAND GREEN and better than control.** direct-A **KLD 0.130260, same-top-p 85.172%** + (16-chunk) vs FP4-MMQ ctrl 0.136563 / 83.725% => in-band (<0.137, top-p >= 84) and + *better*. Correctness was never the issue; the bf16-dequant W4A16 path is + KL-benign-and-better, just slower. +- **Correctness GREEN, both arms** (default AND `LLAMA_W4A16_DIRECT_A=1`): default md5 + canonical both models (MoE `8cb0ce23`, dense `5951a5b4`), env-on also canonical both + (small-M bails to byte-identical default); `test-backend-ops` default MUL_MAT 1146/1146, + MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46, MOE_SWIGLU_DOWN 7/7, MUL_MAT_ID_RAGGED_MOE + 6/6, plus DIRECT_A-on MUL_MAT_ID 806/806. **Engagement PROVEN:** 7680 direct-A + engagements env-on (K=2048 N=512 gate/up expert GEMM), 0 in default. +- **Honest delta vs the ~40-50 of +56.5 expectation.** Combined P2+P3 targeted ~40-50 of + the bucket-2 tax. **Delivered: 0.** FP4-MMQ is optimal on GB10; the ceiling lifts only on + datacenter Blackwell (tcgen05 / CUTLASS grouped-FP4). 
Corroborated by + `VLLM_PARITY_LEVER_MAP.md:1100` (offline-repack + verbatim vLLM Marlin already rejected + -39% at the same bf16-peak ceiling) - which is **why the one-time host-side repack cache + was deliberately NOT built** (a repack changes layout, not mma dtype; it cannot move a + 1.97-2.74x bf16-peak floor). Documented decision, not an omission. + +Implementation (re-created per the section-3 contract; correct, committed, NOT pushed): +`w4a16-policy.h` (pure host-testable engage predicate: NVFP4 src0 + f32 src1/dst + +Blackwell + `LLAMA_W4A16_DIRECT_A=1` + `LLAMA_W4A16_PREFILL_M>0` + tokens>M + +`k%64==0 && n%128==0` + src1 row-contiguous) + `tests/test-cuda-w4a16-policy.cpp` (14/14 +host unit test); `w4a16-gemm.{cu,cuh}` direct-A kernel (reads src1 f32 directly via +`ids_to_sorted`, fuses f32->bf16 in the A-load, no `get_rows`/cast/intermediate, +dequant-once weight reuse) + host launcher; `ggml-cuda.cu` `mul_mat_id` hook. Two A-fusion +variants A/B'd: v1 cp.async f32-staging+smem-convert (57 KB smem, ~1201 t/s @npp1024, +committed best) and v2 synchronous low-smem gather+convert (17 KB, ~975 t/s, worse); both +< grouped < FP4-MMQ. + +Protocols honored: GPU lock held throughout and released; `LLAMA_MAX_BATCH_TOKENS` unset; +sm_121a; nsys `--cuda-graph-trace=node`; 3+ iter S_PP medians + sigma. Fork `localai-paged` +HEAD **untouched at `653bb2f3d`**; the LocalAI series **stays at 46 patches (`0001-0055`)**; +topic branches `p1-bf16-stream` / `p2-moe-region` / `p4-cbv2` / `p5-fla-gdn` left intact. +Topic branch `mudler/llama.cpp:p3-w4a16-direct` retained on the DGX fork at +`8eef7ba4335ffd2ed7babd5e5dae71fa1fe8f688` (base `653bb2f3d`, **NOT pushed, NOT landed**). 
+Artifacts on the DGX `~/bench/p3_w4a16_direct/`: `calib_20260702_232353/` (grouped-W4A16 vs +FP4-MMQ calibration baseline), `killgate_20260702_235119/` (S_PP A/B 3 shapes x 3-arm x +3-iter + dense null + engagement + md5 + test-backend-ops; RESULTS.txt), +`nsyskl_20260703_001212/` (`nsys --cuda-graph-trace=node` `prof_{default,da,gr}.nsys-rep` + +`kern_*.csv` buckets + 16-chunk KL `kl_{ctrl,da,gr}.log`; RESULTS.txt), `build_v1r_*.log`. + +## P4 token-granular continuous-batching scheduler (CBv2) - NO-GO at the perf kill-gate (recorded 2026-07-02) + +Fourth phase of the `EXECUTION_REARCH_SCOPE.md` additive program. The P0 kill-gate +subset for `LLAMA_CONTINUOUS_BATCH_V2` (default-off) was **implemented and +correctness-proven green**, but the kill-gate's stated GO criterion - a **> 20% +TTFT-under-load drop** with md5 green and serving-aggregate not regressed - was +**NOT demonstrated**, so per the phased contract `go=false` was the kill-gate +default, nothing was built beyond P0, and nothing landed. See the "P4 RESULT" +subsection in `EXECUTION_REARCH_SCOPE.md` for the full record; summary and +provenance: + +- **Verdict: NO-GO / DO-NOT-SHIP at the perf gate (scope-anticipated).** The P4 + section frames CBv2 on GB10 as a TTFT + fairness + architecture-enabler lever, + **not** a throughput lever (decode is GPU-compute-bound; the host-loop-dead + measurement is real), so a NO-GO on the TTFT perf gate is the expected outcome and + any throughput payoff is non-GB10 (out of scope). +- **FINAL MEASURED VERDICT (A/B completed autonomously after the forced report; + 60/60 raws, 5 reps/arm/shape; `dgx:~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md`): + NO-GO CONFIRMED, and stronger than flat: CBv2-at-this-granularity REGRESSES.** + TTFT-GO shapes: NONE. 
staggered N=32 TTFT p50 **+33.6% WORSE** (4559 -> 6091 ms, + clears noise), mean +31.4% worse; staggered N=128 TTFT p50 +15.5% / mean +17.9% + worse AND **aggregate/decode-agg -6.9% regressed beyond noise**; burst N=128 TTFT + +10-13% worse, agg -3.9%; N=8 shapes neutral; the one positive was burst N=32 + decode-agg +36.3% on a very noisy shape. ANALYSIS (do not re-litigate): fair-share + chunked prefill is processor-sharing and delays every near-uniform prompt's prefill + completion versus run-to-completion admission, so TTFT rises by construction; the + "TTFT scaling is scheduler-shaped" premise is PARTIALLY REFUTED for GB10 - patch + 0016's decode-first budget already captures the schedulable win, and vLLM's TTFT + advantage here is dominated by its 2.6-2.8x prefill compute. TTFT parity routes + through P3/P5 (prefill compute), not the scheduler. Fair-share may still pay on + mixed long/short-prompt workloads and non-GB10 (host-bound) silicon; out of scope. +- **Correctness gates all GREEN (DGX GB10, sm_121a), the substantive P0 result.** + Behind `LLAMA_CONTINUOUS_BATCH_V2=1` (default OFF, byte-identical off): + - **(a) canonical md5 GREEN both models, default-off AND cbv2-on:** paged-MoE + `8cb0ce23`, dense `5951a5b4`. + - **(c) `test-backend-ops` GREEN (zero-ggml side-effect proof):** MUL_MAT + 1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46. + - **(c) cursor-interleave PROVEN** (`LLAMA_CBV2_TRACE`, staggered N=20): steps + co-batch decode AND prefill tokens with per-slot cursors advancing across steps + (step=6: `n_decode_toks=5 n_prefill_toks=1535 n_seqs=20`, 15 partial cursors; + slot s112 144/523 -> 281 -> 418 -> 519 over steps 6-9 while decode runs); adaptive + fair-share cap tracks live load (410@5w, 171@12, 137@15); `dbucket==n_decode` => + no fixed pad-to-parallel. 
+ - **(b) determinism = CBv2 NEUTRAL / correctness-preserving.** The paged concurrent + greedy path is inherently non-deterministic run-to-run in the BASELINE too (a + benign near-tied-argmax / co-batch FP-reduction-order property, + `PAGED_BITEXACT_NOTE`), so the literal exact-match gate is unsatisfiable by any + scheduler (control fails it too). The discriminating test - does CBv2 diverge + from control more than control diverges from itself - PASSES across 8 configs + {dense,moe} x {degenerate,natural} x {gen8,gen64}: cross-arm divergence tracks + the within-arm baseline to +/-1-3 of 32. Single-sequence greedy is fully + deterministic (the md5 gate). +- **What would change the verdict (re-score path).** Read the finalized DGX + `~/bench/p4_cbv2/perf_20260702_194359/RESULTS.md` once the CANDIDATE arm completes + (`p4_agg.py` auto-writes medians+stdev with the `> 20%`-drop GO logic baked in). If + it shows a genuine `> 20%` staggered-TTFT drop clearing `max(2%, 3*stdev)` with md5 + green and aggregate not regressed, re-score `go=true` and trigger the full P4 + build-out: `SLOT_STATE_PREEMPTED` + release-KV-keep-prompt re-admit (paged + burst-reclaim 0024 + `paged-alloc.cpp` defrag), aging/starvation-freedom + a + constructed starvation test, preemption/aging unit tests, and a forced-preemption + byte-identical-resume determinism gate. Else this NO-GO stands. + +Implementation (kill-gate subset only; correct, committed, NOT pushed; server-side +only, ZERO `ggml/` files, ~68 LOC): `tools/server/server-context.cpp` thin +integration + a NEW pure unit-tested header `tools/server/server-admission-policy.h` +(namespace `cbv2`) + `server-admission-policy-test.cpp`. 
(1) Per-seq chunked-prefill +cursors with a load-adaptive fair-share cap `ceil(prefill_leftover/n_waiting)` +floored at `LLAMA_CBV2_CHUNK_MIN` (default 128, NOT `n_ubatch`, so a 512 prompt +actually chunks under load); CBv2 activates the shipped 0016 decode-first budget by +default (`T=n_batch`) and replaces 0016's fixed cap; cursor = +`slot.prompt.n_tokens()`. (2) Adaptive decode bucket policy (`LLAMA_CBV2_DECODE_PAD` +default 0 => `bucket==n_decode`, no padding per `DECODE_SERVING_SCOPE.md` +net-negative; policy computed+traced only, never fed to batch formation => +bit-exact-safe; row-emission is the deferred [Build phase]). Trace under +`LLAMA_CBV2_TRACE=1`. + +Series-numbering flag: P0 comments label `[paged 0056]` per the fork's next slot, +but the LocalAI worktree README is already ahead at `0056-0061` (MoE MMQ trace +series) - reconcile on landing (likely `0062`). + +Fork `localai-paged` HEAD **untouched at `653bb2f3d`**; LocalAI series stays at 46 +patches (`0001-0055`). Topic branch `mudler/llama.cpp:p4-cbv2` retained at +`ebb649335fe7686524a3630ee2fdffce44be6d52` (base `653bb2f3d`, NOT pushed). Artifacts +on the DGX `~/bench/p4_cbv2/`: `build_20260702_192141/` (build.log), +`gates_20260702_192632/` (SUMMARY.txt: md5 x4, test-backend-ops, cbv2_trace.txt, +determinism tsvs), `det2_20260702_193123/` + `det3_20260702_193649/` + +`det4_20260702_194040/` (determinism diff-matrix), `perf_20260702_194359/` +(raw_*.json + auto-written RESULTS.md). Environment: `LLAMA_KV_PAGED=1 +LLAMA_MOE_FORCE_GRAPHS=1`, `LLAMA_MAX_BATCH_TOKENS` unset, sm_121a, GPU lock held. + +## P5 FLA-faithful GDN prefill scan (blocked solve_tril port) - NO-GO at the perf kill-gate; the GDN prefill bucket is a CONFIRMED SHARED-HARDWARE FLOOR (recorded 2026-07-02) + +Fifth phase of the `EXECUTION_REARCH_SCOPE.md` additive program, and its **strictest +kill-gate**. 
The full six-kernel vLLM-FLA `chunk_gated_delta_rule_fwd` pipeline was +**ported to CUDA tf32 mma, per-kernel validated vs a host fp64 reference, integrated +behind `LLAMA_GDN_FLA_CHUNK=1` (default-off), and A/B'd in-backend** against the shipped +M5 f32 chunked scan. It lost decisively and by the wrong sign, so `go=false` was the +kill-gate default, nothing was built beyond P0, and nothing landed. See the "P5 RESULT" +subsection in `EXECUTION_REARCH_SCOPE.md` for the full record; this closes the last +speculative prefill lever in the program. Summary and provenance: + +- **Verdict: NO-GO / DO-NOT-SHIP at the perf gate (the scope-anticipated "expected + null").** The P5 section framed this as the strictest kill-gate given Phase74's + standalone blocked-inverse 0.59x. P5 delivered the one thing prior evidence lacked: + **the whole FLA pipeline in-backend with the register/smem-resident inter-chunk state + and the chunk loop in-kernel** - the form that "was never actually tested in-backend." + It was tested here and **settles the GDN prefill bucket (bucket 1, +59.2, the single + largest prefill lever) as a shared-hardware / memory-bandwidth floor on GB10.** +- **PERF GO GATE FAILED DECISIVELY.** GO required the in-pipeline blocked `solve_tril` + to beat M5 by **> 10% at npp2048**. Measured (nsys `--cuda-graph-trace=node`, MoE + `q36-35b-a3b-nvfp4`, per distinct token over 30 GDN layers): **npp2048 M5 56.31 vs FLA + 119.46 us/tok = FLA 2.12x SLOWER** (`gdn_delta_pct_2048 = -112.1`); **npp512 M5 51.23 + vs FLA 117.35 = 2.29x slower.** End-to-end **S_PP regressed MoE -13.33% @npp2048 / + -13.12% @npp512** (3-rep medians; wrong sign, no 3-sigma question). The shipped M5 + stays `gdn_core` at **56.31 us/tok = 64.82% of vLLM's FLA chunk-64 36.5 us/tok**; the + rejected FLA port was only **30.55% of vLLM** (36.5/119.46). 
Reproduces Phase74's + standalone 0.59x and extends bf16-C64 (-18.75%), now **confirmed in-backend.** +- **WHERE THE TIME WENT (the novel, valuable decomposition - why this NO-GO matters).** + Per-kernel nsys share of the FLA bucket: **blocked `solve_tril` only ~2.8% (55.6 ms)** - + the algorithm the phase was about is *cheap*. Dominated by **`fwd_h` 46.2% (903 ms) + + `fwd_o` 31.5% (617 ms)**: the inter-chunk state-recurrence GEMMs + the **per-chunk + h-state materialization to global LPDDR5x** that FLA's split-kernel structure forces + (`fwd_h` writes `h_pre` per chunk, `fwd_o` re-reads it). The fused M5 single kernel + keeps the 128x128 state **resident in smem, never materializes per-chunk h**, so it is + **2.1x faster on GB10's low-bandwidth memory.** Novel finding vs all prior evidence: + **the blocked solve is not the floor - the floor is the state-GEMM + h-materialization + region, which the FLA structure makes WORSE than M5.** The binding silicon property is + **LPDDR5x memory bandwidth** (per-chunk h round-trip), compounded by the **99 KB smem + cap** that forces the `fwd_h`/`fwd_o` split - not mma shapes or wave count. +- **Correctness / cap gates (recorded, not decisive):** + - **SMEM GATE PASSES** (all six kernels under the 99 KB opt-in cap at C=64; + `cudaOccupancyMaxActiveBlocksPerMultiprocessor`): `k_kkt` 48 KB / 2 blk, `k_solve` + 38 KB / 2 blk, `k_wu` 48 KB / 2 blk, `k_fwdh` 80 KB / 1 blk, `k_fwdo` 96 KB / 1 blk - + **max 96 KB < 99 KB.** The kernels fit; they are bandwidth-floored above M5. + - **KL BAND GREEN / in-band:** FLA `KLD 0.137028` vs control `0.136563` = **delta + +0.000465 < 0.01**; same-top-p **84.61% vs 83.73%** control (>= 84% baseline). + Per-kernel bring-up vs host fp64: **o NMSE 2.2e-7, final-state 1.2e-7** (before + integration, per the "do not debug six kernels blind" rule). 
+ - **DEFAULT PATH UNTOUCHED:** canonical md5 GREEN both models, **default-off AND + `LLAMA_GDN_FLA_CHUNK`-on** (paged-MoE `8cb0ce23`, dense `5951a5b4`; small-M greedy + bails to M5). `test-backend-ops GATED_DELTA_NET` **DEFAULT 46/46 OK.** Decode + untouched (`GDN_CHUNK_MIN` untouched; decode stays on the sequential recurrence). + - **`test-backend-ops` env-on = 43-44/46** (`gdn_op_tests_env_on_green=false`): the + FLA-engaged `head_size=128, n_seq_tokens>=64` cases marginally exceed `1e-7` + (**ERR 1.03-1.06e-7**, fluctuating across the boundary) because the port uses plain + **tf32** where M5 uses **3xtf32 (CUTLASS fp32-emulation)** for the decay-coupled + compounding state products; M5-chunked passes the SAME cases at `< 1e-7`. Judgment: + marginal tf32-vs-3xtf32 gap, benign at model level (KL green); 3xtf32 would add mma + count and deepen the perf NO-GO, so not pursued. + - **Engagement PROVEN:** `LLAMA_GDN_FLA_TRACE` fired `[gdn-fla] engage H=32 ...` in + `batched-bench`; nsys shows all six `gdn_fla::` kernels under + `LLAMA_GDN_FLA_CHUNK=1` and none under default. +- **Honest delta vs the +59.2 expectation.** Scope expected `~0-10 of the +59.2, "likely + a shared-hardware floor."` Delivered: **0 recovery, a -63 us/tok regression on the FLA + arm; the floor is confirmed.** M5's fused smem-resident chunked scan (56.31 us/tok) is + the winner and is at/near the GB10 memory-bandwidth floor for this op. What binds is + silicon (LPDDR5x bandwidth on the per-chunk h round-trip + the 99 KB smem cap forcing + the split), not the algorithm; it lifts only on datacenter Blackwell (HBM + larger + smem + TMEM), consistent with the scope's section-4 framing. + +Protocols honored: GPU lock held throughout and released; `LLAMA_MAX_BATCH_TOKENS` +unset; sm_121a; nsys `--cuda-graph-trace=node`; 3+ iter S_PP medians; no external +contention. 
WIP on the DGX fork topic branch `p5-fla-gdn` at +`2d64c37f08ad323038a44a89ab32189527c6ba29` (base `localai-paged` `653bb2f3d`, **NOT +pushed, NOT landed**): new `ggml/src/ggml-cuda/gdn-blocked-solve.cu` + narrow dispatch in +`gated_delta_net.cu` / `gated_delta_net.cuh`. Fork `localai-paged` HEAD **untouched at +`653bb2f3d`**; the LocalAI series **stays at 46 patches (`0001-0055`)**; topic branches +`p1-bf16-stream` / `p2-moe-region` / `p4-cbv2` left intact. Artifacts on the DGX +`~/bench/p5_fla_gdn/`: `killgate_20260702_204225/` (RESULTS.md, spp_control.txt, +spp_fla.txt, `nsys_{ctrl,fla}{2048,512}.{nsys-rep,kern.csv}`, GATES.txt, +`kl_moe_{ctrl,fla}.log`, occupancy.txt, gdn-blocked-solve.cu, p5_fla_test.cu) and +`standalone_20260702_203434/` (RESULTS.txt + p5_fla_test.cu, p5_m5_time.cu, +m5_kernel_body.cuh). + +## P6 fp8-e4m3 KV cache (final program phase) - NO-GO at the measured Stage-0b proxy; fp8/quant KV is a decode-THROUGHPUT dead end on GB10 hybrid-GDN, capacity-play stays open (recorded 2026-07-02) + +Sixth and final phase of the `EXECUTION_REARCH_SCOPE.md` additive program, and the **retry +that unblocked** the prior BLOCKED-ON-INFRA attempt (`ssh dgx.casa`, host `promaxgb10-4ad8`, +reachable throughout). The kill-gate **ran this time**: Stage 0a (measured nsys +`--cuda-graph-trace=node` decode ceiling) plus a **zero-code Q8_0-KV A/B proxy** for +Stage 0b. **Verdict: NO-GO for the throughput lever; the e4m3 kernel was correctly never +built.** See the "P6 RESULT" subsection in `EXECUTION_REARCH_SCOPE.md` for the full record. +Summary: + +- **STAGE 0a MEASURED CEILING (supersedes the analytical prior).** v1 difference-of-totals + was noise-dominated (prefill variance >> the 48-step decode delta -> INDETERMINATE); the + v2 per-kernel decode-isolation estimator (`~/bench/p6_ceiling_v2.py`) keeps only + ntg-scaling kernels and matches the batched-bench wall `t_tg` within 0.3%. 
fp8 halves KV + bytes, so theoretical-MAX decode saving = 0.5 x fa_KV-read_share (fa-only, honest): + **moe/dense std ctx512 x128 +2.16% / +3.44%; ctx4096 x8 +3.90% / +4.80%; ctx8192 x8 + +7.15% / +8.81%** (fa+gather upper bound tops at +10.48%). Only long context clears +3%; + the analytical prior (0.65% std, +17.34% ctx8192) is refuted in BOTH directions. +- **STAGE 0b MEASURED Q8_0 A/B PROXY (the decisive kill; 5 reps/arm, sigma 0.08-0.22%).** At + the highest-ceiling shapes: **dense ctx8192 x8 = +0.37% decode (flat; captures ~4% of the + +8.81% ceiling); moe ctx8192 x8 = -2.63% decode REGRESSION.** Even Q8_0 - the quant path + with the FAVORABLE integer DP4A fattn-vec dot - realizes ~none of the ceiling; dequant-in- + attention eats the KV-read BW saving (re-confirming the historical Q8_0 +7.8% null). +- **e4m3 IS STRICTLY WORSE THAN Q8_0 (structural, no build needed).** The fast quant-KV + fattn-vec path (`vec_dot_fattn_vec_KQ_q8_0`) wins on an int8xint8 DP4A dot; an e4m3 KQ path + cannot use DP4A (dequant->float then float-dot, strictly more expensive). e4m3's cheaper + hw-convert dequant does not touch the KQ product where Q8_0 already lands flat/negative. So + the Q8_0 proxy is a definitive disproof for e4m3; funding the e4m3 build to re-confirm a + stronger negative was declined. +- **HYBRID-GDN STRUCTURAL CAP.** Only 10 of 40 layers carry KV (30 GDN layers hold a + fixed-size recurrent state, no KV, ctx-independent), so fp8 can touch at most the 10/40 + slice - the reason flash-attn is a small decode fraction and the ceiling tops at +8.81%. +- **CAPACITY-PLAY STAYS OPEN.** As a **footprint** feature (not t/s), e4m3 KV halves the + 10/40 attention layers' KV bytes = a real long-ctx/high-concurrency capacity win; the + storage path already works today (`-ctk/-ctv q8_0` runs correctly on the paged binary at + small/zero decode cost on dense). Gate any future effort on footprint + per-path KL, NOT + on throughput. 
+- **DEFAULT PATH MEASURED GREEN (re-run this session).** Canonical greedy-md5 on the + byte-identical P6 binary (0 dirty vs `653bb2f3d`), paged: MoE `8cb0ce23`, dense `5951a5b4`. + +Provenance: fork `localai-paged` HEAD **untouched at `653bb2f3d`**; topic branch `p6-fp8-kv` +retained on the DGX (base `653bb2f3d`, the unmodified measurement worktree `~/llama-paged-p6`, +sm_121a), **NOT pushed**; LocalAI series stays at **46 patches (`0001-0055`)**; P3's +`p3-w4a16-direct` (`8eef7ba43`, WIP NO-GO, not landed to `localai-paged`) untouched. +Artifacts: `~/bench/p6_fp8_kv/{ceiling_20260702_215535,q8proxy_20260702_223414,md5gate}/` + +runners `p6_ceiling_v2.py`, `p6_q8proxy_ab.sh`. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md b/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md new file mode 100644 index 000000000000..9edd32c73d57 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md @@ -0,0 +1,156 @@ +# llama.cpp patch series — paged attention (vLLM-parity engine) + +A **stacking** series: each patch is a small, self-contained, independently-buildable step toward an +in-model paged-attention engine. They apply in numeric order on top of the pinned `LLAMA_VERSION` +(`backend/cpp/llama-cpp/Makefile`). The build applies them automatically after checkout (see the +`llama.cpp:` target). Keeping the work as ordered patches — rather than one big diff — is what lets us +**rebase cleanly across llama.cpp bumps and avoid drift**: when a patch stops applying, only that small +patch needs fixing, and the failure points at exactly which step the upstream change touched. + +## Base + +- `LLAMA_VERSION` pin in `../Makefile`. **All patches are generated against that exact commit.** Bumping + the pin = re-run the regen workflow below and fix only the patches that no longer apply. 
+ +## The series (phases → patches) + +| # | Patch | What | Verifies | +|---|-------|------|----------| +| 0001 | `0001-vendor-paged-kv-manager.patch` | Add `src/paged-kv-manager.{h,cpp}` (vLLM-parity block manager, CPU foundation) + CMake; no behavior change | builds; unit-tested separately | +| 0002 | `0002-paged-kv-storage.patch` | Shared block-pool KV tensor + `set_rows`-by-slot writes, behind `LLAMA_KV_PAGED` | builds; write/gather round-trip | +| 0003 | `0003-paged-gather-read.patch` | `build_attn_paged` gather-read in `llama-graph.cpp` | **Gate 0**: token-identical greedy gen, single + multi-seq | +| 0004 | `0004-paged-ondemand-alloc.patch` | On-demand block allocation via PagedKVManager | max concurrent seqs before OOM | +| 0005 | `0005-paged-continuous-batching.patch` | Block-granular admit/evict in the server slot path | tok/s vs concurrency, mixed-length | +| 0006 | `0006-paged-prefix-caching.patch` | Block-hash cross-request prefix dedup | TTFT + memory on shared prefixes | + +Each row is a separate `git commit` on the dev branch (below), exported 1:1 as a patch. Default off +(`LLAMA_KV_PAGED`) until Gate 0 (0003) is green, so partial series never changes stock behavior. + +## Regen workflow (the anti-drift recipe) + +```sh +# 1. check out the exact pin into a dev tree +git -C /tmp clone https://github.com/ggml-org/llama.cpp llama-dev && cd /tmp/llama-dev +git checkout +git checkout -b paged + +# 2. apply the current series (each becomes a commit), or develop the next patch +git am /path/to/backend/cpp/llama-cpp-localai-paged/patches/paged/00*.patch # or `git apply` + commit per patch + +# 3. iterate a phase as ONE commit, then export the whole series 1:1 +git format-patch ..paged -o /path/to/backend/cpp/llama-cpp-localai-paged/patches/paged/ --zero-commit -N + +# 4. on a pin bump: rebase `paged` onto the new pin; only conflicting patches need edits; re-export. 
+``` + +## Build integration + +The series is owned by this backend (`backend/cpp/llama-cpp-localai-paged`), not by the stock +`llama-cpp` backend, which is pure upstream. `../Makefile` (the paged wrapper) clones the pinned +`llama.cpp` via the copied stock build infra, then applies this series onto the cloned tree with the +same strict `git apply` the stock build uses for base patches: +``` +for p in $(PAGED_PATCHES_DIR)/0*.patch; do git apply --verbose "$p" || exit 1; done +``` +All variants (avx/avx2/avx512/cuda/…) clone + apply into their own build copy, so the series ships +everywhere without ever touching the stock `llama-cpp` source tree. + +## Latest mirror check + +Phase 37 re-verified the mirror invariant after adding patch `0063`: + +```text +base=0ed235ea2c17a19fc8238668653946721ed136fd +applied_tree=dedb1182910eafe9f6875588dc8285bfb544cce5 +fork_tree=dedb1182910eafe9f6875588dc8285bfb544cce5 +``` + +The check used a fresh worktree at `LLAMA_VERSION`, applied every +`patches/paged/0*.patch` with strict `git apply`, staged the result, and compared +`git write-tree` to the canonical fork branch `localai-paged` at +`2d590d770 feat(cuda): trace cublas tensor names`. + +Phase 69 re-verified that the committed LocalAI patch series still matches the +Phase 37 fork tip, and then dry-ran the additive patch export needed for the +current local fork HEAD. No generated patch files were edited in Phase 69 because +the repo policy requires pushing the fork branch before regenerating the LocalAI +series, and pushes still require explicit approval.
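The mirror invariant above comes down to one comparison: the tree hash produced by applying the series must equal the fork's tree hash. A minimal sketch of that comparison, assuming the hashes have already been collected with `git write-tree`; the function name `mirror_check` and the returned dict layout are illustrative and not part of the repo tooling:

```python
def mirror_check(base: str, applied_tree: str, fork_tree: str) -> dict:
    """Report whether the patched tree and the fork tree are identical.

    All three arguments are plain hex strings; nothing here calls git.
    """
    return {
        "base": base,
        "applied_tree": applied_tree,
        "fork_tree": fork_tree,
        "match": "yes" if applied_tree == fork_tree else "no",
    }


# Hashes copied from the Phase 37 record above.
phase37 = mirror_check(
    base="0ed235ea2c17a19fc8238668653946721ed136fd",
    applied_tree="dedb1182910eafe9f6875588dc8285bfb544cce5",
    fork_tree="dedb1182910eafe9f6875588dc8285bfb544cce5",
)

# Hypothetical mismatch (made-up hashes) to show the failing case.
drifted = mirror_check(base="0" * 40, applied_tree="a" * 40, fork_tree="b" * 40)
```

A `match` of `no` means the vendored series and the fork have drifted apart, so the series should not be committed until they agree.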
+ +Committed-series check: + +```text +base=0ed235ea2c17a19fc8238668653946721ed136fd +applied_tree=dedb1182910eafe9f6875588dc8285bfb544cce5 +patch_tip_tree=dedb1182910eafe9f6875588dc8285bfb544cce5 +fork_head_tree=fcf5720b659c5e1e2b487ccf3c8f7289bb12b9c4 +match_patch_tip=yes +match_fork_head=no +patch_count=54 +``` + +Dry-run export from `2d590d770..ea0875d14` produced ten source-only candidate +patches: + +```text +0064-feat-server-trace-serving-admission-batches.patch +0065-feat-server-add-admission-trace-histograms.patch +0066-feat-server-add-TTFT-prefill-first-scheduler-mode.patch +0067-feat-server-cap-TTFT-prefill-first-decode-deferral.patch +0068-feat-server-gate-TTFT-defer-by-prompt-backlog.patch +0069-test-cuda-cover-W4A16-direct-activation-policy.patch +0070-feat-cuda-route-W4A16-direct-activation-stub.patch +0071-feat-cuda-trace-layout-tensor-names.patch +0072-feat-cuda-trace-activation-quant-routes.patch +0073-feat-cuda-gate-BF16-cuBLAS-F32-output.patch +``` + +Projected-series check with current `0001..0063` plus temp `0064..0073`: + +```text +base=0ed235ea2c17a19fc8238668653946721ed136fd +applied_plus_missing_tree=fcf5720b659c5e1e2b487ccf3c8f7289bb12b9c4 +fork_head_tree=fcf5720b659c5e1e2b487ccf3c8f7289bb12b9c4 +match_fork_head=yes +current_patch_count=54 +missing_patch_count=10 +projected_patch_count=64 +``` + +Next mirror action after explicit push approval: + +1. Push `/home/mudler/_git/llama.cpp` branch `localai-paged` to + `fork/localai-paged`. +2. Regenerate or copy the equivalent source-only `0064..0073` patches from the + pushed fork. +3. Repeat the projected-series tree hash check above against fork HEAD before + committing generated patches. + +## Status + +- **0001 vendor manager — DONE.** Applies clean to the pin; builds into `libllama`. +- **0002 block placement — DONE + VERIFIED.** Built `llama-simple` at the pin; greedy generation is + **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B), paged branch confirmed firing. 
+- **0003 gather-read — DONE + VERIFIED (Gate 0 green).** Implemented in the **additive** form + (see `../README.md`): all logic in new `src/paged-attn.{h,cpp}` (a `llm_graph_input_i` gather-index + subclass + the K/V/mask gather), hooked by **one** line in `build_attn` + **two** thin accessors on + `llama_kv_cache_context` + 1 CMake line (216 insertions; no edit to `llm_graph_input_attn_kv` or + `llama-graph.h`). Greedy generation is **token-identical** stock vs `LLAMA_KV_PAGED=1` (Qwen3-0.6B, + **9/9** across 3 prompts × {32,96,128} tokens), with `n_gather=71 < n_kv=256` confirming real + compaction. Patch: `0003-paged-gather-read-env-LLAMA_KV_PAGED.patch`. + - **Key correctness finding:** `get_gather_idxs` must emit cells **sorted by token position**. The CPU + flash-attn online softmax reduces cells in physical-array order and is FP-order-sensitive, so 0002's + scattered placement *alone* (full-window read, no gather) diverges from stock once a sequence crosses + the first 16-cell block. The position-sorted gather reproduces stock's exact reduction order -> bit- + identical, not merely mathematically equivalent. So 0002 is the placement substrate; **0003 is what + makes paged placement token-identical under flash-attn.** +- 0004–0006 follow. + +### Honest parity note (important) + +This series delivers the paged-attention **engine** (capacity + scheduling + prefix sharing). It does **not** +by itself reach vLLM throughput parity, because the measured prefill bottleneck is the **FP4 MoE GEMM kernel** +(Lever 3: `mul_mat_q` ~22 TFLOP/s, ~27× behind vLLM) — a *per-token compute* gap that paging does not +touch. Paged attention closes the **concurrency/memory** gap (more sequences, prefix reuse); the prefill/throughput +gap additionally needs the tcgen05/CUTLASS grouped-GEMM (deferred, upstream-grade, no shortcut — see +`../README.md`). So full vLLM parity = this series **AND** the +kernel; neither alone suffices. 
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PREFILL_GEMM_RESULTS.md b/backend/cpp/llama-cpp-localai-paged/docs/PREFILL_GEMM_RESULTS.md new file mode 100644 index 000000000000..8dfa0ead3b78 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/docs/PREFILL_GEMM_RESULTS.md @@ -0,0 +1,76 @@ +# PREFILL_GEMM_RESULTS - option (a) dequant->bf16 cuBLAS, measured on GB10 + +Companion to `PREFILL_GEMM_SCOPE.md`. This records the GPU A/B for the #1 +prefill lever (route large-M NVFP4 dense GEMMs off FP4-MMQ onto dequant->bf16 +cuBLAS / nvjet). Shipped as patch `0033`, **default-off** because the measured +result is a regression on this hardware. + +Hardware: NVIDIA GB10 (sm_121), CUDA 13.0. Backend pin `9d5d882d`. +Models: `q36-27b-nvfp4.gguf` (dense), `q36-35b-a3b-nvfp4.gguf` (MoE). +Binary: `build-cuda/bin/llama-batched-bench -fa on -ngl 99`, `LLAMA_KV_PAGED=1`. +A/B is a single build toggled by `LLAMA_FP4_PREFILL_M` (0 = MMQ baseline, >0 = +route prefill M>threshold to bf16 cuBLAS), so it isolates exactly this lever. + +## 1. 
Bit-exact / numeric gate (PASS - divergence benign) + +| Gate | Result | +|---|---| +| `test-backend-ops -o MUL_MAT` (default, threshold off) | 1146/1146 pass | +| `test-backend-ops -o MUL_MAT_ID` (default) | 806/806 pass (MoE untouched) | +| `test-backend-ops -o MUL_MAT`, path FORCED (`LLAMA_FP4_PREFILL_M=64`) | NVFP4 large-M cases (m=2048/1600/2050, n=128, k=2048) green CUDA-vs-CPU | +| greedy md5, short prefill (< threshold), lever vs base | identical: `5951a5b4d624ce891e22ab5fca9bc439` (== documented dense reference; decode byte-untouched) | +| greedy md5, long prefill (> threshold, exercises bf16 path), lever vs base | identical: `5f3967df5781445feeb25762abb9eae7` (the new FP path flips no greedy argmax) | + +The new path (NVFP4->bf16 round, bf16 tensor cores, f32 accumulate) is a +different FP path from fused FP4xQ8_1 MMQ, but it is precision-neutral-to-better: +keeping activations in bf16 instead of Q8_1 is strictly more precise, and the +greedy output is byte-identical. This matches the scope's prediction +(KLD(dequant-bf16 || f16) <= KLD(FP4-MMQ || f16)). + +## 2. Performance (REGRESSION - the lever loses on GB10) + +S_PP (prefill tokens/s), q36-27b dense, A/B `LLAMA_FP4_PREFILL_M` off vs on: + +| prefill ubatch M | npl | base S_PP (MMQ) | lever S_PP (bf16 cuBLAS) | delta | +|---|---|---|---|---| +| 512 | 32 | 958.99 | 486.65 | -49% | +| 1024 | 8 | 1013.65 | 587.27 | -42% | +| 2048 | 8 | 918.46 | 649.42 | -29% | + +Default-off control (no env): S_PP 966.98 == base (within noise) -> the patch is +inert by default. + +## 3. Why it loses (the scope premise was wrong for GB10) + +The scope assumed FP4-MMQ is register-bound to ~3% of FP4 peak at large M, so a +vendor large-M kernel would win. **Measured, FP4-MMQ at M=512..2048 beats +dequant->bf16 cuBLAS by 29-49%.** Two compounding reasons: + +1. **bf16 tensor-core peak is ~half FP4 peak on GB10.** Even a perfect bf16 GEMM + caps at ~half the throughput the FP4-MMA path can reach. +2. 
**The dequant tax is an un-amortized memory pass.** Per prefill step the new + path reads FP4 weights (~0.5 B/elt), writes bf16 (2 B/elt), then the GEMM + reads bf16 (2 B/elt) = ~4.5 B/elt in total, ~9x the weight byte traffic of the + FP4-MMQ read (~0.5 B/elt). The dequant write is M-independent, so it amortizes + only as M grows: the gap shrinks 49% -> 42% -> 29% from M=512 -> 2048 but never + crosses even at M=2048 (above the default n_ubatch). + +This is also consistent with the README decode finding that the dense path was +already ~96-97% of vLLM - the dense GEMM was never the bottleneck the way the +prefill ground-truth (measured on the MoE decision model) implied. + +## 4. Status of the phases + +- **Phase 1 (dense): REJECTED on GB10**, landed default-off as a validated, + env-gated scaffold (mechanism + bit-exact gate reusable by option (b) and by + non-GB10 hardware where bf16 may fare differently). +- **Phase 2 (MoE grouped large-M): NOT implemented.** It inherits the same + bf16-peak < FP4-peak ceiling plus a per-expert dequant, so a grouped + bf16-cuBLAS would regress for the same reason; the MoE id-path also has the + graph-safety catch (a false `should_use_mmq` falls to the host-sync sorted + loop, not CUDA-graph-safe). Not worth the multi-day grouped-cuBLAS + graph + work on a path the dense A/B already shows loses. +- **The only route to a real prefill GEMM win is option (b)** - a native + Blackwell FP4-MMA large-M kernel (multi-week), to greenlight only if the + prefill regime is funded. The committed scaffold gives option (b) its + M-threshold routing and its bit-exact gate for free.
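The S_PP deltas in section 2 and the byte-traffic arithmetic in section 3 can be re-derived from the figures quoted in this file. A small sketch, assuming the approximate bytes-per-element values stated in section 3 (block scales ignored); the variable and function names are illustrative only:

```python
# (prefill ubatch M, base S_PP with FP4-MMQ, lever S_PP with bf16 cuBLAS), from section 2.
rows = [(512, 958.99, 486.65), (1024, 1013.65, 587.27), (2048, 918.46, 649.42)]


def delta_pct(base: float, lever: float) -> float:
    """Relative change of the lever arm versus the base arm, in percent."""
    return (lever / base - 1.0) * 100.0


deltas = {m: round(delta_pct(base, lever)) for m, base, lever in rows}

# Approximate weight bytes per element (assumption: scale bytes are ignored).
FP4_BYTES = 0.5
BF16_BYTES = 2.0

# The dequant path adds one bf16 write and one bf16 read on top of the FP4 read.
bf16_passes = BF16_BYTES + BF16_BYTES
ratio_bf16_only = bf16_passes / FP4_BYTES                # bf16 write + read vs one FP4 read
ratio_all_passes = (FP4_BYTES + bf16_passes) / FP4_BYTES  # also counting the FP4 read
```

The two bf16 passes alone are 8x a single FP4 weight read; counting the FP4 read that feeds the dequant as well brings the total to 9x. Either way the extra traffic does not depend on M, which is why the measured gap narrows but never closes as M grows.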
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/PREFILL_GEMM_SCOPE.md b/backend/cpp/llama-cpp-localai-paged/docs/PREFILL_GEMM_SCOPE.md new file mode 100644 index 000000000000..09ea55949a8b --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/docs/PREFILL_GEMM_SCOPE.md @@ -0,0 +1,264 @@ +# PREFILL_GEMM_SCOPE - large-M NVFP4 expert/dense GEMM (design only) + +**Status: DESIGN + PLAN ONLY. No kernel written, no GPU run in this pass.** +This scopes the #1 prefill lever for `llama-cpp-localai-paged`: the NVFP4 weight +GEMM at large M (prefill), where llama.cpp's `mul_mat_q` (MMQ) NVFP4 path is far +slower than vLLM's `marlin_moe_wna16` (MoE) + cutlass/nvjet (dense). Per the +prefill ground-truth that motivated this scope, the GEMM bucket is ~232 us/tok +(paged) vs ~68 us/tok (vLLM) - 3.4x slower, ~51% of the paged-vs-vLLM prefill +gap (164 us/tok). + +> **Regime warning (read first).** Every "GEMM is at the BW floor / ties vLLM" +> conclusion in `README.md` section 5 is a **DECODE** finding (M<=128, +> bandwidth-bound). This document is about **PREFILL** (large M, compute / +> tensor-core-throughput bound) - a different regime, which is exactly why the +> rejected "W4A16-Marlin MoE GEMM" lever is revisited here **for prefill only**. +> The 232/164/68 us/tok prefill bucket came from the prefill ground-truth that +> commissioned this scope and is **not** in a committed in-repo profile (the +> committed profiling - `GAP_PROGRESS.md` etc. - is decode-focused). Per the +> "profile-don't-assume" rule in `.agents/vllm-parity-methodology.md`, **step 0 of +> any build is to re-confirm the prefill GEMM bucket on GPU** (nsys, prefill-only +> window) before touching code. + +--- + +## 1. Why `mul_mat_q` is slow at large M (confirmed from source) + +Source: `ggml/src/ggml-cuda/mmq.cu`, `mmq.cuh` at this backend's pin (`9d5d882d`). + +MMQ is built for the **M<=128 decode tile**. Three structural facts from the code: + +1. 
**The M (column/token) tile is capped at 128.** + `get_mmq_x_max_host()` / `get_mmq_x_max_device()` (mmq.cuh ~108-140) return + `128` on Blackwell (`turing_mma_available(cc)`), and the host launch loop + (mmq.cuh ~4237) picks `mmq_x_best` only to *minimise the column-tile count for + `ncols_max`, never exceeding `mmq_x_max`*. So a prefill ubatch of M=512 (or + 4096) tokens is processed as many `mmq_x<=128` column-tiles. The compile-time + accumulator tile is `mmq_x`-wide; there is no large-M (e.g. 256-wide) tile + variant. The whole tile-selection machinery exists to pick a *small* tile for + *small* batches, not to grow for large ones. + +2. **The FP4-MMA kernel is register-bound to 1 CTA/SM.** + `mul_mat_q` for FP4 is `__launch_bounds__(warp_size*nwarps, min_blocks=1)` + (mmq.cuh ~3579-3585), i.e. 256 threads, 1 resident block/SM (~255 regs/thread). + The patch-0017 comment in-tree states this plainly: the kernel is + "REGISTER-bound to 1 CTA/SM ... the under-occupancy that strands the kernel at + ~3% of FP4 peak at M=128." At large M the work per tile is bigger, but with one + CTA/SM the tensor cores still stall on LPDDR5x / shared-memory weight loads + with no CTA-level latency hiding - the design has no async multi-stage global-> + shared pipeline (cp.async double-buffering) that large-M GEMMs need. + +3. **Per-tile fixed overheads amortise poorly only because the tile stays small.** + Each tile re-stages weights into shared memory, runs the `MMQ_ITER_K_FP4=512` + K-loop, and the activations are quantized to Q8_1 (`quantize_mmq_fp4_cuda`, + block_fp4_mmq = FP4 weights x int8 activations). For decode this is the right + trade (FP4 weight traffic is the bottleneck). For large-M prefill the GEMM is + compute-bound, so the right structure is big tensor-core output tiles (e.g. + 128x256), a deep async load pipeline, and full SM occupancy - exactly what + cutlass 3.x / nvjet (cuBLAS) and marlin implement and MMQ does not. 
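Fact 1 above implies that a prefill ubatch of M tokens is covered by at least `ceil(M / 128)` column tiles. A sketch of that arithmetic only, assuming the 128 cap quoted from `get_mmq_x_max_host()`; `column_tiles` is a hypothetical helper and does not reproduce the actual selection loop in `mmq.cuh`, which also tiles the weight-row dimension:

```python
import math

# Upper bound on the MMQ column (token) tile on Blackwell, as stated in fact 1.
MMQ_X_MAX = 128


def column_tiles(m_tokens: int, mmq_x: int = MMQ_X_MAX) -> int:
    """Number of column tiles needed to cover m_tokens with tiles of width mmq_x."""
    if not 0 < mmq_x <= MMQ_X_MAX:
        raise ValueError("mmq_x must be in 1..MMQ_X_MAX")
    return math.ceil(m_tokens / mmq_x)


# A decode-sized batch versus the prefill ubatch sizes mentioned in the text.
tiles = {m: column_tiles(m) for m in (128, 512, 4096)}
```

A decode-sized batch fits in one column tile, while M=512 needs 4 and M=4096 needs 32, each paying the per-tile weight re-staging described in fact 3.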
+ +Patch 0017 already proved every *cheap* large-tile/occupancy lever inside MMQ +(`GGML_CUDA_FP4_MMQ_Y`, `GGML_CUDA_FP4_MINBLOCKS`) is a no-win on GB10 - because +the limit is the small-tile kernel *structure*, not a tunable. To win at large M +you must leave MMQ for a large-M kernel. + +--- + +## 2. Options (feasibility / bit-exactness / effort) + +### Key enabling facts already in the tree + +- **NVFP4 -> bf16/f16 dequant kernels already exist.** `convert.cu` defines + `dequantize_row_nvfp4_cuda`; `ggml_get_to_bf16_cuda` / `ggml_get_to_fp16_cuda` + / `ggml_get_to_fp16_nc_cuda` all return it for `GGML_TYPE_NVFP4`. The + non-Blackwell fallback ("falls back to dequant", README s2) already uses this. +- **cuBLAS on GB10 dispatches to nvjet** (NVIDIA's JIT tensor-core GEMM) - the + committed profiles already show `nvjet lm_head` and `nvjet non-FP4 cublas GEMM` + rows. So a dequant->cuBLAS bf16 GEMM lands on a vendor-tuned large-M kernel for + free. +- **BUT NVFP4 is explicitly excluded from the tensor-core cuBLAS path.** In + `ggml_cuda_op_mul_mat_cublas` (ggml-cuda.cu ~1659) the `use_fp16` predicate + begins `src0->type != GGML_TYPE_NVFP4 && ...`. So if NVFP4 reaches cuBLAS today + it falls to the `else` branch: dequant to **F32** + `cublasSgemm` (**no tensor + cores**) - useless for prefill. Relaxing this one exclusion (route NVFP4 to the + bf16/f16 tensor-core branch, where `to_*_cuda(NVFP4)` already exists) is the + pivot that makes option (a) a few-line change rather than a kernel. + +### (a) Dequant -> cuBLAS/cutlass bf16 GEMM for large M -- RECOMMENDED + +Dequant the NVFP4 weights to bf16 (transient pool buffer) once per prefill step, +then a large-M tensor-core `cublasGemmEx` (CUBLAS_COMPUTE_32F accumulate, bf16 +inputs). Activations stay bf16 (not Q8_1-quantized). + +- **Feasibility: HIGH.** All pieces exist (dequant kernels, cuBLAS bf16 path, + pool allocator). 
The only code change for the dense path is (i) make + `ggml_cuda_should_use_mmq` return false for NVFP4 dense above an M threshold so + the dispatch falls through to `ggml_cuda_op_mul_mat_cublas`, and (ii) relax the + `src0->type != GGML_TYPE_NVFP4` exclusion so it dequants to bf16 and uses + `cublasGemmEx` tensor-core, not f32 Sgemm. +- **Cost model (the crux - why it wins ONLY at large M).** Dequant is one extra + weight-sized memory pass (read ~0.5B/elt FP4 + scales, write 2B/elt bf16). The + bf16 GEMM then reads weights as bf16 = **4x the byte traffic of the FP4-MMQ + read**. At small M (decode) this 4x weight traffic dominates -> bf16-cuBLAS + loses -> keep MMQ (this is why decode stays FP4-MMQ; consistent with the + README decode verdict). At large M the GEMM is compute-bound and weight traffic + is amortised over hundreds of columns, so the 4x is cheap and cuBLAS's mature + large tiles + async pipeline + full occupancy dominate MMQ's 3%-of-peak small + tile. The dequant pass itself is ~one weight-read amortised over the whole + prefill step - negligible at large M. +- **Honest ceiling.** GB10 bf16 tensor-core peak is ~**half** the FP4 tensor-core + peak. A bf16 cuBLAS GEMM at ~70-80% of bf16 peak is ~35-40% of FP4 peak. That + is a huge jump from MMQ's ~3% large-M utilisation, but it is **not** automatic + full vLLM parity (vLLM prefill uses 4-bit weight tiles, staying near FP4-class + throughput). Expect this to recover most, not all, of the 232->68 gap. See s4. +- **Bit-exactness: NEW FP path** (NVFP4->bf16 round, bf16 TC, f32 accumulate) vs + fused FP4xQ8_1 MMQ. **Not byte-identical** - gate per-path via KLD exactly like + the paged-MoE `8cb0ce23` precedent (README s5 / `PAGED_BITEXACT_NOTE.md`). It + should pass *easily and favourably*: keeping activations in bf16 instead of + Q8_1 is strictly more precise than the MMQ path, so KLD(dequant-bf16 || f16) + should be <= KLD(FP4-MMQ || f16). 
This is a precision-neutral-to-better change, + not a precision regression like the rejected lever 4. +- **Effort: LOW-MEDIUM (a few days).** Dispatch flip + exclusion relax + an M + threshold + the KL gate + a prefill bench. No new kernel. Dense first; MoE is + the harder follow-on (see (c)/plan). +- **Memory note.** Dequant into a *transient* pool scratch per step (do **not** + cache bf16 weights - a persistent bf16 copy is 4x VRAM for those tensors and + would erase the backend's "1.5-3x less memory" property). The per-step dequant + pass is the price of keeping the model FP4-resident. + +### (b) Marlin-style fused NVFP4 large-M MoE GEMM (port `marlin_moe_wna16`) + +Port vLLM's marlin grouped MoE kernel (4-bit weights, f16 activations, dequant- +in-register, async cp.async pipelines, swizzled layouts). + +- **Feasibility: LOW (hardest).** Marlin is a hand-tuned CUTLASS-class kernel and + is **not NVFP4-aware** (it targets wna16 group-quant, not NVFP4's 16-elt blocks + with ue4m3 micro-scales). You would either (i) adapt marlin to dequant NVFP4 + in-register and accumulate in f16 (abandoning native Blackwell FP4-MMA), or + (ii) write a brand-new Blackwell sm_121 FP4-MMA large-M kernel - which is + essentially re-implementing what cutlass 3.x / nvjet already give you via (a). +- **Bit-exactness:** new FP path, KL-gate (same as (a)). +- **Effort: HIGH (multi-week, high risk),** kernel + layout + Blackwell MMA + scheduling + graph-safety + the bit-exact gate. +- **Verdict: do NOT start here.** Its only structural advantage over (a) is 4-bit + weight traffic, which matters only when BW-bound = small M = **decode**, the + regime already rejected. At large M (a) reaches the same vendor large-M kernels + for ~1% of the effort. Keep (b) on the shelf as the *only* route to true 68 + us/tok parity if (a)'s bf16 ceiling proves insufficient and the win justifies a + multi-week kernel. 
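The cost-model argument for (a) and the verdict on (b) both come down to how many FLOPs are done per byte of weight traffic. The sketch below is a back-of-envelope model only. It assumes one 1-byte scale per 16-element NVFP4 block on top of the 4-bit values (the text above rounds this to ~0.5 B/elt, giving the quoted 4x), and all names are hypothetical. It makes no claim about where GB10's compute/bandwidth crossover actually sits; that is what Phase 0 measures.

```python
# Assumption: 4-bit value plus one 1-byte scale shared by a 16-element block.
FP4_BYTES_PER_ELT = 0.5 + 1.0 / 16.0
BF16_BYTES_PER_ELT = 2.0


def flops_per_weight_byte(m_tokens: int, bytes_per_elt: float) -> float:
    """Each weight element does one multiply-add (2 FLOPs) per token column."""
    return 2.0 * m_tokens / bytes_per_elt


def weight_traffic_ratio() -> float:
    """How many times more weight bytes the bf16 path reads than the FP4 path."""
    return BF16_BYTES_PER_ELT / FP4_BYTES_PER_ELT


for m in (1, 128, 512, 4096):
    print(m, flops_per_weight_byte(m, BF16_BYTES_PER_ELT))
```

The traffic ratio is fixed, but the work done per weight byte grows linearly with M. That is the reason the extra bf16 traffic matters at decode sizes and is amortised at prefill sizes.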
+ +### (c) M-threshold routing (the integration mechanism for (a)) + +Not an alternative to (a) - it is *how* (a) is wired. Keep FP4-MMQ for decode +(M<=threshold), switch to the large-M path for prefill. + +- **Cleanest hook:** `ggml_cuda_should_use_mmq(type, cc, ne11_or_ne12, n_experts)` + already receives M (`ne11` dense / `ne12` MoE tokens). Add an NVFP4+Blackwell + branch: return false when M > `LLAMA_FP4_PREFILL_M` (default e.g. 256-512, + env/`-D` tunable, default value chosen so default == today's behaviour until + validated). It is called from both `ggml_cuda_mul_mat` (~2573/2582) and + `ggml_cuda_mul_mat_id` (~2664), so one edit covers dense + MoE routing. +- **Dense fallthrough is clean:** `ggml_cuda_mul_mat` final `else` -> + `ggml_cuda_op_mul_mat(..., ggml_cuda_op_mul_mat_cublas, ...)` -> with the + exclusion relaxed, dequant->bf16->`cublasGemmEx`. Works. +- **MoE fallthrough is NOT clean (the catch):** in `ggml_cuda_mul_mat_id`, a + false `should_use_mmq` falls to `should_use_mmf` (no NVFP4 support) then to the + **host-side sorted per-expert loop** with a `cudaStreamSynchronize` (ggml-cuda.cu + ~2700) - slow and **not CUDA-graph-safe** (it would break the MoE re-graph, + patch 0025). So MoE large-M needs a *dedicated graph-safe grouped GEMM* (dequant + the expert-gathered weights to bf16 + `cublasGemmGroupedBatchedEx`, CUDA 12.5+, + over the existing `expert_bounds`/`ids_dst` sorted layout), not a bare + fallthrough. This is why the plan ships **dense first, MoE second**. + +--- + +## 3. Recommended approach + implementation plan + +**Recommendation: (a) dequant->bf16 cuBLAS, wired via (c) M-threshold routing, +dense-path first, MoE grouped-cuBLAS second. Reject (b).** + +### Phase 0 - confirm the bucket on GPU (no code) +- nsys prefill-only window (`-npp -ntg 0/1`, exclude the graph-capture + step) on q36-27b dense and q36-35b-a3b MoE at the backend pin. 
Confirm the + NVFP4 `mul_mat_q` / `mul_mat_id` bucket is ~232 us/tok and that it is + compute-bound at prefill M (check tensor-core active % low, not BW-saturated). + If the bucket is not what the ground-truth claims, stop and re-scope. + +### Phase 1 - dense large-M NVFP4 -> bf16 cuBLAS (the bankable win) +Files / edits: +1. `ggml/src/ggml-cuda/mmq.cu` - `ggml_cuda_should_use_mmq`: add + `if (type==GGML_TYPE_NVFP4 && blackwell_mma_available(cc) && ne11 > LLAMA_FP4_PREFILL_M && n_experts==0) return false;` + (n_experts==0 = dense only in Phase 1). Default threshold == effectively + disabled until A/B-validated, env/`-D` overridable (mirror the 0017 + `GGML_CUDA_FP4_*` knob style + in-tree comment). +2. `ggml/src/ggml-cuda/ggml-cuda.cu` - `ggml_cuda_op_mul_mat_cublas`: relax the + `src0->type != GGML_TYPE_NVFP4` guard in `use_fp16` (prefer a dedicated bf16 + branch: NVFP4 -> `ggml_get_to_bf16_cuda` -> `cublasGemmEx` CUDA_R_16BF / + COMPUTE_32F, matching the existing BF16 src0 branch for best accuracy). +3. Transient pool scratch for the dequanted weights (reuse `ggml_cuda_pool_alloc` + as the existing branch does; no persistent allocation). + +### Phase 2 - MoE grouped large-M (the harder, higher-value follow-on) +1. New grouped path reached from `ggml_cuda_mul_mat_id` when + `should_use_mmq`==false for NVFP4+large-M+`n_experts>0`: dequant the + expert-gathered weights to bf16 and run `cublasGemmGroupedBatchedEx` over the + existing `expert_bounds` / `ids_dst` sorted layout that `mul_mat_q` already + builds. Reuse the patch-0023 de-dup'd activation gather where applicable. +2. **Must stay CUDA-graph-safe** - no host sync (do not fall into the legacy + sorted loop). Validate the MoE re-graph (patch 0025 / `LLAMA_MOE_FORCE_GRAPHS`) + still captures. 
+ +### The bit-exact / KL gate (both phases) +- Greedy md5 on the standard prompt (README s5) to detect *unexpected* divergence + on the non-prefill paths (must stay == the per-path reference: dense + `5951a5b4`, paged-MoE `8cb0ce23`). The large-M path itself will differ -> gate + it by KLD vs the f16 reference, requiring `KLD(new||f16) <= KLD(FP4-MMQ||f16)` + and PPL within the established band, recorded in `PAGED_BITEXACT_NOTE.md`. +- `test-backend-ops` MUL_MAT / MUL_MAT_ID at NVFP4 **prefill shapes** (large M) + CUDA0-vs-CPU, plus the existing decode shapes to prove decode is byte-untouched + (default threshold keeps decode on MMQ). + +### The bench +- `llama-batched-bench -fa on -ngl 99` reporting **S_PP** (prefill t/s), swept + over prefill length and `npl`, A/B with `LLAMA_FP4_PREFILL_M` off vs on, dense + and MoE, vs stock and vs the vLLM prefill reference. Per-lever A/B discipline + (`.agents/vllm-parity-methodology.md`): one knob at a time, record the rejected + threshold values too. + +--- + +## 4. Honest risk + expected speedup + +- **Phase 1 (dense) is a tractable routing change, not a kernel project** - days, + low risk. It reuses existing dequant kernels and the existing nvjet/cuBLAS + large-M path; the net new code is a threshold + a one-line exclusion relax + a + KL gate. +- **Phase 2 (MoE) is medium risk** - the grouped-batched cuBLAS wiring + + CUDA-graph-safety is real work (the bare fallthrough is a slow, graph-breaking + host loop), but still far short of a from-scratch kernel. +- **Will the GEMM bucket hit 232 -> ~68 us/tok (full vLLM parity)? Honestly, no - + not from bf16-cuBLAS alone.** bf16 tensor-core peak on GB10 is ~half FP4 peak, + so the realistic floor for a dequant->bf16 GEMM is ~**90-130 us/tok** (roughly + 35-45% of FP4 peak at ~70-80% of bf16 peak). 
That recovers ~**60-75%** of the + 232->68 bucket gap = a large prefill win (the GEMM is ~51% of the total prefill + gap, so closing ~two-thirds of it is a meaningful S_PP improvement), but it + leaves a residual. **True 68 us/tok parity requires a native FP4-MMA large-M + kernel (option (b)) - the multi-week project** to greenlight only if Phase 1's + measured win proves the prefill regime matters enough to fund it. +- **Recommendation:** build Phase 1, measure, and let the measured dense S_PP + gain decide whether Phase 2 (MoE grouped cuBLAS) and ultimately (b) (native FP4 + large-M kernel) are worth funding. Bank the cheap two-thirds before paying for + the kernel. + +--- + +## 5. Summary table + +| Option | Feasibility | Bit-exact | Effort | Verdict | +|---|---|---|---|---| +| (a) dequant->bf16 cuBLAS large-M | HIGH (parts exist) | new FP path, KL-gate (likely better PPL) | LOW-MED (days) | **RECOMMENDED** (dense first) | +| (b) Marlin/native FP4 large-M kernel | LOW | new FP path, KL-gate | HIGH (multi-week) | shelf - only route to true 68 us/tok | +| (c) M-threshold routing | HIGH | n/a (mechanism) | LOW | **the wiring for (a)** | + +Decode is untouched by all of the above (threshold keeps M<=128 on FP4-MMQ); this +is a **prefill-only** lever. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/TENSORCORE_GDN_BUILD_PLAN.md b/backend/cpp/llama-cpp-localai-paged/docs/TENSORCORE_GDN_BUILD_PLAN.md new file mode 100644 index 000000000000..2331691ef917 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/docs/TENSORCORE_GDN_BUILD_PLAN.md @@ -0,0 +1,628 @@ +# Tensor-Core GDN Build Plan + +> Auto-generated from the GDN build-design workflow. Build-ready spec for the full tensor-core chunked-scan kernel (2nd prefill lever). + +## 1. 
Remaining intra-chunk products -> mma mapping + +--- + +# Tensor-core mapping of the REMAINING intra-chunk GDN products (patch 0031 steps 3-7) + +## 0. Building block + what the PoC already covered + +**Grounding.** Math: `backend/cpp/llama-cpp-localai-paged/patches/paged/0031-paged-chunked-gdn-prefill-scan-kernel.patch` (steps reproduced inline below). Scope/constraints: `backend/cpp/llama-cpp-localai-paged/docs/TENSORCORE_GDN_SCOPE.md`. Fragment API: `ggml/src/ggml-cuda/mma.cuh:976-984` (the only f32-accumulate tf32 overload on sm_121a). + +The single warp-level primitive on sm_121a is **`m16n8k8` tf32 / f32-accumulate** (`mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32`): +- `A` fragment = `tile<16,8,float>` (M=16, K=8; 4 floats/thread, `Axi[0..3]`) +- `B` fragment = `tile<8,8,float>` (K=8, N=8, `.col` operand; 2 floats/thread, `Bxi[0..1]`) +- `D` accumulator = `tile<16,8,float>` (M=16, N=8; 4 floats/thread) +- A GEMM `[M×K]·[K×N]` tiles to `ceil(M/16) × ceil(N/8) × ceil(K/8)` mma calls, f32-accumulating over the K-subtiles. +- bf16 alternative `m16n8k16` (`mma.cuh:1064`, K=16/mma, 7-bit mantissa) exists but is **only** admissible for the tf32-safe Gram class - never the state/decay-coupled class. +- 3xtf32 ladder = split each f32 operand into 2 tf32 limbs (hi, lo), run 3 limb-products per K-subtile (hi·hi, hi·lo, lo·hi), accumulate high→low. ~3x the mma count, ~f32 accuracy. + +**PoC covered products 1 + 2** (the two `C×C` Gram products, both tf32-safe, NMSE ~3e-9): `KK[t,t']=k_t·k_t'` → `A`, and `QK[t,t']=q_t·k_t'` → `P`. Both are `(C×dk)·(dk×C)`, M=C N=C K=dk=128, decay+beta applied in f32 after. They already share the `Kc^T` B-fragments. + +The remaining families are **steps 3,4,5,6,7**. 
Notation: `C` = chunk (default 64; PoC 16), `dk=dv=128`, per `(head,seq)` block. Tile counts below are for **C=64**. + +--- + +## 1. Per-product mma mapping table (the deliverable) + +| # | Product (0031 step) | Result = matmul | M | N | K | mma tiles `(M/16)·(N/8)·(K/8)` @C=64 | Accumulation order | Precision class | Shares staged operand with | +|---|---|---|---|---|---|---|---|---|---| +| 1 | `KK→A` (PoC) | `Kc · Kcᵀ` | C | C | dk=128 | 4·8·16 = **512** (~½ tri) | over 16 k-subtiles | **tf32-safe** (proven) | `Kcᵀ` B-frag ↔ P2; `Kc` LHS ↔ P3 | +| 2 | `QK→P` (PoC) | `Qc · Kcᵀ` | C | C | dk=128 | 4·8·16 = **512** (~½ tri) | over 16 k-subtiles | **tf32-safe** (proven) | `Kcᵀ` B-frag ↔ P1; `Qc` LHS ↔ P4 | +| 3 | `KS = S0ᵀk_t` | `Kc · S0` | C | dv=128 | dk=128 | 4·16·16 = **1024** | 16 k-subtiles, limbs hi→lo | **3xtf32 / f32** (state-boundary, feeds solve) | `S0` B-frag ↔ P4; `Kc` LHS ↔ P1 | +| 4 | `QS = S0ᵀq_t` | `Qc · S0` | C | dv=128 | dk=128 | 4·16·16 = **1024** | 16 k-subtiles, limbs hi→lo | **3xtf32 → demote-first** (×γ_t≤1 attenuated) | `S0` B-frag ↔ P3; `Qc` LHS ↔ P2 | +| 5 | `O += P·U` | `P · U` | C | dv=128 | C=64 | 4·16·8 = **512** (~½ tri over K) | C/8 k-subtiles, triangular | **tf32-safe** (P decay-masked & bounded in f32 first) | `P`(=Amat) ↔ P2; `U` B-frag ↔ P6 | +| 6 | `S_C += Kᵀ(D·U)` | `Kcᵀ · DU` | dk=128 | dv=128 | C=64 | 8·16·8 = **1024** | scale state by γ_last (f32) **first**, then C/8 k-subtiles, limbs hi→lo | **3xtf32 / f32** (THE cross-chunk carry, compounds over n_tok/C) | `U` B-frag ↔ P5; `Kc` (transposed) ↔ P1/3 | +| 7 | `U = A⁻¹·RHS` off-diag coupling `A_ij·U_j` | `A_ij · U_j` | b=16 | dv=128 | b=16 | 1·16·2 = **32**/pair → **~192** (6 pairs) +~128 diag | forward sweep i=0..C/b; off-diag subtractions before diagonal solve | **tf32-safe off-diag + f32 in-register `16×16` diagonal** | `A`(=Amat) ↔ P1; `U` blocks ↔ P5/P6 | + +3xtf32 inflation if the ladder is taken: P3 1024→**3072**, P4→**3072**, P6 1024→**3072**. + +--- + +## 2. 
Per-product detail (the 5 remaining families) + +### Product 3 - `KS = S0ᵀ k_t` (RHS state-boundary term) +0031: `ks = Σ_i Sd[j·dk+i]·Kc[t·dk+i]`; feeds `RHS[t][j] = β_t(v_t[j] − γ_t·ks)`. +- **As a GEMM:** `KS[t][j] = Σ_i Kc[t][i]·S0[i][j]` ⇒ `KS = Kc[C×dk] · S0[dk×dv]`. **M=C, N=dv=128, K=dk=128.** Contraction over the state-row index `i`. +- **Schedule:** `Kc` is the LHS (M-major over `t`, K over `i`) — already staged for P1. `S0` is the B operand, K-major over `i`, N over `j`. The patch's `Sd[j·dk+i]` layout (i contiguous for fixed j) **is already a K-major B layout** → `ldmatrix`-friendly as `tile<8,8>` B fragments. Accumulate 16 k-subtiles into f32 D. +- **Precision: 3xtf32/f32.** This is a state-boundary product: `S0` carries the full sequence history (wide dynamic range), and the result is *differenced* against `v_t` then fed into the solve, so error here propagates through `U` into both `O` and `S_C`. Default to the 3xtf32 ladder; A/B a plain-tf32 demote only after P4. + +### Product 4 - `QS = S0ᵀ q_t` (γ cross-chunk `O` term) +0031: `qs = Σ_i Sd[j·dk+i]·Qc[t·dk+i]`; `o = γ_t·qs + Σ P·U`. +- **As a GEMM:** `QS = Qc[C×dk] · S0[dk×dv]`. **M=C, N=dv=128, K=dk=128** — identical shape to P3. +- **Schedule:** identical to P3 but LHS=`Qc` (shared with P2). **Fuse with P3 on the shared `S0` B-fragments:** stage `S0` once as B, run `Kc·S0` then `Qc·S0` back-to-back — `S0` is the heavy operand (128×128) and is loaded once for both. +- **Precision: 3xtf32 but the demote-first candidate.** The term is scaled by `γ_t ≤ 1` in f32 after the mma, so when the chunk has decayed (`γ_t→0`) the absolute error is attenuated. Least sensitive of the three state-boundary products; it is the first to try at plain tf32 in the precision A/B. + +### Product 5 - `O += P · U` (attention-weighted output) +0031: `o += Amat[t·Cc+tp]·Ud[j·C+tp]` for `tp≤t`. +- **As a GEMM:** `O[C×dv] += P[C×C] · U[C×dv]`. **M=C, N=dv=128, K=C=64.** Contraction over the chunk index `t'`. 
+- **Schedule:** `P` (=Amat scratch from P2, with `d(t',t)` applied in f32) is LHS (M over t, K over t'); `U` (solved, in `Ud`) is the B operand, K-major over t'. `P` is **lower-triangular** ⇒ for M-tile `m` only K-subtiles `≤ m` are non-zero → ~½ the mma. Accumulate `C/8` k-subtiles. Add the `γ_t·QS` term (P4) into the same f32 D accumulator before write-out. +- **Precision: tf32-safe.** `P = d·QK` with `d≤1` is formed and bounded **in f32 first** (strong-decay rows already underflowed to ~0), so down-casting the bounded `P` to tf32 for this mma is benign. The decay is never inside the accumulation — it is pre-baked in f32, preserving the bounded de-gating invariant. + +### Product 6 - `S_C += Kᵀ(D·U)` (the state update) +0031: `s = γ_last·Sd[j·dk+i] + Σ_t d(t,last)·Kc[t·dk+i]·Ud[j·C+t]`. +- **As a GEMM:** let `DU[t][j] = d(t,last)·U[t][j]` (D=diag applied in f32). `S_C[i][j] += Σ_t Kc[t][i]·DU[t][j]` ⇒ `S_C[dk×dv] += Kcᵀ[dk×C] · DU[C×dv]`. **M=dk=128, N=dv=128, K=C=64.** Contraction over the chunk index `t`. +- **Schedule:** the accumulator D fragments **are the register-resident state** that persists across the chunk loop. Order is strict: (i) scale the state fragments by `γ_last` in f32 in-register, **then** (ii) mma-accumulate `Kcᵀ·DU` over `C/8` k-subtiles into them. LHS = `Kc` read **transposed** (i as M-row, t as K) — a different fragment view of the same `Kc` smem buffer (use the `ldmatrix` transpose / J-major tile view). B = `DU` = `U` scaled by `d(t,last)` in f32, K-major over t — **same `U` B-layout as P5**. +- **Precision: 3xtf32 / f32 — the strongest ladder candidate.** This is the only product whose error *compounds across all `n_tokens/C` chunk steps*; it defines the state trajectory. Keep at 3xtf32 longest; this is the last product to ever consider demoting, and the place where a full-f32 accumulate (3xtf32) is most justified even if everything else passes plain tf32. 
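The dense tile counts quoted in the mapping table and in the product details follow directly from the `ceil(M/16) × ceil(N/8) × ceil(K/8)` rule. The sketch below recomputes them for C=64, dk=dv=128. `mma_tiles` is a hypothetical helper, and it returns the full dense count; the triangular products (P1, P2, P5) need roughly half of it, as noted above.

```python
import math


def mma_tiles(m: int, n: int, k: int,
              tile_m: int = 16, tile_n: int = 8, tile_k: int = 8) -> int:
    """Number of m16n8k8 mma calls for a dense [m x k] . [k x n] product."""
    return (math.ceil(m / tile_m) * math.ceil(n / tile_n)
            * math.ceil(k / tile_k))


C, DK, DV = 64, 128, 128
counts = {
    "P1 KK (C x C, K=dk)": mma_tiles(C, C, DK),
    "P3 KS (C x dv, K=dk)": mma_tiles(C, DV, DK),
    "P5 P.U (C x dv, K=C)": mma_tiles(C, DV, C),
    "P6 state (dk x dv, K=C)": mma_tiles(DK, DV, C),
    "P7 off-diag pair (16 x dv, K=16)": mma_tiles(16, DV, 16),
}
for name, tiles in counts.items():
    # The 3xtf32 ladder runs three limb products per tile.
    print(name, tiles, 3 * tiles)
```

The second printed number per row is the 3xtf32 inflation, which only applies to the products the table assigns to the 3xtf32 / f32 class.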
+ +### Product 7 - the A-inverse (blocked forward substitution, FLA UT-transform) +0031 does a serial per-thread fwd-subst. Tensor-core form (block `b=16` = one mma M-tile, `C/b=4` blocks at C=64): +- For block `i`: `U_i = Ainv_ii·(RHS_i − Σ_{j`), but P3/P4 need it as a **B operand** (`tile<8,8>`, K-major over `i`). These fragment layouts differ, so at chunk entry the state must be re-laid-out accumulator→B. Cheapest correct path: bounce the 128×128 state through a transient smem tile (write D fragments, `ldmatrix` back as B fragments) once per chunk - `n_tokens/C` times total, negligible vs the `C`x state-BW saved, but it means the "freed 64KB" needs a *transient* state-shaped smem tile at boundaries (not resident across the intra-chunk work). With dv-slabbing it's per-slab (`dk×dv_tile`). + +2. **`Kc` needs two fragment views.** P1/P2/P3 read `Kc` contracting over `i` (dk); P6 reads `Kc` contracting over `t` (transposed, i as M-row). One smem buffer, but P6 must use the `ldmatrix` transpose / J-major `tile` view - budget for the transposed load, don't assume one staging serves both. + +Ordering within a chunk is already correct in 0031 and must be preserved by the tensor-core version: P3,P4 read **pre-update** `S0` → P7 solve → P5 → **P6 overwrites** `S0`→`S_C`. Single accumulator, read-first/write-last, no state double-buffer needed. + +## 2. A-inverse solve (form-T then apply, FLA UT transform) + +**Grounding.** Math: the chunked GDN scan in patch 0031. Approach: the scope doc's recommendation. Fragment API: the ggml tf32 `mma.sync` overload (`mma(tile<16,8,float>&D, tile<16,8,float>&A, tile<8,8,float>&B)` = m16n8k8). Layout: the proven Gram PoC (`g=lane>>2, t=lane&3`; tf32 NMSE ~3e-9). + +--- + +# A-inverse solve on `mma.sync` tensor cores (C=64, sm_121a) - design + +Notation: `C=64`, head dim `dk=dv=128`, block size `b=16` (= one `m16n8k8` m-tile), `n_b=C/b=4`. 
`A = I + N`, `N = tril(beta_t·d(t',t)·(k_t·k_t'), -1)` strictly-lower (nilpotent, `N^C=0`); `RHS[t][j] = beta_t(v_t[j] - gamma_t(S0^T k_t)[j])` is `C×dv`; we want `U = A^{-1}·RHS`. + +## 0. Core decision: form `T=A^{-1}` explicitly, then one wide apply (not direct back-subst) + +Two routes were on the table. **Form `T = A^{-1}` in the `C×C` domain (FLA "UT transform"), then `U = T·RHS` as a single tf32 GEMM** - rather than blocked forward-substitution applied directly to the `C×dv` RHS. Reasons, all decisive on this part: + +1. **Confines the only triangular dependency to the cheap `C×C` domain.** The expensive `dv=128`-wide work (`U=T·RHS`) becomes a dependency-free dense GEMM. The serial part is just the tiny `T`-formation. This is the single most important move for "don't serialize the warps." +2. **Fewer serial passes vs `dv`.** Inverting the `16×16` diagonal block once = a 16-column solve. Direct-solving against `RHS` re-solves against all `dv=128` columns per block. Form-`T`-once + reuse via mma is far cheaper in serial work. +3. **dv-slab reuse (the occupancy lever).** `T` depends only on `K`, not on `dv`. Form once, reuse for every `dv`-slab's `T·RHS_slab` apply. (Improvement over the scope's conservative "recompute per slab": when single-block, `T` lives in 16KB shared and is broadcast; only when dv-slabbing across separate blocks for occupancy do we recompute - which is cheap anyway, ~12% of the apply's mma count.) +4. **Isolates the error amplifier.** All recursion (the part that "amplifies error") lives in the small `T`-formation where 3xtf32 is nearly free; the bulk apply is a single well-conditioned GEMM. + +This still **is** the scope's "blocked forward substitution: in-register diagonal solves + mma off-diagonal coupling" - just organized to produce `T` explicitly so the wide apply is dependency-free. + +## 1. Solve algorithm + +Block-partition `A` into a `4×4` lower-triangular grid of `16×16` blocks. 
`A_ii = I_b + N_ii` (unit-lower-tri, `N_ii` strictly-lower nilpotent); `A_ij` (i>j) full `16×16`. `T=A^{-1}` is block-lower-tri with: + +``` +T_ii = A_ii^{-1} (diagonal block inverse) +T_ij = -A_ii^{-1} · ( Σ_{m=j}^{i-1} A_im · T_mj ) for i > j (block fwd subst) +``` + +Then `U = T·RHS`, with `U_i = Σ_{j≤i} T_ij·RHS_j`. + +**Phase D - diagonal inverses (4 blocks, fully parallel).** Each `A_ii` is `16×16` unit-lower-tri. Invert **exactly in f32** via shared-memory column-parallel forward substitution: stage `A_ii` to shared; thread `c` (c=0..15) solves `A_ii x = e_c` (`x_c=1`, `x_r = -Σ_{m=c}^{r-1} A_ii[r][m]·x_m`), writes column `c` of `T_ii`. 16 columns in parallel, ≤16 serial MACs each, all 4 blocks on 4 warps simultaneously. **No tensor cores here, and no reduced precision** - this is where the strongest coupling lives (see §4). + +**Phase O - off-diagonal, mma.** For each i>j: accumulate `P_ij = Σ_m A_im·T_mj` (δ block-products), then `T_ij = -T_ii·P_ij`. All on `mma.sync` (`16×16×16` = `2 n-tiles × 2 k-steps` = 4 m16n8k8 per block-product). + +**Apply.** `U = T·RHS`: warp `w` owns output rows `16w..16w+15`, sweeps all `dv=128` (16 n-tiles) × `C=64` (8 k-steps) = 128 m16n8k8/warp. This is the bulk and is embarrassingly parallel. + +`A`, `P` (the QK term), `RHS`, and `T` are all assembled from tf32 Gram mma's (`KK`,`QK`,`KS`,`QS` - the PoC-proven step-1/2 plus step-3/4) with **all decay/`gamma`/`beta` applied in f32 outside the mma** (preserves bounded de-gating). + +## 2. Tile schedule - keeping the triangular dependency off the warps + +Block = 128 threads = 4 warps; **"warp == 16-row m-tile" throughout** (same mapping as the PoC's C=64 kernel, `rowbase = warp*16`, `g=lane>>2`, `t=lane&3`). Three layered mechanisms keep the warps busy despite the triangular dependency: + +**(a) Wavefront (anti-diagonal) parallelism in `T`-formation.** The 6 off-diagonal blocks have a critical path of only `n_b-1=3`, not 6. 
Group by distance `δ=i-j`: + +| Wave | Blocks (δ) | count | depends on | mapped to | +|---|---|---|---|---| +| D | (0,0)(1,1)(2,2)(3,3) | 4 | - | 4 warps ‖ | +| 1 | (1,0)(2,1)(3,2) | 3 | D | 3 warps ‖ | +| 2 | (2,0)(3,1) | 2 | D,1 | 2 warps ‖ | +| 3 | (3,0) | 1 | D,1,2 | 1 warp | + +Within each wave all blocks are independent → one block per warp, no intra-wave serialization. Critical path = 4 dependency levels. Total `T`-formation mma: ~10 accumulation block-products + 6 inverse-applies = ~16 block-products × 4 = **~64 m16n8k8**, vs the apply's **512** (128/warp × 4) - so `T`-formation is ~12% of apply width and carries the only dependency. + +**(b) Confinement.** Because we form `T` then apply, the dependency-laden work is the ~64-mma `C×C` formation; the 512-mma `dv`-wide apply has zero triangular dependency. The serial chain never touches the throughput-critical width. + +**(c) Latency hiding via RHS overlap.** `T` depends only on `K` (→ `A` ← KK Gram). `RHS` depends on `V` and `S0^T k` (KS Gram, `dv`-wide, the expensive RHS term) and is **independent of the solve**. Schedule the wavefront `T`-formation (cheap, short critical path) concurrently with the `dv`-wide KS/QS Grams that build `RHS` and the `O` cross-term. The Phase-D shared scalar inverse (~16 shared round-trips × 4 warps) hides entirely under the KS Gram (thousands of cycles). By the time `T` is ready, `RHS` is staged and the apply fires immediately. 
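The Phase D / Phase O recurrence and the wavefront order can be checked on the host with a small reference model. The sketch below is plain Python in double precision, so it says nothing about tf32 error. The dense `matmul` stands in for the mma block-products, and all function names are hypothetical. It forms `T = A^{-1}` block by block in wave order and counts block-products, which for `n_b=4` should reproduce the ~10 accumulation + 6 inverse-apply figure above.

```python
def matmul(X, Y):
    """Dense product; stands in for a chain of m16n8k8 mma block-products."""
    return [[sum(X[r][p] * Y[p][c] for p in range(len(Y)))
             for c in range(len(Y[0]))] for r in range(len(X))]


def block(A, i, j, b):
    """Return the b x b block (i, j) of A."""
    return [row[j * b:(j + 1) * b] for row in A[i * b:(i + 1) * b]]


def invert_unit_lower(D):
    """Phase D: column-wise forward substitution on a unit-lower-tri block."""
    b = len(D)
    T = [[0.0] * b for _ in range(b)]
    for c in range(b):                 # one column per thread in the design
        T[c][c] = 1.0
        for r in range(c + 1, b):
            T[r][c] = -sum(D[r][m] * T[m][c] for m in range(c, r))
    return T


def form_T(A, b):
    """Return (T, block_products) with T = A^-1 for unit-lower-triangular A.

    Assumes len(A) is a multiple of b.
    """
    n_b = len(A) // b
    T = {(i, i): invert_unit_lower(block(A, i, i, b)) for i in range(n_b)}
    products = 0
    for delta in range(1, n_b):        # wave index = block distance i - j
        for j in range(n_b - delta):   # blocks in one wave are independent
            i = j + delta
            acc = [[0.0] * b for _ in range(b)]
            for m in range(j, i):      # P_ij = sum_m A_im . T_mj
                term = matmul(block(A, i, m, b), T[(m, j)])
                acc = [[x + y for x, y in zip(ra, rt)]
                       for ra, rt in zip(acc, term)]
                products += 1
            T[(i, j)] = [[-v for v in row] for row in matmul(T[(i, i)], acc)]
            products += 1              # the T_ii . P_ij inverse-apply
    n = n_b * b
    full = [[0.0] * n for _ in range(n)]
    for (i, j), blk in T.items():
        for r in range(b):
            for c in range(b):
                full[i * b + r][j * b + c] = blk[r][c]
    return full, products
```

A model like this could serve as the shape of the f64 host oracle mentioned for validation, with the tf32 and 3xtf32 variants compared against it per rung.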
+ +**Shared/register budget (C=64, state register-resident per the scope):** + +| Buffer | bytes | note | +|---|---|---| +| `Kc`,`Qc` (bf16/tf32-staged) | 16KB+16KB | Gram operands | +| `A`→`T` scratch (`C×C` f32, in place) | 16KB | `A` consumed into `T`; reuses scope's A/P slot | +| `RHS`/`U` (`C×dv`) | 16-32KB | bf16 for the P·U and KᵀU mma's | +| diag-inverse scratch | ~1KB | `16×16` per warp, transient | +| gates `cs/gam/beta` | <1KB | f32 | +| state `S` | 0 (registers) | frees the 64KB that forced 0031's C=16 | + +Total ~65-80KB, under the 99KB opt-in - the solve adds **no** net shared pressure (T overwrites A; diag scratch is transient). Per-thread diag-inverse needs ~16 regs (one column of `x`), released before the apply - does not compound the already-heavy state-accumulator register budget. + +## 3. Precision risk assessment + +**Error model.** `‖ΔU‖/‖U‖ ≲ κ(A)·(‖ΔA‖/‖A‖ + ‖ΔRHS‖/‖RHS‖) + ‖Δ_apply‖/‖U‖`. The inverse is the amplifier; `κ(A)` is data-dependent. For DeltaNet, keys are L2-normalized so `|k_t·k_t'|≤1` ⇒ `|N[t][t']|≤beta_t≤1`; in the decaying regime `‖N‖<1` and `κ` is modest, but in the weak-decay/aligned-keys corner `κ` grows and the `δ=3` column path (`T_30`) compounds 3 multiplies. tf32 input rounding is ~`2⁻¹¹`≈`5e-4` relative (f32-accumulate; PoC measured Gram NMSE ~`3e-9`). 3xtf32 (3-limb split, the CUTLASS fp32-emulation trick) buys ~f32 (~`1e-7`) at ~3× that step's mma cost. + +**Where the strong coupling actually sits (the key structural fact):** the *inverse* `T_ii` is computed f32-exact, **but the dominant near-diagonal mixing is applied in the tf32 apply GEMM** (`U_i ⊃ T_ii·RHS_i`), and block-boundary adjacent pairs (e.g. tokens 15↔16) live in the `δ=1` off-diagonal `T_10`. So "f32 protects the strong coupling" is only true for the inverse *computation*; its *application* is tf32 unless promoted. This drives the ladder. + +**Precision config + 3xtf32 ladder (mandatory vs optional):** + +| Step | Default | Mandatory? 
| 3xtf32 cost | +|---|---|---|---| +| Diagonal inverse `T_ii` | **f32 (shared scalar)** | **Mandatory-and-free** (it's already f32) | n/a | +| Off-diag coupling `A_im·T_mj`, `T_ii·P_ij` | **3xtf32 (default-on)** | Effectively mandatory; ~3× of ~64 tiny mma = **negligible** | free insurance | +| KK/QK Gram → A,P | tf32 | optional (rung 1) | 3× of C×C Grams (cheap) | +| Apply `U=T·RHS` | tf32 | optional (rungs 2-4) | up to 3× the bulk | +| KS/QS Gram → RHS, O | tf32 | optional (rung 5) | vLLM keeps these bf16 (L4-rejected precedent) | + +Decays/`gamma`/`beta` **always f32, outside the mma** - invariant, not a rung. + +**Ladder ordering if the default config misses the KL-gate (cheapest → most expensive):** +1. KK Gram (feeds `A`) → 3xtf32 [cheap, C×C]. +2. Apply **block-diagonal terms only** `T_ii·RHS_i` → 3xtf32 [≈+0.8× apply; protects within-window strong coupling - mixed-precision-by-distance]. +3. + apply `δ=1` off-diagonal terms → 3xtf32 [covers block-boundary adjacent pairs]. +4. Full apply → 3xtf32 [≈+2× apply; expensive escape hatch]. +5. KS/QS Gram → 3xtf32. +6. Fall back to direct blocked back-substitution against RHS in 3xtf32 (the alternative route, slightly more accurate than form-`T`-then-multiply at the cost of the parallelism), else keep 0031's serial path. + +**Adversarial `g∈[-20,-1e-4]`:** strong decay ⇒ `d=exp(big-negative)→0` ⇒ off-diagonal `N→0` ⇒ `A≈I`, `T≈I`, apply≈identity, tf32 error vanishes; bounded de-gating (f32) guarantees underflow-to-zero, never inf. Weak decay (`g→0`) ⇒ `d≈1`, `A` well-conditioned, tf32's 8-bit exponent (vs f16's 5) holds the `gamma` dynamic range. The dangerous middle is the only KL-empirical risk - re-run this op case explicitly per the scope. 
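The precision rungs above all rest on the same trick: write an f32 value as a tf32 `hi` part plus a tf32 `lo` remainder and sum the three limb products hi·hi, hi·lo, lo·hi. The sketch below emulates that on the host with bit manipulation. It is an illustration, not the kernel path. The function names are hypothetical, tf32 rounding is modelled as round-to-nearest with ties away from zero on the low 13 mantissa bits, and accumulation uses Python doubles as a stand-in for the f32 accumulator.

```python
import struct


def f32(x: float) -> float:
    """Round a Python float to the nearest f32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def tf32(x: float) -> float:
    """Round an f32 value to tf32 precision (10 explicit mantissa bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x1000) & 0xFFFFE000  # nearest, ties away from zero
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def split(x: float):
    """Split an f32 value into tf32 limbs (hi, lo) with x ~= hi + lo."""
    x = f32(x)
    hi = tf32(x)
    lo = tf32(f32(x - hi))
    return hi, lo


def dot_tf32(a, b) -> float:
    """Plain tf32: both operands rounded once, one product per element."""
    return sum(tf32(x) * tf32(y) for x, y in zip(a, b))


def dot_3xtf32(a, b) -> float:
    """3xtf32: three limb products per element, small terms added first."""
    total = 0.0
    for x, y in zip(a, b):
        xh, xl = split(x)
        yh, yl = split(y)
        total += xh * yl + xl * yh + xh * yh
    return total
```

Comparing both results against a double-precision dot product of the same f32 inputs gives a quick per-rung error estimate of the kind the ladder ordering relies on.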
+ +**KL impact / gating.** Same gate as the backend's new-FP-path precedents: NMSE is expected to *fail* at reduced precision (this is a new path on a new path) - **the binding gate is KL** (`KLD(tc‖f16) ≤ KLD(seq‖f16)` + PPL band) plus greedy-md5 stability (md5 will not match 0031's serial path - per-path, validated benign). Expectation: the **default config (f32 diagonal + 3xtf32 off-diagonal-coupling + tf32 everything-else)** clears the KL-gate, because (i) the dominant apply matches the PoC Gram's `~3e-9`/tf32-input grade and (ii) the recursion-amplified `C×C` work is f32-grade for free. The expensive apply-3xtf32 rungs are reserved escapes. Worst case all-3xtf32 ≈ 3× the mma cost - still an order of magnitude under 0031's serial-f32 reductions and still net-positive given the `~C×` state-BW cut. + +## 4. Integration + validation + +- Build on `ggml/src/ggml-cuda/mma.cuh`: the tf32 path is `mma(tile<16,8,float>&D, tile<16,8,float>&A, tile<8,8,float>&B)` → `mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32` (line ~1089), gated by `AMPERE_MMA_AVAILABLE` (sm_121-correct). tf32 operands stage to shared and load via `load_generic` (or the PoC's `cvt.rna.tf32.f32` register packing); `ldmatrix` is `.b16`-only so it is **not** usable for tf32 fragments - use `load_generic`. 3xtf32 = 3 limb-split tf32 mma's + f32 recombine, reusing the same tile type. +- Slots into patch 0031's `gated_delta_net_chunked_cuda`: replace the serial `KK`/`QK` Grams (lines 183-198, 222-235) with the PoC tiles; replace the serial forward-subst (lines 211-218) with this Phase-D + Phase-O `T`-formation; replace the serial `O`/`S_C` reductions (steps 5/6) with the apply + `KᵀU` mma. `C=16`→`C=64`, state register-resident. 
+- Validation step before the full kernel: extend `~/scratch_tc_gdn_poc/gdn_gram_bench.cu` (DGX) to microbench Phase-D+O `T`-formation and the `T·RHS` apply in isolation, with the f64 host oracle measuring `κ(A)` and the tf32-vs-3xtf32 NMSE per rung - mirrors the Gram PoC that proved 6.7×→9.3×. This tells you, cheaply, whether the default precision config will clear the KL-gate before committing to the multi-week kernel (Phase 2 of the scope's plan). + +**Net:** the solve adds zero shared pressure, ~12% mma overhead over the apply, a 4-level (not 64-step) critical path fully hidden under the RHS Grams, an f32-exact inverse, and a cheap-by-default 3xtf32 placement aimed exactly at the recursion-amplified term - so the A-inverse is **not** the bottleneck and **not** the precision wall; the residual risk is the data-dependent `κ(A)` in the weak-decay corner, arbitrated by the KL-gate with the apply-3xtf32 rungs as the escape. + +Files referenced (absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/TENSORCORE_GDN_SCOPE.md`; `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0031-paged-chunked-gdn-prefill-scan-kernel.patch`; ggml tf32 mma at `ggml/src/ggml-cuda/mma.cuh` (line ~1089); DGX PoC `~/scratch_tc_gdn_poc/gdn_gram_bench.cu`. + +## 3. Register-resident state + C=64 budget + +This design draws on the scope doc, the full 0031 kernel (smem layout, the 7 product families, the chunk-loop sequencing), and the sibling FP4-MMA GEMM kernel's concrete patterns (PAD=4 u32 rows, STAGES=2 cp.async, WM=4/WN=2 8-warp grid, ldmatrix, register-resident accumulators). The concrete data-layout design follows. + +--- + +# Register-resident state + C=64 design (tensor-core chunked GDN, sm_121a / mma.sync) + +## 0.
Block/warp configuration (the frame everything hangs on) + +0031 used `block_dims(128,1,1)` = 128 threads, "thread j owns v-column j" - a **column-parallel scalar** model. The tensor-core kernel must abandon that and adopt a **warp-tiled** model (same as the sibling GEMM kernel): + +- **256 threads = 8 warps**, arranged as a **WARPS_M x WARPS_N = 4 x 2** warp grid (the GEMM kernel's proven `WM=4, WN=2`). +- Threads no longer own columns; warp `(wm,wn)` owns a rectangular sub-tile of each matrix and drives `mma.sync` on it. +- Precision: **tf32 m16n8k8** for the S-coupled / decay-coupled products, **bf16 m16n8k16** allowed only for the well-conditioned intra-chunk Gram terms (KK, QK). f32 accumulate throughout. Decays/`gamma`/`beta` stay f32, applied outside the mma (preserve bounded de-gating). + +This 4x2 warp grid is the denominator for every ownership calc below. + +--- + +## 1. The one hard problem: S is an *accumulator* for step 6 but an *operand* for steps 3/4 + +This is the crux the scope hand-waves ("read S as the stationary operand; step 6 accumulates into it"). The register fragment layouts are **not** interchangeable: + +| Use | Role | mma shape | S indexing | Fragment layout | +|---|---|---|---|---| +| Step 6 `S += Kᵀ(D·U)` | **accumulator (D/C)** | m=dk, n=dv, k=C | `S[i][j]`, m=i, n=j | `tile<16,8,float>` acc grid | +| Step 3 `KS = K·S` | **B operand** | m=C, n=dv, k=dk | `S[i][j]`, k=i, n=j | `tile<8,8,float>` B frag | +| Step 4 `QS = Q·S` | **B operand** | m=C, n=dv, k=dk | same as step 3 | `tile<8,8,float>` B frag | + +An accumulator fragment's thread→element map differs from a B-operand fragment's, so **you cannot feed the persistent S registers directly into the step-3/4 mma.** A bridge is mandatory. The design decision: + +> **S lives register-resident in the step-6 ACCUMULATOR layout** (it is written every chunk; that is the hot path). 
Steps 3/4 reach it via a **once-per-chunk restage to a small smem tile, re-read with `load_generic`** as B-operand fragments (`ldmatrix` is `.b16`-only, so it cannot load the f32/tf32 state fragments). + +The restage cost is paid `n_tokens/C` times (not per token) - it is *inside* the BW saving the whole lever buys. And critically, the restage smem **time-multiplexes onto the Uc/Amat region**: at chunk entry (when KS/QS are needed) U and A for this chunk are not yet computed, so their buffers are free to hold the S restage. **Net additional persistent smem for the state: 0KB** - the scope's "0KB shared state" holds, with this scheduling caveat made explicit. + +KS and QS read the **same** pre-update S0, so restage once → do both → then overwrite with U. + +--- + +## 2. Register allocation map (per thread, 256-thread block) + +State `S` is `dk x dv = 128 x 128` f32 = 16384 elems. Distributed over 256 threads = **64 f32/thread** at full dv. + +| Register class | Lifetime | Full dv=128 | dv-slab=64 | dv-slab=32 | Layout / ownership | +|---|---|---|---|---|---| +| **Persistent S accumulator** | whole chunk loop | **64 regs** | **32 regs** | **16 regs** | warp `(wm,wn)` owns dk-rows `[wm·32, +32)` x dv-cols `[wn·(dv/2), +dv/2)`; = 2 m-tiles x (dv/2/8) n-tiles of `tile<16,8,float>`, 4 f32 each | +| Transient A-operand frag | per product | 4 regs/tile | same | same | `tile<16,8,float>` (tf32 packs k8) reused across KK/QK/KS/QS/O/Supd | +| Transient B-operand frag | per product | 2 regs/tile | same | same | `tile<8,8,float>` | +| Transient product accumulator (KK/QK/KS/QS/O) | per product, then spilled to smem | ≤8 tiles·4 = 32 regs | ≤16 | ≤8 | these outputs go to smem; acc is transient, reused | +| A⁻¹ diagonal-block solve (16x16, in-registers) | step 7 only | ~8-12 regs | same | same | one `b=16` unit-lower-tri block per warp-row, scalar Neumann/`` grid), 64/32/16 f32-regs/thread at dv 128/64/32 | +| **Accumulator↔operand bridge** | once-per-chunk `load_generic` restage of S to a small smem tile that **overlays Uc∪Amat** (0 net persistent smem); KS+QS
share one restage | +| K/Q precision | **bf16** staged (tf32 K/Q breaks the 99KB budget); tf32/f32 reserved for S-coupled + decay-coupled terms | +| Uc / Amat | f32, padded on C (+4) | +| **PAD** | +4 f32 (+8 bf16 = +4 u32) row stride → `ldmatrix` 8-row conflict-free (GEMM-proven) | +| **C=64 default budget** | **≈86KB**, 1 block/SM (CONFIG A) ✅ | +| 2 blocks/SM | C=32 + dv-slab=64 → ≈31KB/block, grid x2 (CONFIG C) | +| dv-slab | grid `(H, n_seqs, n_slabs)`; A/A⁻¹/gates recomputed per slab; S/Uc/O dv-sliced | +| cp.async | STAGES=2 on Kc/Qc (Phase 3 only) | + +One honest caveat surfaced beyond the scope doc: the scope's "~64-80KB / 0KB shared state" budget only holds with **bf16 K/Q staging** and the **overlay restage**; tf32 K/Q (CONFIG B) does not fit even with dv-slab, which is why bf16-Gram + tf32-only-for-S/decay is the forced (and scope-consistent) precision split. + +Files referenced: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/TENSORCORE_GDN_SCOPE.md`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0031-paged-chunked-gdn-prefill-scan-kernel.patch`, and the sibling GEMM patterns in `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch`. + +## 4. Occupancy + launch config + +Key facts confirmed: tf32 tile is `mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32` (mma.cuh:1089, accumulator `tile<16,8,float>` = 4 f32/thread/tile); grid.x = `H` = `n_v_heads` = `ssm_dt_rank` (llama-model.cpp:504; Qwen3-Next family = 32 GDN value heads); the sequential kernel already uses a 3D grid `(H, n_seqs, ceil(S_v/num_warps))` (gated_delta_net.cu:184) - the chunked 0031 collapsed that z-axis to 1, which is exactly what starves the grid. + +The occupancy + launch design follows.
+ +--- + +# Occupancy + launch design — tensor-core chunked GDN prefill (sm_121a) + +## 0. The two independent caps 0031 hit (must relieve BOTH for ≥2 blocks/SM) + +0031's -22% is not one wall, it's two stacked walls, and they are relieved by *different* levers: + +| Cap | 0031 value | Binding resource | Lever | +|---|---|---|---| +| **Shared-memory cap** | 89 KB (64 KB all-shared state) | 100 KB/SM, 99 KB dyn opt-in | state→registers **+ smaller C** | +| **Register cap** | n/a (was scalar) | 65536 regs/SM | **dv-slab** the register-resident state | +| **Grid cap** | `(H, n_seqs, 1)` = 32·n_seqs blocks | 48 SMs | **dv-slab multiplies grid** by n_slabs | + +sm_120/121-class per-SM limits used throughout: **1536 threads/SM, 65536 32-bit regs/SM, 100 KB shared/SM (99 KB dynamic opt-in), 255 regs/thread, ≤24-32 blocks/SM (hw, never the binding limit here).** The binding limits are **shared and registers.** + +Critical correction to the scope-doc budget table: it assumes **bf16** K/Q staging (2 B). The precision default is **tf32**, which is a 32-bit container in shared — tf32 K/Q would *double* Kc/Qc and blow C=64 past 99 KB. So the occupancy config **stages K/Q as bf16** (the well-conditioned KK/QK Gram products per the scope's "bf16 only for Gram terms"), keeps gates/decays/beta/the solve in f32. This is a real precision↔occupancy coupling, flagged in §5. + +## 1. Grid mapping — three parallel axes, the chunk axis is serial + +The inter-chunk recurrence carries state `S` across chunks, so **the chunk axis cannot be a grid axis** (it's the sequential dependency — that's the whole algorithm). The only legitimate grid axes that don't break the recurrence are: + +``` +dim3 grid(H, n_seqs, n_slabs); // H = n_v_heads = 32 (ssm_dt_rank) + // n_slabs = dv / dv_tile (the new lever) +``` + +- `blockIdx.x = head` (0..31), `blockIdx.y = seq`, `blockIdx.z = dv-slab`. 
+- A block owns v-columns `[z·dv_tile, (z+1)·dv_tile)`, walks the chunk loop serially, and keeps **only its `dk × dv_tile` state slab** register-resident. +- This reuses the **same 3D grid shape the sequential kernel already has** (gated_delta_net.cu:184 uses z for S_v-splitting); the chunked kernel repurposes z from S_v-split to dv-slab. The dispatcher change is minimal. + +**Saturation math (the core of the task).** Target ≥2 blocks/SM on 48 SMs ⇒ **≥96 concurrent blocks**. With H=32: + +| n_seqs | dv_tile=128 (n_slabs=1) | dv_tile=64 (2) | dv_tile=32 (4) | +|---|---|---|---| +| 1 | 32 (starved, 0031) | 64 (48/48 SMs busy, 67% warp-occ) | **128 (100%)** | +| 2 | 64 | **128 (100%)** | 256 | +| 4 | 128 | 256 | 512 | + +So **dv-slabbing is simultaneously the register-relief lever and the grid-multiplier** — it's the single most important move. Rejected grid alternatives: split-K over dk (needs cross-block atomic reduction + fights the state carry); batching heads/seqs per block (reduces grid, wrong direction). + +## 2. Block dim / warp count — 8 warps / 256 threads + +``` +constexpr int WARPS = 8; +dim3 block(32 * WARPS, 1, 1); // 256 threads +``` + +Why 8 warps: +- **Clean mma tile partition at C=32:** KK/QK output is `C×C = 32×32` = (32/16)·(32/8) = **8 m16n8 tiles → exactly 1 tile/warp**, dk=128 = 16 k8-steps. Steps 3/4 (KS/QS) and 5 (P·U) → 2 tiles/warp. Step 6 state update `dk×dv_tile`=128×64 → 64 tiles → **8 tiles/warp** (these are the persistent register-resident accumulators). +- **Register dilution:** the register-resident state accumulator is spread across all 256 threads (see §3) — more warps = fewer state-regs/thread. +- **Threads are not the cap:** 256 threads ⇒ up to 6 blocks/SM by the 1536 thread limit, so registers/shared decide. + +Fallback if register-capped (§5): **12 warps (384 threads)** dilutes the state accumulator further (dv_tile=64: 32→21 state-regs/thread) at the cost of thinner per-warp tiles and ≤4 blocks/SM by threads. + +## 3. 
Register-resident state ↔ dv-slab ↔ occupancy interaction + +The state slab is held as **tf32 mma accumulator fragments** (`tile<16,8,float>`, 4 f32/thread/tile) persisting across the chunk loop. Per-thread state-register cost = `dk·dv_tile / 256`: + +| dv_tile | state f32/block | state regs/thread (256 thr) | + working (est.) | regs/thread | regs/block | reg-allowed blocks/SM | +|---|---|---|---|---|---|---| +| 128 (no slab) | 16384 | 64 | ~50 | ~114 | ~29 K | 2 (tight) | +| 64 | 8192 | 32 | ~50 | ~82 | ~21 K | 3 | +| 32 | 4096 | 16 | ~50 | ~66 | ~17 K | 3 | + +So on registers alone, dv_tile≤64 admits ≥2 blocks/SM. **Shared memory is then the binding cap**, and it's governed by **C**, not dv_tile (Kc/Qc/A all scale with C, only U scales with dv_tile): + +| Config | Kc+Qc (bf16) | A/P (f32) | U (f32) | single | +cp.async dbl-buf K/Q | blocks/SM (shared) | +|---|---|---|---|---|---|---| +| C=64, dv_tile=128 | 32 KB | 16 KB | 32 KB | 80 KB | 112 KB ✗(no room!) | **1** | +| C=64, dv_tile=64 | 32 KB | 16 KB | 16 KB | 64 KB | 96 KB ✓ | **1** | +| **C=32, dv_tile=64** | 16 KB | 4 KB | 8 KB | **28 KB** | **44 KB ✓** | **2** | +| C=32, dv_tile=32 | 16 KB | 4 KB | 4 KB | 24 KB | 40 KB ✓ | **2** | + +**Finding the scope doc missed:** C=64-no-slab is shared-saturated at 80 KB — there is **no room for cp.async double-buffering**, so the 1-block/SM kernel would have *no latency hiding* and likely still lose. C=64 needs dv_tile≤64 *just to make room for cp.async*, and is still 1 block/SM. **Genuine 2 blocks/SM requires C=32** (to drop Kc/Qc/A under the 49.5 KB/block budget). + +## 4. cp.async double-buffering (depth 2, no TMA) + +At 1 block/SM (C=64 path) cp.async is the *only* latency-hiding mechanism, so it's mandatory, not optional. 
Plain Ampere `cp.async` (`cp.async.commit_group` / `cp.async.wait_group`) — **no `cp.async.bulk`/TMA on sm_121.** Stage the **next chunk's Kc, Qc** (and Vc if the KL-gate doesn't force V from global) into a second shared buffer while the current chunk's mma runs. Depth **2 only** — the sibling GEMM kernel proved multistage saturates BW past depth 2. The double-buffer cost is already in the "+cp.async" column above (44 KB at C=32 keeps 2 blocks/SM). + +## 5. Launch config (concrete) + honest occupancy estimate + +**Recommended default (batched-prefill serving regime, n_seqs≥2):** +``` +C = 32 ; dv_tile = 64 ; n_slabs = 2 ; WARPS = 8 +grid = dim3(H=32, n_seqs, 2) +block = dim3(256, 1, 1) +smem = 44 KB (Kc/Qc bf16 ×2 dbl-buf + A/P f32 + U f32) // cudaFuncSetAttribute return CHECKED (0031 precedent) +→ 2 blocks/SM. n_seqs≥2 ⇒ ≥128 blocks ⇒ 48/48 SMs at full 2-block occupancy (100%), 1.33 waves. + A/Gram/solve recomputed 2× across slabs (state-update per slab is 2× the A work ⇒ ~25% redundant-flop overhead). +``` + +**Single-stream prefill (n_seqs=1) saturator:** +``` +C = 32 ; dv_tile = 32 ; n_slabs = 4 ; WARPS = 8 +grid = dim3(32, 1, 4) = 128 blocks ⇒ 2 blocks/SM on all 48 SMs (100%) even at n_seqs=1. +Cost: A recomputed 4×, and at dv_tile=32 the A bucket ≈ the per-slab state bucket ⇒ ~1.5-2× total-flop overhead. +``` + +**BW-max alternative (1 block/SM, bench against the above):** +``` +C = 64 ; dv_tile = 64 ; n_slabs = 2 ; WARPS = 8 ; smem = 96 KB (dbl-buf, fits 99 KB) +→ 1 block/SM, but 4× state-BW cut (vs 2× at C=32) + grid ×2. n_seqs=1 ⇒ 64 blocks ⇒ 48/48 SMs busy (67% warp-occ). 
+``` + +**Occupancy summary:** + +| Config | blocks/SM | regs/thread | smem/block | SM util @ n_seqs=1 | SM util @ n_seqs≥2 | state-BW cut | redundant-A | +|---|---|---|---|---|---|---|---| +| 0031 | 1 | scalar | 89 KB | 32/48 busy (starved) | 1024 blk, no overlap | 1× (C=16) | none | +| C=32 dv64 (default) | **2** | ~82 | 44 KB | 48 busy, 67% occ | **100%** | 2× | 2× (~25%) | +| C=32 dv32 (1-seq) | **2** | ~66 | 40 KB | **100%** | 100% | 2× | 4× (~1.5-2×) | +| C=64 dv64 (BW-max) | 1 | ~114 | 96 KB | 48 busy, 67% occ | 100%, multi-wave | **4×** | 2× | + +The C=32 (occupancy) vs C=64 (BW) choice is the empirical fork the scope doc defers to Phase-3 bench: 2 blocks/SM at half the BW saving, vs 1 block/SM at full BW saving + cp.async. **Wire both behind the existing `GDN_CHUNK_MIN` gate plus a `GDN_CHUNK_C` / `GDN_DV_TILE` selector and A/B them; do not assume.** + +## 6. Residual risk — register pressure likely caps it at 1 block/SM (honest) + +The ≥2-blocks/SM result rests on the **~50 working-regs/thread estimate**, which is optimistic: + +- **The blocked-forward-subst A⁻¹ (step 7) is the swing factor.** The in-register 16×16 unit-lower-triangular diagonal inverse + the off-diagonal mma coupling + mma operand fragments + the **accumulator→operand fragment transpose** for reusing the register-resident S as a step-3/4 operand (a `movmatrix`/shared round-trip, since S lives in C-fragment layout but steps 3/4 need it as an A/B operand) can push working regs to **80-120**. At 256 threads, regs/thread > 128 ⇒ > 32 K regs/block ⇒ **silently drops to 1 block/SM** regardless of the 44 KB shared headroom. The scope doc names this exactly: "blocked-forward-subst register pressure trades against state-register pressure; both compete for the same budget." 
+- **Mitigation ladder, in order:** (i) 12 warps to dilute (dv_tile=64: state 32→21 regs/thread); (ii) `__launch_bounds__(256, 2)` to force the compiler under 128 regs/thread (risks spills to local → BW back); (iii) smaller dv_tile (more grid, more redundant A). If all fail, accept **1 block/SM and lean on cp.async double-buffering + the 4× BW cut + mma throughput** — which is still very likely a win over 0031's serial-f32/-22%, just not the 2-block target. +- **Grid-starvation at n_seqs=1 is structural** (H=32 < 96): only dv_tile=32 fully saturates a single stream, and it pays ~1.5-2× redundant-A flops. This is an inherent floor of the chunked formulation on a 32-head layer; batched serving (n_seqs≥2) is where the design cleanly saturates. +- **Precision↔occupancy coupling:** the 2-block budget assumes **bf16 K/Q** staging. If the KL-gate demands tf32 for the KS/QS S0-products (decay-coupled), that needs a second 32-bit K/Q copy or 3×tf32 — both inflate shared/registers and can knock C=32 back toward 1 block/SM. The occupancy win is contingent on bf16 Gram clearing the gate. + +**Bottom line:** 2 blocks/SM and full 48-SM saturation are *reachable* — via **C=32 + dv-slab (64 for serving, 32 for single-stream) + 8 warps + depth-2 cp.async** — on the **shared-memory** budget. Whether the **register** budget also permits it is the one genuinely open risk, and the A⁻¹ solve's footprint is what decides it; the realistic outcome remains the scope doc's honest **1-2 blocks/SM**, with the win carried by mma + cp.async + the C× BW cut rather than by high occupancy. 
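The register, shared-memory, and grid arithmetic behind the tables above can be reproduced with a small host-side calculator. The sketch below is a simplification under stated assumptions: it uses the per-SM limits quoted in the text (1536 threads, 65536 registers, 99 KB dynamic shared), ignores row padding and register-allocation granularity, and the names `smem_bytes`, `blocks_per_sm`, and `grid_blocks` are hypothetical. The compiler's real register count (and the CUDA occupancy API) remain the authority.

```cpp
#include <algorithm>
#include <cstddef>

// Per-SM limits as quoted in the text for sm_120/121-class parts.
constexpr int kMaxThreadsPerSM = 1536;
constexpr int kRegsPerSM = 65536;
constexpr std::size_t kSmemPerSM = 99u * 1024u;  // dynamic opt-in ceiling

// Shared bytes per block: bf16 Kc+Qc, one f32 A/P scratch, f32 U, and an
// optional second Kc/Qc stage for cp.async double-buffering. Padding ignored.
inline std::size_t smem_bytes(int C, int dk, int dv_tile, bool double_buffer_kq) {
    const std::size_t c = static_cast<std::size_t>(C);
    const std::size_t kq = 2 * c * static_cast<std::size_t>(dk) * 2;  // two bf16 tiles
    const std::size_t a = c * c * 4;                                  // C x C f32
    const std::size_t u = c * static_cast<std::size_t>(dv_tile) * 4;  // C x dv_tile f32
    return kq + a + u + (double_buffer_kq ? kq : 0);
}

// Resident blocks per SM = the tightest of the thread, register, and shared caps.
inline int blocks_per_sm(int regs_per_thread, int threads_per_block, std::size_t smem_per_block) {
    const int by_threads = kMaxThreadsPerSM / threads_per_block;
    const int by_regs = kRegsPerSM / (regs_per_thread * threads_per_block);
    const int by_smem = static_cast<int>(kSmemPerSM / smem_per_block);
    return std::min({by_threads, by_regs, by_smem});
}

// Total blocks launched by grid (H, n_seqs, n_slabs).
inline int grid_blocks(int H, int n_seqs, int n_slabs) {
    return H * n_seqs * n_slabs;
}
```

With the document's numbers, `smem_bytes(32, 128, 64, true)` gives the 44 KB of the recommended default, `blocks_per_sm(82, 256, ...)` on that budget is shared-limited at 2, and any estimate above 128 registers per thread drops the result to 1 regardless of shared headroom.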
+ +Files: design lives in `backend/cpp/llama-cpp-localai-paged/docs/TENSORCORE_GDN_SCOPE.md` (§3 occupancy plan — this analysis sharpens it); kernel/dispatcher to modify `backend/cpp/llama-cpp-paged-dev/ggml/src/ggml-cuda/gated_delta_net.cu` (grid at :184, chunked launch added by patch `patches/paged/0031-paged-chunked-gdn-prefill-scan-kernel.patch`); tf32 tile at `ggml/src/ggml-cuda/mma.cuh:1089`; H source `src/llama-model.cpp:504`. + +## 5. Synthesized build plan + milestones + gate + +Historical note: this plan predates the shipped f32-only M5 tensor-core GDN +path in patch `0047`. Current code parses `GDN_CHUNK_MIN` and `GDN_TC`; the +older `GDN_CHUNK_OFF` and `GDN_CHUNK_TC` names in this section are obsolete. +Phase71 revalidated the current default against sequential-disabled and +serial-chunked modes on DGX and kept M5 as shipped. Use this document as +background for any larger FLA/CuteDSL-class redesign, not as the active next +patch queue. + +All anchors were verified at the time of writing. 0031's kernel body, the +7-step structure, the `GDN_CHUNK_MIN` gating at the `if constexpr (!KDA && +!keep_rs_t)` site, the `launch_gdn_chunked<128,16>` template, the smem formula, +and the test-backend-ops shapes were confirmed. The scope doc's KL gate, +3xtf32 ladder, risk register, and Phase 0-3 plan were confirmed. Here is the +historical build-ready synthesis. + +--- + +# BUILD-READY PLAN: tensor-core chunked-GDN prefill kernel (sm_121a) + +Anchors (absolute): kernel + launch + gate live in `ggml/src/ggml-cuda/gated_delta_net.cu` (grid at :184), added by `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0031-paged-chunked-gdn-prefill-scan-kernel.patch`. tf32 tile `mma(tile<16,8,float>&D, tile<16,8,float>&A, tile<8,8,float>&B)` = `mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32` in `ggml/src/ggml-cuda/mma.cuh` (m16n8k8 overload ~976-984, dispatch ~1089), gated by `AMPERE_MMA_AVAILABLE`. 
PAD/cp.async patterns from `patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch`. Gate/precedent docs: `docs/TENSORCORE_GDN_SCOPE.md`, `docs/PAGED_BITEXACT_NOTE.md`, `README.md` s5. Microbench: `~/scratch_tc_gdn_poc/gdn_gram_bench.cu` (DGX). Last patch in series is 0042 → this work is patches 0043+. + +The new kernel is `gated_delta_net_chunked_tc_cuda`, a sibling to 0031's `gated_delta_net_chunked_cuda`. Symbols below reuse 0031's smem names (`Sd, Kc, Qc, Ud, Amat, csh, gam, bet`). + +--- + +## (1) Phase-by-phase kernel structure + +Block = **256 threads / 8 warps** in a **4×2 (WM×WN)** warp grid. State `S` (`dk×dv_tile`) is **register-resident in the step-6 accumulator layout** (`tile<16,8,float>` grid). Grid = `dim3(H, n_seqs, n_slabs)`, `blockIdx.z` = dv-slab. Chunk axis is the serial recurrence (NOT a grid axis). Invariant preserved from 0031: read pre-update `S0` (P3/P4) → solve → output (P5) → **overwrite S last** (P6). Single accumulator, no state double-buffer. + +Per chunk `c0` (the loop body): + +**Phase A - chunk load + gate prefix (f32, cooperative).** Load `Kc[C][dk]`, `Qc[C][dk]` **as bf16** (tf32 K/Q blows the 99KB budget - see §5 of the state design), load `V` chunk. Compute `csh = cumsum(g)` (≤0), `gam = exp(csh)` (≤1), `bet` - all f32, identical to 0031 lines (the `j==0` prefix scan, kept scalar; it is <1KB and hides under the Grams). cp.async depth-2 prefetch of the *next* chunk's `Kc/Qc` starts here. + +**Phase B - state restage (accumulator→B bridge).** The crux. `S0` lives as P6's D/accumulator fragments but P3/P4 need it as a **B operand** (`tile<8,8>`, K-major over `i`). Bounce the `dk×dv_tile` state through a transient smem tile that **overlays the `Ud∪Amat` region** (free at chunk entry - U/A not yet computed) → `load_generic` back as B fragments (NOT `ldmatrix`: it is `.b16`-only, unusable for tf32; use `load_generic`). Paid `n_tokens/C` times, **0KB net persistent smem**. KS and QS share this one restage. 
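Why the Phase-B bridge cannot be skipped can be checked on the host by comparing which lane owns each state element in the two fragment layouts. The sketch below assumes the `m16n8k8` thread-to-element maps from the PTX ISA fragment tables (accumulator: row = lane/4, plus 8 for the upper half, col = 2·(lane%4) + {0,1}; tf32 B operand: k = lane%4, plus 4 for the second register, n = lane/4). Those formulas are an assumption to confirm against `mma.cuh` and the PTX documentation before relying on them, and `acc_owner_lane`, `b_owner_lane`, and `count_lane_mismatches` are hypothetical names.

```cpp
// Lane holding accumulator element (row, col) of a 16x8 f32 C/D fragment.
// Rows 0-7 and 8-15 map to the same lanes (different registers).
inline int acc_owner_lane(int row, int col) {
    return (row % 8) * 4 + col / 2;
}

// Lane that must hold B-operand element (k, n) of an 8x8 tf32 B fragment.
// k and k+4 map to the same lane (registers b0 and b1).
inline int b_owner_lane(int k, int n) {
    return n * 4 + k % 4;
}

// For one 8x8 sub-block S[i][j]: step 6 sees it as (row = i, col = j),
// steps 3/4 see it as (k = i, n = j). Count elements whose owner lane differs.
inline int count_lane_mismatches() {
    int mismatches = 0;
    for (int i = 0; i < 8; ++i) {
        for (int j = 0; j < 8; ++j) {
            if (acc_owner_lane(i, j) != b_owner_lane(i, j)) {
                ++mismatches;
            }
        }
    }
    return mismatches;
}
```

Under these maps almost every element of the sub-block sits in the wrong lane for reuse as a B operand, which is the concrete reason a shared-memory round-trip (or an equivalent shuffle network) is needed once per chunk.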
+ +**Phase C - Gram + state-boundary products (the matmuls that read pre-update S0).** +- **P1 `KK→A`** = `Kc·Kcᵀ`, M=C N=C K=dk, lower-tri (~½ tiles). **tf32-safe** (PoC-proven NMSE ~3e-9). Apply `A = I + tril(βₜ·d(t',t)·KK, -1)` in **f32** after the mma. +- **P3 `KS`** = `Kc·S0`, M=C N=dv K=dk. **3xtf32** (state-boundary, feeds the solve). Output → `Ud` region (becomes RHS). +- **P4 `QS`** = `Qc·S0`, M=C N=dv K=dk, **fused with P3 on the shared S0 B-fragments**. **3xtf32** (γ-attenuated → first demote candidate). Seed the **O accumulator fragments register-resident with `γₜ·QS`** immediately (avoids parking QS in smem through to Phase F). Restage overlay is now free; `Ud`/`Amat` reclaim it. + +**Phase D - A-inverse (form T = A⁻¹ explicitly, then wide apply).** +- **Phase-D inverses:** 4 diagonal `16×16` unit-lower-tri blocks, **f32 in shared-memory column-parallel forward substitution** (thread `c` solves `A_ii x = e_c`). No tensor cores, no reduced precision (this is the strong-coupling amplifier). 4 blocks on 4 warps in parallel, hides entirely under the Phase-C/RHS Grams. +- **Phase-O off-diagonal:** wavefront (anti-diagonal) schedule, critical path `n_b-1=3` not 6. For each i>j: `P_ij = Σ_m A_im·T_mj` then `T_ij = -T_ii·P_ij`, on `m16n8k8`. **3xtf32 default-on** (~64 tiny mma total, negligible). `T` overwrites the `A` scratch in place. + +**Phase E - RHS + apply.** `RHS = βₜ(vₜ - γₜ·KS)` in **f32** (uses P3 result + V) → `Ud`. **`U = T·RHS`** as one dependency-free wide **tf32** GEMM, M=C N=dv K=C (the bulk, 128 mma/warp at full dv), in place → `Ud`. + +**Phase F - intra-chunk output.** +- **P2 `QK→P`** = `Qc·Kcᵀ`, reuse `Amat` (now free after T consumed). **tf32-safe**. Apply `P = d(t',t)·QK` in **f32** (bounded, decay pre-baked - preserves the bounded de-gating invariant). +- **P5 `O += P·U`**, M=C N=dv K=C, P lower-tri (~½ tiles). **tf32-safe** (P f32-bounded first). Accumulate into the O fragments already seeded with `γₜ·QS`. Write `O*scale` to `dst`. 
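The Phase-D / Phase-O `T`-formation described above can be sketched as a scalar host reference, useful as an oracle when microbenching the tensor-core version. This is a minimal sketch under stated assumptions, not the kernel: it runs entirely in f64 (the design uses f32 diagonal blocks and 3xtf32 off-diagonal coupling), it processes off-diagonal blocks serially in wavefront order rather than in parallel, and `blocked_unit_lower_inverse` and `max_residual` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

using Mat = std::vector<double>;  // row-major n x n

// Form T = A^{-1} for a unit-lower-triangular A using b x b blocks.
// Requires n to be a multiple of b; only the strictly lower part of A is read.
Mat blocked_unit_lower_inverse(const Mat& A, int n, int b) {
    auto at = [n](int r, int c) { return static_cast<std::size_t>(r) * n + c; };
    Mat T(static_cast<std::size_t>(n) * n, 0.0);
    const int nb = n / b;

    // Diagonal blocks: column c of T_ii solves A_ii x = e_c by forward substitution.
    for (int bi = 0; bi < nb; ++bi) {
        const int o = bi * b;
        for (int c = 0; c < b; ++c) {
            T[at(o + c, o + c)] = 1.0;
            for (int r = c + 1; r < b; ++r) {
                double s = 0.0;
                for (int k = c; k < r; ++k) s += A[at(o + r, o + k)] * T[at(o + k, o + c)];
                T[at(o + r, o + c)] = -s;
            }
        }
    }

    // Off-diagonal blocks by anti-diagonal distance d = i - j (the wavefront):
    // every T_mj needed at distance d was produced at a smaller distance.
    std::vector<double> P(static_cast<std::size_t>(b) * b);
    for (int d = 1; d < nb; ++d) {
        for (int bj = 0; bj + d < nb; ++bj) {
            const int bi = bj + d;
            std::fill(P.begin(), P.end(), 0.0);
            for (int m = bj; m < bi; ++m)  // P_ij = sum_m A_im * T_mj
                for (int r = 0; r < b; ++r)
                    for (int k = 0; k < b; ++k)
                        for (int c = 0; c < b; ++c)
                            P[r * b + c] += A[at(bi * b + r, m * b + k)] * T[at(m * b + k, bj * b + c)];
            for (int r = 0; r < b; ++r)    // T_ij = -T_ii * P_ij
                for (int c = 0; c < b; ++c) {
                    double s = 0.0;
                    for (int k = 0; k < b; ++k) s += T[at(bi * b + r, bi * b + k)] * P[k * b + c];
                    T[at(bi * b + r, bj * b + c)] = -s;
                }
        }
    }
    return T;
}

// max |A*T - I| over all entries, as a correctness check.
double max_residual(const Mat& A, const Mat& T, int n) {
    double worst = 0.0;
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c) {
            double s = 0.0;
            for (int k = 0; k < n; ++k) s += A[static_cast<std::size_t>(r) * n + k] * T[static_cast<std::size_t>(k) * n + c];
            worst = std::max(worst, std::fabs(s - (r == c ? 1.0 : 0.0)));
        }
    return worst;
}
```

Calling it with `n = 64, b = 16` matches the `C=64` case with four diagonal blocks; a reduced-precision variant can then be compared against this `T` entry by entry.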
+ +**Phase G - state carry (overwrites S0 last).** `DU = d(t,last)·U` in f32. **Scale the persistent S accumulator fragments by `γ_last` in f32 in-register first**, then **P6 `S_C += Kcᵀ·DU`** = `Kcᵀ·DU`, M=dk N=dv K=C, **3xtf32 (the strongest ladder candidate - compounds over every chunk)**, accumulated straight into the persistent registers. `Kc` is read **transposed** here (second fragment view, `load_generic` transpose). No restage-out: S stays resident for the next chunk. + +After the loop: final-state write-back (M-layout), identical to 0031's tail. + +Buffer lifecycle (single `Amat`, single `Ud`, as 0031): `Amat`: A(P1) → T(Phase-D/O, in place) → consumed by apply → P(P2) → consumed by P5. `Ud`: KS(P3) → RHS(Phase-E) → U(apply, in place) → read by P5 (B) and P6 (B, scaled to DU). Restage tile overlays `Ud∪Amat` only at chunk entry (Phase B), before either is written. + +--- + +## (2) Build sequence - incremental, each independently GPU-verifiable vs 0031 + +Each milestone is a **separate patch** stacked on 0031, **green on `test-backend-ops GATED_DELTA_NET` + greedy-md5 stable before the next is started**. Reference for every step = the `test_gated_delta_net` op's f64/CPU oracle (already in-tree) and 0031's serial-chunked output. **No milestone integrates on top of an unverified one.** + +| M | Scope | Patch | GPU verification gate (vs 0031 / op oracle) | +|---|---|---|---| +| **M0** | Re-confirm regime, NO code (scope Phase 0) | - | Profile 0031 (`GDN_CHUNK_MIN` low): confirm GDN prefill bucket dominates + grid-starved at low n_seqs. If not, kill the lever now. | +| **M1** | **DGX microbench (NO kernel yet)** - extend `gdn_gram_bench.cu` with KS/QS/PU/KᵀU microkernels + Phase-D/O T-formation + T·RHS apply, each with f64 host oracle measuring **κ(A)** and tf32-vs-3xtf32 NMSE per rung, incl. 
adversarial `g∈[-20,-1e-4]` | - | **The cheap go/no-go before multi-week kernel work.** Pass = default precision config (f32 diag + 3xtf32 off-diag + tf32 bulk) reaches ~PoC `3e-9`-grade on benign data and survives the κ(A) weak-decay corner within the ladder. Mirrors the PoC that proved 6.7×→9.3×. | +| **M2** | In-kernel: replace **only** step-1/2 serial Grams (KK/QK) with tensor-core tiles. **C=16, scalar everything else, same occupancy** (scope Phase 1 / PoC integration) | 0043 | test-backend-ops 128-shapes green via KL gate (NMSE if it passes); greedy-md5 stable. | +| **M3** | Add **P3/P4 (KS/QS)** tensor-core (3xtf32) + S restage bridge. Still C=16, scalar solve + scalar O/state | 0044 | Same gate. Isolates the accumulator→B bridge correctness. | +| **M4** | **A-inverse** Phase-D (f32 diag) + Phase-O (3xtf32 off-diag), form T; replace 0031's serial fwd-subst. Still C=16 | 0045 | Same gate + the adversarial-decay op case (this is the amplifier). | +| **M5** | **Apply `U=T·RHS`** + **P5 `P·U`** tensor-core. Still C=16 | 0046 | Same gate. | +| **M6** | **P6 `Kᵀ(D·U)`** tensor-core + **register-resident state** (step-6 accumulator layout) + accumulator→B restage in steady state. State leaves smem here | 0047 | Same gate. Frees the 64KB that forced C=16. | +| **M7** | **Flip C=16→C=64, full dv (CONFIG A ~86KB, 1 blk/SM)**, 8-warp 4×2 grid, PAD=4 smem | 0048 | Gate + **first A/B bench vs sequential** (S_PP at n_seqs≥2). | +| **M8** | **Occupancy:** C=32 + dv-slab grid `(H,n_seqs,n_slabs)` (CONFIG C, 2 blk/SM) + cp.async depth-2; selectors `GDN_CHUNK_C`/`GDN_DV_TILE` | 0049 | Gate + A/B bench across {C=32/dv64, C=32/dv32, C=64/dv64-BW-max}; pick winner per regime. | + +--- + +## (3) Bit-exact / KL gate plan + +**md5 is per-path and will NOT match** 0031-serial or the sequential recurrence (different FP reduction order). This is the established `-paged` precedent (`PAGED_BITEXACT_NOTE.md`): per-path md5, validated benign. 
So: + +- **Binding gate = KL** (not strict NMSE): `KLD(tensorcore ‖ f16) ≤ KLD(sequential ‖ f16)` plus a PPL band, on the README s5 harness. NMSE is *expected to fail* at reduced precision (new path on a new path); NMSE-pass is a bonus, KL-pass is the bar. +- **Stability gate:** greedy-md5 **stable across runs** (deterministic), not equal to the serial path. +- **Adversarial op case mandatory:** `g∈[-20,-1e-4]` (the dangerous middle-decay regime where κ(A) grows); strong-decay underflows to 0 (safe), weak-decay is well-conditioned (tf32's 8-bit exponent holds γ range), the middle is the only empirical risk. + +**Precision default config (the bet that clears the gate):** f32 diagonal inverse (mandatory, already f32) · **3xtf32 off-diagonal coupling** (default-on, negligible ~64-mma cost) · **tf32** Grams + apply · **bf16** K/Q staging (well-conditioned KK/QK only) · decays/γ/β **always f32 outside the mma** (invariant, not a rung). Hold **P6 state carry at 3xtf32 longest** (it compounds over every chunk). + +**3xtf32 ladder (cheapest→dearest) if default misses the gate:** (1) KK Gram→3xtf32; (2) apply **block-diagonal `T_ii·RHS_i`**→3xtf32 (within-window strong coupling, mixed-precision-by-distance); (3) +**δ=1 off-diagonal** apply→3xtf32 (block-boundary adjacent pairs e.g. tokens 15↔16); (4) **full apply**→3xtf32 (≈+2× apply, expensive escape); (5) KS/QS→3xtf32; (6) fall back to direct blocked back-substitution in 3xtf32, else keep 0031's serial path. **Demote order if the gate has margin:** P4→P3, holding P6 at 3xtf32. If even all-3xtf32 misses, the residual is the f32 diagonal solve (already f32) → not fixable by more mma precision → fall to (6). Record the final rung in `PAGED_BITEXACT_NOTE.md` + README s5. + +--- + +## (4) Slot into 0031's existing framework (historical, superseded by 0047) + +Same dispatch site - the `if constexpr (!KDA && !keep_rs_t)` block inside `launch_gated_delta_net` (0031 patch, after `init_fastdiv_values`). 
Extend, don't replace: + +- Current code keeps `GDN_CHUNK_MIN` as the token threshold and uses `GDN_TC` + as the tensor-core level selector. It does not parse `GDN_CHUNK_OFF` or + `GDN_CHUNK_TC`. +- Historical plan: add **`GDN_CHUNK_TC`** selector: `0` = 0031 serial-solve chunked (fallback, retained), `1` = tensor-core. Add **`GDN_CHUNK_C` ∈ {16,32,64}** and **`GDN_DV_TILE` ∈ {32,64,128}** for A/B; defaults `C=32, DV_TILE=64` (CONFIG C) for serving, `DV_TILE=32` saturator for n_seqs=1. +- New launcher `launch_gdn_chunked_tc<128, C, DV_TILE>` mirrors `launch_gdn_chunked`: `cudaFuncSetAttribute(...MaxDynamicSharedMemorySize...)` **return-checked** (0031 precedent), `grid = dim3(H, n_seqs, n_slabs)`, `block = dim3(256,1,1)`. Per-slab the kernel recomputes A/A⁻¹/gates (dv-independent), dv-slices S/Ud/O. +- **Default OFF** (`gdn_chunk_min=INT_MAX`) exactly as 0031 ships. Flip the default to on **only when** the M8 A/B shows an S_PP win over the tuned sequential recurrence at the serving regime (n_seqs≥2) **and** the KL gate + adversarial op case hold - recorded in README s5 (dev notes / rejected-flat levers) and `PAGED_BITEXACT_NOTE.md`. Until then it ships like 0031: opt-in, regression-free default. +- Extend the test-backend-ops block 0031 added (the `S_v==128` shapes at lines after :9398) so the tc path is exercised at C=64 and C=32 in CI. +- New per-path md5 acknowledged in the dispatch comment (tc-md5 ≠ serial-chunked-md5 ≠ sequential-md5; all benign, KL-validated). + +--- + +## (5) Top 3 risks that could make it NOT beat sequential + kill-criteria + +**Risk 1 - Register pressure forces 1 block/SM (the swing factor).** The ~50 working-regs/thread estimate is optimistic; the A⁻¹ blocked solve (in-register `16×16` diag inverse), the accumulator→B restage transpose, and the O+state transient accumulators can push working regs to 80-120. 
At 256 threads, >128 regs/thread → >32K regs/block → **silently 1 block/SM regardless of the 44KB shared headroom**, and local-memory spills push BW back. *Mitigation ladder:* (i) 12 warps (dilute state 32→21 regs/thread); (ii) `__launch_bounds__(256,2)`; (iii) smaller `DV_TILE`. **Kill criterion:** if after the full ladder the M8 occupancy build still spills to local OR stays 1 block/SM, **and** the CONFIG-A BW-max 1-block path (C=64, dv64, 96KB, cp.async, 4× state-BW cut) **also** fails to beat sequential S_PP at n_seqs≥2 in the A/B bench → the occupancy lever is dead; keep 0031 serial-chunked behind `GDN_CHUNK_TC=0`, record rejected in README s5. + +**Risk 2 - Precision: tf32 (and even all-3xtf32) misses the KL gate in the weak-decay/aligned-keys κ(A) corner.** The inverse amplifies error; κ(A) is data-dependent and grows where keys align and decay is weak. **Detected cheaply at M1** (microbench measures κ(A) + per-rung NMSE on the adversarial case *before* the kernel exists). **Kill criterion:** if at M1 the **top of the ladder (all-3xtf32 + f32 diagonal)** cannot reach f32-grade on `g∈[-20,-1e-4]`, OR at M4+ `KLD(tc‖f16) ≤ KLD(seq‖f16)` fails on that op case at the top rung → the tensor-core solve is not numerically viable as a default; fall to ladder rung (6) (direct back-subst 3xtf32); if that also misses, abandon the tc solve and keep 0031 serial. **Fail-fast:** M1 gates this before any multi-week kernel commitment. + +**Risk 3 - Grid starvation at n_seqs=1 is structural (H=32 < the ~96 blocks needed for 2 blk/SM × 48 SM).** Only `DV_TILE=32` (4 slabs) fully saturates a single stream, and it pays ~1.5-2× redundant-A flops (A/A⁻¹/gates recomputed per slab) plus the per-chunk restage. 
**Kill criterion:** if the M8 bench shows single-stream (n_seqs=1) S_PP is slower than sequential even at full saturation (dv32×4) due to redundant-A + restage overhead, **and** the batched regime (n_seqs≥2) gain also fails to materialize → the lever only helps a regime the target workload doesn't hit → keep default-OFF, ship as opt-in experiment only, record. (If n_seqs≥2 *does* win, ship enabled for the serving regime and gate single-stream back to sequential via `GDN_CHUNK_MIN` + an n_seqs check - a partial, honest win.) + +**Overarching kill gate:** the disposition is the bench, not the theory. The kernel flips to default-on only when it beats the tuned sequential recurrence at the serving regime AND clears the KL + adversarial gates. Any milestone that regresses test-backend-ops or md5-stability halts the stack until fixed; M1 and M0 are the cheap fail-fast exits before the expensive kernel work. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/TENSORCORE_GDN_SCOPE.md b/backend/cpp/llama-cpp-localai-paged/docs/TENSORCORE_GDN_SCOPE.md new file mode 100644 index 000000000000..e8b0821a056d --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/docs/TENSORCORE_GDN_SCOPE.md @@ -0,0 +1,362 @@ +# TENSORCORE_GDN_SCOPE - tensor-core chunked gated-DeltaNet prefill (design only) + +**Status: DESIGN + SCOPE ONLY. No kernel written, no GPU run, no PTX in this pass.** +This scopes the follow-up recorded by patch 0031 and README section 5: a +tensor-core (`mma`) chunked gated-DeltaNet (GDN) prefill kernel - the path that +would actually *beat* the tuned sequential scan and close the GDN prefill bucket +toward vLLM. vLLM's chunked GDN scan was measured ~2.5x cheaper in the prefill +ground-truth precisely because it pushes the intra-chunk products through +tensor-core matmuls; patch 0031 proved the chunking math but, with serial +per-thread reductions at the GB10-forced `C=16`, came out ~22% *slower* than the +sequential recurrence. 
This document scopes replacing those reductions with +`mma.sync` matmuls and lifting the occupancy ceiling. + +> **Read patch 0031 + README section 5 first.** The bounded/stable de-gating form +> (pairwise decays `d <= 1`, `gamma <= 1`), the per-path bit-exact precedent, and +> the honest negative ("C=16 all-shared -> 1 block/SM -> serial reductions -> 22% +> slower, grid-starved at low n_seqs") are the starting point. This doc does not +> re-derive the algebra; it maps it onto tensor cores. + +> **Regime note (the mechanism, read this).** The sequential scan is +> **bandwidth-bound**: it re-streams the entire `128x128` f32 state (64KB) once +> *per token*. README section 5 already records it runs at ~84.7% of GB10 peak BW +> (decode) and the recurrence is a llama *win* vs vLLM's BW. So a tensor-core +> kernel does **not** win by doing the same work faster - it wins by **changing +> the work**: chunking by `C` reads/writes the state `n_tokens/C` times instead of +> `n_tokens` (a ~`C`x cut in state traffic, the dominant prefill GDN cost), and the +> price is `O(C^2)` extra intra-chunk dot-products per chunk. The naive 0031 paid +> that price in serial f32 reductions, which cost *more* than the BW it saved - +> hence 22% slower. **Tensor cores make the added intra-chunk flops nearly free, +> so the BW saving becomes a net win.** That is exactly why vLLM's chunked scan is +> 2.5x cheaper. The whole lever rests on this trade; if a GPU re-profile shows +> prefill GDN is *not* state-BW-bound, stop and re-scope (Phase 0 in section 6). + +--- + +## 1. GB10 tensor-core reality (sm_121a) - confirmed, not assumed + +GB10 / DGX Spark reports **compute capability 12.1 (sm_121)**, CUDA 13 (README +section "Hardware: GB10 / DGX Spark (CUDA 13, sm_121)"). sm_121a is **consumer +Blackwell** (the SM12x family, same tensor-core programming model as RTX 50 / +sm_120), **not** data-center Blackwell (sm_100a / GB200).
This distinction is the +single most important input to the design and is confirmed from sources, not +assumed: + +- **No `wgmma`.** Warp-group MMA is Hopper (sm_90a) only; targeting SM12x yields + `ptxas error: Instruction 'wgmma.fence' not supported on .target 'sm_120'`. + Do **not** design around Hopper-style warp-group MMA. +- **No `tcgen05` / no TMEM.** SM12x lacks the Tensor Memory hardware entirely, so + the autonomous 5th-gen tensor-core path (`tcgen05.mma`, the sm_100a data-center + instruction) is unavailable. This is the same wall that makes vLLM/CUTLASS fall + back to Marlin and gate FP4 to sm_100a on GB10 (tracked in CUTLASS #2800/#2947, + vLLM #43906). We cannot use it either. +- **What sm_121a DOES have: extended `mma.sync`.** The Ampere/Ada warp-level + `mma.sync` family, extended with the Blackwell numeric formats (FP8/FP6/FP4). + "Consumer Blackwell put new data types on top of the oldest programming model." + For our operands (q/k/v/state are f32 in the op, see below) the usable tiles are + the standard warp-level ones: + - **bf16/f16 inputs, f32 accumulate:** `mma.sync.aligned.m16n8k16` (and + `m16n8k8`). 7-bit (bf16) / 10-bit (f16) input mantissa. + - **tf32 inputs, f32 accumulate:** `m16n8k8` / `m16n8k4`. 10-bit input mantissa + - the **highest-precision tensor-core option** on this part, and the one this + design defaults to (the GDN is decay-sensitive; see section 4). + - FP8 (`m16n8k32`) / FP4 (`m16n8k64.kind::mxf4nvf4`, block-scaled) compile on + sm_121a but are **out of scope** here - the GDN q/k/v/state are not 4/8-bit. +- **`cp.async` is available** (Ampere+), so global->shared double-buffering of the + K/Q chunk tiles is on the table for the occupancy phase. There is **no TMA** on + SM12x; staging is plain `cp.async`, not `cp.async.bulk`. 
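To give a feel for the granularity of the warp-level tiles listed above, the following is a minimal sketch that counts how many `mma.sync` issues cover one dense product. The function name `mma_count` and the 64 x 64 x 128 example shape are illustrative assumptions, not taken from the patch series; the tile shapes are the `m16n8k8` (tf32) and `m16n8k16` (bf16/f16) ones named in the list.

```python
def mma_count(M, N, K, tile=(16, 8, 8)):
    """Number of m x n x k tile issues that cover an M x N x K product.

    Hypothetical helper: assumes exact tiling and ignores triangular
    masking, so it is an upper bound for the lower-triangular products.
    """
    m, n, k = tile
    assert M % m == 0 and N % n == 0 and K % k == 0
    return (M // m) * (N // n) * (K // k)

# Illustrative shape: a 64 x 64 output reduced over a 128-wide head dim.
tf32_issues = mma_count(64, 64, 128, tile=(16, 8, 8))    # m16n8k8
bf16_issues = mma_count(64, 64, 128, tile=(16, 8, 16))   # m16n8k16
print(tf32_issues, bf16_issues)  # 512 256
```

The tf32 tile needs twice the issues of the bf16 tile for the same product, which is part of the cost of choosing the higher-precision input format.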
+ +**Reuse, do not hand-roll PTX.** ggml already ships a warp-level MMA tile +abstraction at `ggml/src/ggml-cuda/mma.cuh` (the `tile` fragments + +`mma()` used by the FlashAttention-mma and MMQ kernels), and it already routes +through `turing_mma_available(cc)` / `ampere_mma_available(cc)` - i.e. it is +sm_121-correct today. Build the GDN matmuls on that API (bf16/half/tf32 fragments, +f32 accumulators), not on raw `asm volatile("mma.sync...")`. This de-risks the +kernel and keeps it consistent with the backend's other tensor-core paths. + +**Bottom line for the design:** the kernel is a **warp-synchronous `mma.sync`** +kernel (Ampere-class programming model with Blackwell silicon), *not* a +warp-group / TMA / tcgen05 kernel. Every "wgmma"/"tcgen05" idea from FLA's +sm_90/sm_100 kernels must be down-translated to `mma.sync` + `cp.async`. Patch +0031's and README's shorthand "mma/wgmma" should be read as **mma only** on this +part. + +--- + +## 2. Mapping the chunked GDN matmuls onto `mma.sync` + +The chunked gated-delta-rule (patch 0031 header) has seven dot-product families. +Six are plain matmuls and map cleanly to `mma`; the seventh (the A-inverse) is a +unit-lower-triangular solve and is the one subtle case. Notation: `C` = chunk +length, `dk = dv = S_v = 128` (GDN head dim), per `(head, seq)` block. + +| # | Product (0031 step) | Shape | mma form | Notes | +|---|---|---|---|---| +| 1 | `KK[t,t'] = k_t . k_t'` (for `A`) | `C x C` over `k=dk=128` | `(C x dk) x (dk x C)` | Gram matrix; only strict-lower triangle used. Decay `d(t',t)` + `beta_t` applied **after** mma in f32. | +| 2 | `QK[t,t'] = q_t . k_t'` (for `P`/`O`) | `C x C` over `k=dk` | `(C x dk) x (dk x C)` | Lower triangle (`t' <= t`); decay applied after in f32. | +| 3 | `KS[t,j] = (S0^T k_t)[j]` | `C x dv` over `k=dk` | `(C x dk) x (dk x dv)` | `S0` is the chunk-entry state (stationary operand). Feeds RHS of the solve.
| +| 4 | `QS[t,j] = (S0^T q_t)[j]` | `C x dv` over `k=dk` | `(C x dk) x (dk x dv)` | The `gamma_t` cross-chunk term of `O`. | +| 5 | `O += P . U` | `C x dv` over `k=C` | `(C x C) x (C x dv)` | `P` (decay-masked `QK`) times the solved `U`. | +| 6 | `S_C += K^T (D .* U)` | `dk x dv` over `k=C` | `(dk x C) x (C x dv)` | The state update; `D` = `diag(d(t,last))` applied to `U` in f32 first. | +| 7 | `U = A^{-1} RHS` | `C x C` solve, `C x dv` RHS | blocked fwd-subst (see below) | The only non-GEMM. | + +**Critical precision invariant (preserve the bounded de-gating).** Every decay +(`gamma_t`, `d(t',t) = exp(cs_t - cs_t')`) and every `beta_t` stays in **f32** and +is applied as an elementwise scale **before/after** the mma, never inside it. The +mma only ever multiplies the raw, unweighted dot-products (`k.k`, `q.k`, +`S0^T k`, `S0^T q`, `P.U`, `K^T U`). This keeps the strong-decay underflow-to-zero +behaviour (the adversarial `g in [-20, -1e-4]` op test) exactly as 0031 has it - +the numerically delicate part never touches reduced precision. This is the +discipline that makes a tf32/bf16 mma kernel safe for a decay-sensitive op. + +### The A-inverse (step 7) - it CAN be tensor-core'd + +`A = I + N`, `N = tril(beta_t d(t',t) k_t.k_t', -1)` is **strictly lower +triangular**, hence **nilpotent** (`N^C = 0`). Two routes, both better than 0031's +serial per-thread forward substitution: + +- **Blocked forward substitution (RECOMMENDED, this is the FLA "UT transform").** + Partition `C` into sub-blocks of `b` (e.g. `b = 16`, one mma `m`-tile). Invert + each `b x b` diagonal block in registers (it is unit-lower-triangular `b x b`, + cheap: a short serial solve or the finite Neumann series on a `b`-nilpotent, + `<= b-1` terms), then propagate to the off-diagonal sub-blocks with **mma** + (the inter-block coupling `U_i -= A_ij U_j` is exactly a `(b x b) x (b x dv)` + matmul). 
For `C = 64, b = 16` that is 4 tiny in-register diagonal solves + a + triangular sweep of mma updates - the bulk of the solve is on tensor cores, only + the `16 x 16` diagonals stay scalar. +- **Neumann/Newton-Schulz inverse (fallback).** `A^{-1} = I - N + N^2 - ... ` is + finite (`C` terms) but `O(C)` mma's of `C x C`; Newton-Schulz + (`X <- X(2I - AX)`) converges in `~log2(C)` steps for the nilpotent part. Cheap + in flops, but more numerically exposed than blocked subst for adversarial decays. + Keep as a fallback if blocked subst's register pressure hurts occupancy. + +Verdict: **blocked forward substitution** - it keeps the sensitive diagonal solve +exact-in-registers and tensor-core's only the well-conditioned off-diagonal +coupling. This is precisely the structure FLA/vLLM use, down-translated to `mma`. + +### Tile/chunk design that fits the 99KB shared budget AND feeds the mma + +The 0031 failure was a layout failure: the all-shared `128x128` f32 state (64KB) +crowded out everything and forced `C=16`. The fix is to get the state **out of the +bulk shared footprint**. Two complementary mechanisms: + +1. **State register-resident across the chunk loop (the key move).** `S` only + participates at chunk boundaries (steps 3,4 at entry; step 6 at exit). Keep it + as **mma accumulator fragments distributed across the block's warps** (each + warp owns a `dk x dv` sub-tile of `S`), persisting in registers across the + sequential chunk loop. Steps 3/4 read `S` as the stationary mma operand; step 6 + accumulates into it. This **frees the entire 64KB** - shared then holds only the + per-chunk K/Q/U/A tiles. (The chunked algorithm's whole point is that the heavy + work is intra-chunk and state-free, so the state need not be in shared.) +2. 
**dv-slab tiling for occupancy (the secondary move).** If register pressure + from a register-resident `128x128` state caps the kernel at 1 block/SM (likely + - that is a lot of accumulator registers), split the `dv=128` value dimension + into slabs (`dv_tile in {64, 32}`). Each warp-group owns a `128 x dv_tile` + state slab. `A` and the solve depend only on `K` (not `dv`), so they are + computed once and the `C x C` `A^{-1}` is **broadcast/recomputed** per slab + (cheap once it is mma'd). This shrinks per-block register/shared pressure and is + the lever for >1 block/SM. + +**Shared budget at `C = 64` (state register-resident), staging K/Q as bf16/tf32:** + +| Buffer | Elems | Bytes | +|---|---|---| +| `Kc` (chunk K) | `C x dk = 64x128` | 16KB (bf16) | +| `Qc` (chunk Q) | `C x dk` | 16KB (bf16) | +| `Uc` (solved U) | `C x dv = 64x128` | 32KB (f32 for the solve) / 16KB (bf16 for the P.U + K^T U mma) | +| `A`/`P` scratch | `C x C = 64x64` | 16KB (f32) | +| gates `cs/gam/beta` | `~3C` | <1KB | +| **state** | (registers) | **0KB shared** | +| **Total** | | **~64-80KB** (under the 99KB opt-in) | + +So **`C = 64` fits the 99KB budget once the state is register-resident** - 4x the +0031 chunk, and a natural multiple of the `m16n8k*` tiles. For >1 block/SM, drop +to `C = 32` (`8 + 8 + 16 + 4 = 36KB` with f32 `U`, or `8 + 8 + 8 + 4 = 28KB` with +bf16-staged `U`; either way two blocks fit under the +~49.5KB/block needed) and/or dv-slab the state. **Recommended default: `C = 64`, +tf32 mma, state register-resident** (maximize the BW-saving `C` first; chase the +second block/SM only if the bench says occupancy, not BW, is the residual). + +--- + +## 3. Occupancy plan (break the 1 block/SM ceiling) + +0031 is pinned to 1 block/SM by the 64KB shared state. The plan, in priority order: + +1. **Free the 64KB: state register-resident** (section 2).
This alone may not give + 2 blocks/SM (the register-distributed `128x128` f32 accumulator is heavy), but + it is the precondition for everything and it lets `C` grow to 64 - which is the + dominant win (`C`x less state BW). Even at 1 block/SM, `C=64` + mma should flip + the sign vs 0031. +2. **dv-slab the state** (`dv_tile = 64` then `32`): halve/quarter the per-block + accumulator-register and shared pressure to admit a 2nd resident block, at the + cost of recomputing the `C x C` `A^{-1}` per slab (cheap on mma). This is the + primary occupancy lever once (1) is in. +3. **`cp.async` double-buffer the K/Q chunk loads**: overlap the next chunk's + global->shared staging with the current chunk's mma, hiding LPDDR5x latency that + 1-2 blocks/SM cannot. No TMA on sm_121, so plain `cp.async` (`commit_group` / + `wait_group`), Ampere-style. +4. **Grid starvation at low `n_seqs`** (0031's other failure: grid is `H x n_seqs`, + ~few hundred blocks): the larger `C` reduces per-block serial chunk steps, and + dv-slabbing **multiplies the grid by the slab count** (`H x n_seqs x n_slabs`), + directly mitigating the low-`n_seqs` starvation that hurt 0031. + +Honest occupancy caveat: a register-resident `128x128` f32 state is a large +register commitment; the realistic outcome is **1-2 blocks/SM**, not high +occupancy. The design leans on **mma throughput + cp.async latency hiding + the +`C`x BW cut**, not on many resident blocks, to win. If profiling shows the kernel +register-capped at 1 block/SM *and* tensor-core-active-% still low, that is the +signal to dv-slab harder (smaller `dv_tile`) or accept the achieved win. + +--- + +## 4. Bit-exactness + precision risk + +This is a **NEW FP path on top of a NEW FP path**. 0031 is already not byte-equal +to the sequential recurrence (different reduction order; README s5 records it as a +benign per-path result). Adding tf32/bf16 mma is a *further* reduced-precision +step. 
Gate it exactly like the backend's other new-FP-path precedents +(`PAGED_BITEXACT_NOTE.md`, the paged-MoE `8cb0ce23`, the PREFILL_GEMM scope): + +- **Greedy md5 stability** on the standard prompt (README s5 harness) - to catch + *unexpected* divergence on the non-prefill paths (decode must stay on the tuned + sequential kernel and byte-match its reference; this lever is prefill-only and + opt-in, so the default path is untouched). +- **`test-backend-ops GATED_DELTA_NET`** at the 0031 prefill shapes (the + `S_v=128` exact-multiple / tail / multi-seq / GQA / permuted cases), CUDA0 vs the + CPU f32 oracle. **Honest expectation: bf16 mma will likely NOT clear the 1e-7 + NMSE gate; tf32 is borderline.** So the binding gate is the **KL-gate**, not + strict NMSE: require `KLD(tensorcore || f16) <= KLD(sequential || f16)` and PPL + within the established band, recorded in `PAGED_BITEXACT_NOTE.md`. tf32 (10-bit + mantissa, f32 accumulate) is the precision default precisely to give the KL-gate + the best chance. +- **Precision fallback ladder if tf32 fails the KL-gate:** (i) **3xtf32** + emulation (split each f32 operand into 3 tf32 limbs, 3 mma's, recombine - the + CUTLASS fp32-emulation trick; near-f32 accuracy at 3x the mma cost, still far + cheaper than serial f32 loops and still a likely net win given the `C`x BW cut); + (ii) keep the **decay-coupled and state-boundary products in 3xtf32/f32** while + the well-conditioned intra-chunk Gram products use plain tf32 (mixed precision by + sensitivity). Do **not** fall back to bf16 for the decay-sensitive terms. +- **Preserve the bounded de-gating (section 2):** decays/`gamma`/`beta` stay f32, + applied outside the mma. Re-run the adversarial `g in [-20, -1e-4]` op case + specifically; a tensor-core kernel that moved a decay inside the mma would be a + silent precision regression even if the benign cases pass. 
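The 3xtf32 rung of the ladder above can be illustrated numerically without a GPU. The sketch below is a host-side emulation under stated assumptions, not the kernel: it models tf32 by truncating an f32 mantissa to 10 bits (hardware conversion may round instead), accumulates in Python doubles rather than f32, and uses one common formulation of the trick (a big + small split per operand with three products, the small x small term dropped). The helper names are hypothetical.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float to the nearest f32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def to_tf32(x: float) -> float:
    """Truncate to a tf32-like value: sign, 8 exponent bits, 10 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFE000))[0]

def dot_tf32(a, b):
    """Plain tf32: inputs truncated, one product per element pair."""
    return sum(to_tf32(x) * to_tf32(y) for x, y in zip(a, b))

def dot_3xtf32(a, b):
    """Three products per pair: big*big + big*small + small*big."""
    total = 0.0
    for x, y in zip(a, b):
        xb, yb = to_tf32(x), to_tf32(y)            # leading 10 mantissa bits
        xs, ys = to_tf32(x - xb), to_tf32(y - yb)  # next 10 bits of the residual
        total += xb * yb + xb * ys + xs * yb
    return total

# Positive inputs of length dk = 128, rounded to f32 first.
a = [to_f32(1.0 / (i + 3)) for i in range(128)]
b = [to_f32(1.0 / (2 * i + 7)) for i in range(128)]
exact = sum(x * y for x, y in zip(a, b))           # double-precision reference

err_1x = abs(dot_tf32(a, b) - exact)
err_3x = abs(dot_3xtf32(a, b) - exact)
print(f"tf32   abs err = {err_1x:.3e}")
print(f"3xtf32 abs err = {err_3x:.3e}")
```

On inputs like these the split version recovers most of the mantissa bits that plain tf32 discards, at three times the product count, which is the trade the ladder describes.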
+ +The likely-favourable framing (as in PREFILL_GEMM): keeping the heavy reductions +in f32-accumulate tensor cores is *more* precise than a naive f32 serial loop only +if the inputs stay full-width; here inputs are down-cast (tf32/bf16), so this is a +genuine precision *trade*, not a free win - hence the KL-gate is mandatory and the +3xtf32 ladder exists. Treat NMSE-gate-pass as a bonus, KL-gate-pass as the bar. + +--- + +## 5. Honest effort + expected gain + +**This is a multi-week GPU kernel project, not a routing change.** Unlike the +PREFILL_GEMM dense lever (a dispatch flip onto an existing vendor kernel), there is +no vendor chunked-GDN kernel to route to on sm_121 (CUTLASS/FLA gate the good +paths to sm_100a; that is the whole reason vLLM falls back to Marlin on GB10). We +must write the `mma` kernel ourselves. Realistic estimate: **4-8 weeks** of +focused kernel work, high risk, with non-trivial probability the occupancy/register +wall caps the win. + +**Expected gain (mechanism-grounded, see the regime note):** the lever cuts +the state-BW that dominates sequential GDN prefill by `~C`x (chunking) while +tensor cores absorb the `O(C^2)` intra-chunk flops. Fully realized, it targets +vLLM's ~2.5x-cheaper chunked GDN prefill bucket = the ~17% prefill lever the +ground-truth attributes to GDN. It should also help the serial-SSM portion of the +**decode** residual (README names the irreducible "serial-SSM host loop" as part +of the decode floor; a chunked state-update reduces the per-step state traffic +there too, though decode `n_tokens` is small so the prefill regime is where it +pays). **Honest ceiling:** sm_121 has no wgmma/tcgen05, so we cannot match a +hypothetical sm_100a FLA kernel's throughput - the `mma.sync` path is the Ampere- +class programming model on Blackwell silicon.
But `mma` over serial f32 reductions +is an order-of-magnitude flop-rate jump, which is more than enough to flip 0031's +-22% into a win and recover most of the GDN prefill bucket. Do not promise full +parity with vLLM's sm_100-class kernels; promise "beats the sequential scan and +closes most of the GDN prefill gap." + +**Risk register:** +- Register-resident `128x128` state may cap occupancy at 1 block/SM (section 3) - + mitigated by dv-slabbing, but slabbing recomputes `A^{-1}` per slab. +- tf32 may miss the KL-gate -> 3xtf32 ladder (3x mma cost) -> thinner margin. +- The win is contingent on prefill GDN being state-BW-bound (regime note); a GPU + re-profile that says otherwise kills the lever (Phase 0). +- Blocked-forward-subst register pressure trades against state-register pressure; + both compete for the same budget on a 1-block/SM kernel. + +--- + +## 6. Phased build plan + +Smallest tensor-core proof-of-concept first, bit-exact/KL-gate + A/B bench at every +phase, per `.agents/vllm-parity-methodology.md` (one lever at a time, record +rejected/flat variants, ground-truth both engines). + +### Phase 0 - re-confirm the regime on GPU (NO code) +nsys a **prefill-only** window (`llama-batched-bench -npp -ntg 0/1`, +exclude graph capture) on q36-27b-nvfp4 + q36-35b-a3b, at the backend pin, with +`GDN_CHUNK_MIN` set so 0031 runs. Confirm (a) the GDN prefill bucket is +state-BW-bound (state memcpy/recurrence dominates, tensor-core-active-% low), and +(b) it is ~17% of the prefill step / ~2.5x vLLM's. **If prefill GDN is not +state-BW-bound, stop and re-scope** - the entire mechanism (the regime note) fails. + +### Phase 1 - PoC: tensor-core just TWO products, same occupancy +Keep 0031's `C=16` all-shared layout and 1 block/SM. Replace **only** the two +cleanest `C x C` Gram products - step 1 (`KK` for `A`) and step 2 (`QK` for `P`) - +with `ggml/src/ggml-cuda/mma.cuh` tf32 tiles (decays still applied in f32 after).
+Leave the solve, the `S0` products, and the state update serial. This is the +minimal "do tensor cores help here at all" probe at fixed occupancy. +- Gate: greedy md5 stable; `test-backend-ops GATED_DELTA_NET` prefill shapes via + the KL-gate (NMSE if it passes). +- Bench: `llama-batched-bench` S_PP, A/B vs sequential and vs 0031-serial, same + harness. **If even this does not move S_PP, the head-dim/occupancy is the wall, + not the reductions - learn it cheaply before the big build.** + +### Phase 2 - full intra-chunk tensor-core + register-resident state + C=64 +State register-resident (free the 64KB), `C=64`, tf32 mma for all of steps 1-6, +blocked-forward-subst `A^{-1}` (step 7) with mma off-diagonal coupling + +in-register `16x16` diagonal solves. Decays/gamma/beta stay f32 throughout. +- Gate: as Phase 1, plus the adversarial `g in [-20,-1e-4]` op case explicitly. + If tf32 misses the KL-gate, climb the 3xtf32 ladder (section 4). +- Bench: S_PP A/B vs sequential, sweep prefill length and `npl`; record the + `C in {32,64,128}` sweep and any rejected `C`. + +### Phase 3 - occupancy + latency hiding +dv-slab the state (`dv_tile in {64,32}`) for a 2nd resident block and to multiply +the grid (fix low-`n_seqs` starvation); `cp.async` double-buffer the K/Q chunk +loads. Tune `C`, `dv_tile`, warp count per the bench. +- Gate: unchanged (the FP path does not change; this is scheduling). +- Bench: final S_PP vs sequential + indicative % of vLLM prefill; name the + residual floor honestly (register-cap / sm_121-has-no-tcgen05). + +### Disposition +Like 0031, ship **opt-in default-OFF first** (extend the existing `GDN_CHUNK_MIN` +gate, add a `GDN_CHUNK_TC` selector if the serial path is kept as fallback). Flip +the default only when a separately-built A/B proves S_PP beats the sequential scan +*and* the KL-gate holds, recorded in README section 5 + `PAGED_BITEXACT_NOTE.md`. 
+If a phase comes back flat-or-slower, record it as a rejected lever with the reason +(the most valuable output if it fails) and keep 0031's serial path as the shipped +prefill kernel. + +--- + +## 7. Summary + +| Aspect | Decision | +|---|---| +| Tensor-core ISA | **`mma.sync` only** (sm_121a: no wgmma, no tcgen05/TMEM - confirmed) | +| Building block | reuse `ggml/src/ggml-cuda/mma.cuh` tiles, not raw PTX | +| Precision default | **tf32** inputs / f32 accumulate; **3xtf32** ladder if KL-gate misses; bf16 only for well-conditioned Gram terms | +| Decay handling | gamma/d/beta stay **f32**, applied outside the mma (preserve bounded de-gating) | +| A-inverse | blocked forward substitution (FLA UT-transform): in-register diagonal solves + mma off-diagonal | +| Chunk size | **C=64** default (4x 0031), C=32 for 2 blocks/SM | +| State | **register-resident** (frees the 64KB that forced C=16); dv-slab for occupancy | +| Shared budget | ~64-80KB at C=64 state-register-resident (under the 99KB opt-in) | +| Mechanism / why it wins | chunking cuts state-BW by ~Cx; mma absorbs the O(C^2) intra-chunk flops the serial 0031 could not | +| Bit-exact | NEW per-path; **KL-gate** binding (NMSE likely fails at reduced precision), greedy md5 + adversarial-decay op case | +| Effort | **multi-week (4-8 wk), high risk**; no vendor kernel to route to on sm_121 | +| Expected gain | beats the sequential scan, closes most of the ~17% GDN prefill bucket toward vLLM's 2.5x; also helps the decode serial-SSM residual. NOT full sm_100-class parity. | +| Phasing | P0 re-profile -> P1 two-product PoC -> P2 full intra-chunk + C=64 + reg-state -> P3 occupancy/cp.async; opt-in default-OFF until A/B-proven | + +Decode is untouched (this is prefill-only, opt-in); the stock `llama-cpp` backend +stays patch-free. This lever lives entirely in `llama-cpp-localai-paged`. 
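As a cross-check on the shared-budget and state-traffic rows of the summary table, here is a small arithmetic sketch. It assumes the element sizes from the section 2 budget table (bf16 K/Q, f32 or bf16 `U`, f32 `A`/`P` scratch), treats the gates as `3C` f32 values, and reads the 49.5KB two-block threshold as half of the 99KB figure; the function names and the 4096-token prefill length are hypothetical.

```python
KB = 1024
DK = DV = 128             # GDN head dim (S_v)
SHARED_OPT_IN = 99 * KB   # shared budget quoted in this doc

def shared_bytes(C, kq_bytes=2, u_bytes=4):
    """Per-block shared footprint with the state held in registers."""
    kc = C * DK * kq_bytes    # chunk K tile
    qc = C * DK * kq_bytes    # chunk Q tile
    uc = C * DV * u_bytes     # solved U tile
    a = C * C * 4             # A / P scratch, f32
    gates = 3 * C * 4         # cs / gam / beta, assumed f32
    return kc + qc + uc + a + gates

def state_passes(n_tokens, C):
    """Times the 128x128 state is streamed: once per chunk of C tokens."""
    return -(-n_tokens // C)  # ceiling division

for C, u_bytes in [(64, 4), (64, 2), (32, 4), (32, 2)]:
    total = shared_bytes(C, u_bytes=u_bytes)
    print(f"C={C} U={u_bytes}B/elem: {total / KB:.2f} KB, "
          f"fits one block: {total <= SHARED_OPT_IN}, "
          f"fits two blocks: {2 * total <= SHARED_OPT_IN}")

n_tokens = 4096               # hypothetical prefill length
print(state_passes(n_tokens, 1), "->", state_passes(n_tokens, 64), "state passes")
```

Under these assumptions `C = 64` lands in the ~64-80KB band and only `C = 32` leaves room for a second block, matching the table.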
diff --git a/backend/cpp/llama-cpp-localai-paged/docs/UPSTREAM_LAYER2_SCOPE.md b/backend/cpp/llama-cpp-localai-paged/docs/UPSTREAM_LAYER2_SCOPE.md new file mode 100644 index 000000000000..835d8c3a6ff4 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/docs/UPSTREAM_LAYER2_SCOPE.md @@ -0,0 +1,343 @@ +# Layer-2 upstream scope: native fused-GDN kernels for Metal / Vulkan / SYCL + +Source-only analysis (no GPU, no build) of what it would take to give the +gated-DeltaNet (GDN / SSM) decode fusions native kernels on the non-CUDA compute +backends, so the patch-series decode win extends past CUDA-family hardware. + +This doc is the GDN/SSM-fusion (benefit #1) detail. For the umbrella scope that +also covers the paged KV block-table flash-attn read (benefit #2), the free +host-side scheduler (benefit #3), the out-of-scope NVFP4 track (benefit #4) and a +ROCm note - and the combined per-backend sequencing - see +[`ACCELERATOR_PORTING_SCOPE.md`](ACCELERATOR_PORTING_SCOPE.md). + +In our changeset (patches 0018-0030) these fusions ship with CUDA native kernels ++ CPU reference kernels ONLY; patch 0030 force-gates them OFF on Metal / Vulkan / +SYCL (a CPU-fallback fused op would regress via the device round-trip, and a +backend that ran the plain op on the discriminated node would silently +miscompute). "Layer 2" is the upstream work that adds the missing native kernels. + +This doc was written against the ggml backend trees in +`backend/cpp/llama-cpp-paged-dev` (upstream base #24732, one commit OLDER than the +series pin `c299a92c` #25045, with only the two paged-KV patches applied - neither +touches GDN/SSM). So every "kernel already exists" statement below is a +conservative lower bound: the pin has at least these kernels. + +-------------------------------------------------------------------------------- +## 0. 
Headline finding (correct a stale assumption first) + +The series README (section 4c) says "the gated-DeltaNet op has no Vulkan kernel +upstream, so the Qwen3.6 hybrid models assert / fall back and don't run there." +**That is now stale.** All three backends already carry the BASE compute ops: + +| op | Metal | Vulkan | SYCL | +|------------------------|------------------------------------|------------------------------------------|---------------------------------| +| GGML_OP_GATED_DELTA_NET| `kernel_gated_delta_net_impl` (f32, NSG 1/2/4) | `gated_delta_net.comp` (d16/32/64/128 x kda, shmem/cluster/nocluster variants) | `gated_delta_net.cpp` (`launch_gated_delta_net`) | +| GGML_OP_SSM_CONV | `kernel_ssm_conv_f32_f32` (+ `_4`, + batched) | `ssm_conv.comp` (+ APPLY_BIAS, APPLY_SILU specialization consts) | `ssm_conv.cpp` (`kernel_ssm_conv`) | +| GGML_OP_SSM_SCAN | yes | `ssm_scan.comp` (mamba2) | `ssm_scan.cpp` (mamba2) | + +Verified: Vulkan `gated_delta_net.comp` was last touched at the upstream base +commit (#24732), not by any LocalAI patch. So the GDN COMPUTE op is present on +Metal, Vulkan AND SYCL. The Qwen3.6 hybrids therefore DO run on all three today +(via the upstream non-fused path that 0030 routes to). The Layer-2 value-add is +the decode SPEEDUP from the fusions, NOT enabling the model to run at all. + +Consequence: the GDN-compute op being "partly there" is true on every backend, +not just Metal. What is still missing per backend is only the FUSION plumbing +(in-place write-back target, the ids gather read, and the conv-update kernel) - +a materially smaller scope than "port GDN from scratch." + +-------------------------------------------------------------------------------- +## 1. Per-op semantics (the four fusions to port) + +All four reuse an existing GGML_OP enum with extra `src[]` slots as a +discriminator; none adds a new enum value. f32 throughout. 
The arithmetic core +is IDENTICAL to the upstream non-fused op; only the read source and/or the write +target are redirected. That single fact drives the whole bit-exactness story +(section 3). + +### OP A - `ggml_gated_delta_net_inplace` (patch 0018) +- Enum `GGML_OP_GATED_DELTA_NET`, discriminated by a non-null `src[6]` = + `state_dst` (a contiguous `[S_v*S_v*H, n_seqs]` view into the recurrent-state + cache at `kv_head`). K == 1 only. +- Semantics: run the standard GDN recurrence, but write the FINAL recurrent state + directly into `state_dst` instead of appending it to the op output. The op + output then carries only the attention scores. Removes the per-layer per-step + ~full-state D2D copy-back (the 0018 win). +- Race (in-place read == write): each (seq, head) block owns a disjoint cache + slot. The kernel loads the whole prior state `s0` into per-thread registers + (`s_shard` on CUDA, `ls[NSG]` on Metal, the column shard on Vulkan/SYCL) + BEFORE the ring write, so reading and writing the same slot is safe. + +### OP B - `ggml_gated_delta_net_inplace_ids` (patch 0019) +- Adds `src[5]` = FULL state cache `[S_v,S_v,H,n_rs_slots]`, `src[7]` = `ids` + (I32, per-seq source slot == the recurrent-state `s_copy`), `op_param[1]` = + `rs_head` (destination base slot). Still has the OP-A `src[6]` in-place target. +- Semantics: read each sequence's prior state directly from `cache[ids[seq]]` + (mirrors `ggml_ssm_scan`'s ids source), eliminating the `ggml_get_rows` + materialization. Combined with OP A the op now reads AND writes the cache in + place. +- Race: identity sequences (`ids[s] == rs_head + s`, the steady AR-decode case) + read s0 in place from the destination slot (safe via the register snapshot + above). 
Non-identity sequences (reorder / rs_zero remap) are first copied by a + TINY separate gather kernel (`gdn_gather_nonident`, one block/seq) into a + DISJOINT scratch that the recurrence then reads, so the recurrence never reads + a slot another block is writing. Value-preserving memcpy -> bit-identical to + the get_rows path. + +### OP C - `ggml_ssm_conv_update_inplace` (patch 0021) +- Enum `GGML_OP_SSM_CONV`, discriminated by a non-null `src[3]` = + `conv_state_dst` (`[(K-1)*channels, n_seqs]` in-place ring view). + `src[0]` = conv_states `[K-1, channels, n_seqs]`, `src[1]` = conv_kernel + `[K, channels]`, `src[2]` = x_cur `[channels, 1, n_seqs]`. `op_param[0]` = + fuse_silu. +- Semantics (decode, n_seq_tokens == 1): per (channel, sequence) assemble the + width-K conv window in registers from the K-1 cached taps + the current token, + compute the depthwise conv with the SAME ascending-tap FMA order as plain + `ssm_conv` (`tap0*w0 + ... + xc*w_{K-1}`, then `+0.0f` to match plain conv's + `sumf += b` with b==0), optionally fold SiLU, write the conv output + `[channels,1,n_seqs]`, and write the 1-token-shifted ring state back in place. + Replaces the 4-op decode conv chain (transpose + concat + conv + silu + ring + cpy). +- Race: read source (gathered taps) and write target (cache view) are disjoint + buffers -> race-free by construction, no ids/identity logic. + +### OP D - `ggml_ssm_conv_update_inplace_ids` (patch 0028) +- Same enum, discriminated by a non-null `src[4]` = `ids`; `src[0]` becomes the + FULL conv cache `[K-1, channels, n_cells]`; `op_param[1]` = rs_head. +- Semantics: gather-free conv-update - read each sequence's prior taps from + `cache[ids[s]]` in-kernel (no get_rows). Identity reads in place from + `conv_state_dst`; non-identity gathered into a disjoint scratch first by a tiny + `ssm_conv_gather_nonident` kernel. The window is copied to a local array + BEFORE the (possibly aliasing) ring write so the identity read==write slot is + correct. 
Bit-identical to get_rows + OP C. + +### Net new kernels vs reuse, per op +- OP A: NOT a new compute kernel - a write-target redirection of the EXISTING + GDN kernel + 1 buffer binding + a supports_op/op-handler branch. +- OP B: the GDN kernel gains a per-seq read-base select (identity vs scratch) + + 1 ids binding + rs_head param + 1 tiny gather kernel. +- OP C: a GENUINELY NEW kernel on each backend. The existing `ssm_conv` computes + a windowed reduction over a PRE-concatenated input; it does not assemble the + window from cached taps + the current token, fold silu, or write the shifted + ring state. This is the largest net-new piece. +- OP D: the OP-C kernel gains the read-base select + 1 ids binding + rs_head + 1 + tiny conv gather kernel. + +The `ggml.h` / `ggml.c` builders, the CPU reference kernels, the model-graph +emission (`delta-net-base.cpp`, qwen35*), and the `test-backend-ops` cases are +SHARED and already done by patches 0018/0019/0021/0028. The only NEW per-backend +work is the kernel(s) + the backend wiring. + +-------------------------------------------------------------------------------- +## 2. Per-backend: authoring model, effort, gotchas, wiring + +### 2.1 Metal (MSL) + +Authoring model: `.metal` MSL source (`ggml-metal.metal`), function-constant +specialization (e.g. `FC_GATED_DELTA_NET`), kernels templated on `NSG`; host +glue split across `ggml-metal-ops.cpp` (`ggml_metal_op_*` encode), the pipeline +lookup in `ggml-metal-device.cpp`/`.m`, the kargs struct in `ggml-metal-impl.h`, +and `supports_op` in `ggml-metal-device.m`. Threadgroup model; Apple GPU +simdgroup width is a FIXED 32, `simd_sum` for the per-column reduce. + +Effort: MEDIUM. ~350-500 LOC. The GDN and plain-ssm_conv kernels already exist +and are ergonomic to extend. 
OP A is a write-base redirect of the existing +`kernel_gated_delta_net_impl` (its tail already does +`dst_state = dst + attn_size + state_out_base; dst_state[is] = ls[j]` after +loading `ls[]` into registers - just point `dst_state` at the `state_dst` buffer +and add the binding). OP C is the one net-new MSL kernel (Metal has NO bias/silu +ssm_conv variant today - only plain + `_4` + batched - so the silu-fold and ring +write are both new). Host glue spans 3-4 files. + +Gotchas: +- In-place race: the existing kernel ALREADY snapshots the state column into + `ls[NSG]` registers before writing, so OP A/B are safe with no barrier; OP C/D + must mirror the `float window[K]` local-copy-before-write that CPU/CUDA use. +- Discriminated SSM_CONV: `supports_op` for `GGML_OP_SSM_CONV` currently returns + `has_simdgroup_reduction` with NO check of `src[3]`/`src[4]`; GDN returns + `has_simdgroup_reduction && src[2]->ne[0] % 32 == 0` with NO check of + `src[6]`/`src[7]`. Both must be tightened (accept the discriminated variant + only once the kernel exists) AND `ggml_metal_op_ssm_conv` / + `ggml_metal_op_gated_delta_net` must branch on the extra src to pick the kernel. +- Bit-exactness: fixed 32-wide simdgroup makes this the SIMPLEST of the three - + the fused variant only redirects addresses, so it is bit-identical to Metal's + own non-fused path by construction (the conv per-channel FMA needs the exact + ascending order + the `+0.0f`). +- The kargs struct grows by the `state_dst` / `ids` / `rs_head` fields; a new + pipeline name (or a function-constant branch) distinguishes the variants. 
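
The local-copy-before-write pattern named in the in-place race gotcha can be sketched as a scalar reference for one (channel, sequence) pair. This is a minimal illustration, not the CPU/CUDA kernel: `conv_update_inplace` and its argument names are hypothetical, and Python floats are doubles, so it shows the ordering and aliasing logic only, not f32 bit-exactness.

```python
import math


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def conv_update_inplace(ring, weights, x_cur, fuse_silu=False):
    """One decode-step conv update for a single (channel, sequence) pair.

    ring    : list of K-1 cached taps, oldest first; mutated in place
    weights : list of K depthwise kernel weights for this channel
    x_cur   : the current token's input for this channel
    (hypothetical signature, for illustration only)
    """
    k = len(weights)
    assert len(ring) == k - 1

    # Copy the window to a local array BEFORE touching the ring, so the
    # result stays correct when the read slot and the write slot alias.
    window = list(ring) + [x_cur]

    # Ascending-tap accumulation, then the +0.0 that mirrors `sumf += b`
    # with b == 0 in the plain conv path.
    acc = 0.0
    for tap in range(k):
        acc += window[tap] * weights[tap]
    acc += 0.0

    if fuse_silu:
        acc = silu(acc)

    # Write the 1-token-shifted state back into the same storage.
    ring[:] = window[1:]
    return acc


ring = [1.0, 2.0, 3.0]
out = conv_update_inplace(ring, [0.5, 0.25, 0.125, 1.0], 4.0)
print(out, ring)  # 5.375 [2.0, 3.0, 4.0]
```

A port that reads taps directly from the ring while writing the shifted state would see partially updated values in the identity case; taking the snapshot first removes that dependency on write ordering.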
+ +### 2.2 Vulkan (GLSL .comp -> SPIR-V) + +Authoring model: GLSL `.comp` in `vulkan-shaders/`, compiled at build time by +`vulkan-shaders-gen` into embedded SPIR-V byte arrays (`gated_delta_net_f32_data` +etc.); pipeline creation in `ggml-vulkan.cpp` declares the binding count + +push-constant size; a push-constant struct per op; host dispatch `ggml_vk_*` +binds subbuffers; `supports_op` in the device support function. Subgroup size +VARIES by vendor (NVIDIA 32, AMD 64, Intel 8/16/32). + +Effort: HARDEST. ~450-650 LOC + the most build/host glue. Same kernel logic as +Metal/SYCL, but every new shader or variant requires: the shaders-gen regen, a +new `ggml_vk_create_pipeline` registration with an explicit binding count and +push-constant size, a new/extended push-constant struct (add `rs_head`), and +GROWING the descriptor binding set from the current 7 (`src[0..5]` + dst) to 8-9 +(`state_dst`, `ids`). The GDN host dispatch hardcodes a 6-src bind loop and the +pipeline is created with `"main", 7, ...` - both must change. + +Gotchas: +- Subgroup variance interacts with the EXISTING variant matrix: the GDN comp + already ships shmem / cluster / nocluster variants keyed on subgroup size and + relies on `S_V % COLS_PER_WG == 0`. The OP-A/B read/write redirect must be + applied across ALL of those variants, and re-validated per vendor. +- In-place race: GLSL must read the full column shard into local registers before + the ring write (same pattern); confirm the SPIR-V memory model is not relied on + for cross-invocation ordering (it is not - blocks are disjoint per (seq,head)). + OP C/D need the explicit window-to-local copy. +- Discriminated SSM_CONV: `supports_op` returns `op->src[0]->type == F32` with NO + discriminator check; GDN loops `src[0..5]` F32 with NO `src[6]`/`src[7]` check. + Both must be tightened. 
This is the backend where the 0030 hazard is most + concrete (a present plain-conv kernel + a permissive supports_op = silent + miscompute) - Vulkan is the exact case 0030 was written for. +- conv-update is per-channel (one invocation per channel) so it is + subgroup-AGNOSTIC; only the GDN recurrence carries the subgroup-width burden. +- Vulkan's `ssm_conv.comp` ALREADY has APPLY_SILU + APPLY_BIAS specialization + constants, so the silu-fold half of OP C is partly precedented here (unlike + Metal); the ring write-back + tap-window assembly are still new. + +### 2.3 SYCL (single-source DPC++) + +Authoring model: plain C++ `.cpp`/`.hpp` per op (`gated_delta_net.cpp`, +`ssm_conv.cpp`); a SYCL `queue.parallel_for` over an `nd_range` with +`reqd_sub_group_size(WARP_SIZE)`; sub-group reductions (`warp_reduce_sum`); +`supports_op` in `ggml-sycl.cpp`. NO separate shader-compile step (single +source). + +Effort: EASIEST to author. ~250-350 LOC. The SYCL op handlers + kernels are +near-VERBATIM mirrors of the CUDA ones (`launch_gated_delta_net`, +`s_shard`, `curr_state`, `state = dst + attn_score_elems`, `warp_reduce_sum`) - +a dpct/SYCLomatic-style port. The CUDA diffs in 0018/0019/0021/0028 would port +almost line-for-line: add the `state_dst` param, the `ids`/`rs_head` params, the +read-base select, the two tiny gather kernels, and the new conv-update kernel. +No pipeline/push-constant/binding bookkeeping. + +Gotchas: +- In-place race: the `s_shard[]` / window arrays are per-work-item private, so + the register-snapshot-before-write pattern carries over directly. Safe. +- Discriminated SSM_CONV: `supports_op` checks `src[0]`/`src[1]` F32 with NO + discriminator check; GDN returns a BARE `true` (the MOST permissive, so the + hazard is worst here). Both must be tightened, and `ggml_sycl_op_ssm_conv` / + `ggml_sycl_op_gated_delta_net` must branch on the extra src. 
+- Bit-exactness: `WARP_SIZE` is compile-fixed (Intel sub-group 8/16/32), same + situation as CUDA; the fused variant matches SYCL's own non-fused path by + construction. conv-update is per-channel -> subgroup-agnostic. + +### 2.4 Common wiring (all three) + the 0030 emission-gate change + +Per backend, four wiring touch-points beyond the kernel body: +1. `supports_op`: tighten the `GGML_OP_SSM_CONV` and `GGML_OP_GATED_DELTA_NET` + entries so the discriminated/extra-src node is reported supported ONLY when + the new kernel handles it (and rejected otherwise, instead of today's + silently-true-for-the-plain-kernel). +2. op handler: branch on `src[3]`/`src[4]` (conv) and `src[6]`/`src[7]` (GDN) to + dispatch the fused kernel. +3. pipeline/kernel registration (Vulkan: + push-constant struct + descriptor + bindings; Metal: + kargs fields + pipeline name; SYCL: just the new functions). +4. The patch-0030 gate in `src/llama-context.cpp`. + +The 0030 change today is a hard allow-list: any non-CPU compute backend whose reg +name is not `"CUDA"`/`"ROCm"`/`"MUSA"` forces `fused_gdn_ar = fused_gdn_ch = +auto_fgdn = false`. As each backend gains kernels this must become capability- +driven, in one of two ways: +- minimal: add the backend's reg name (e.g. `"Metal"`) to the allow-list once its + kernels + tightened supports_op ship; OR +- clean (recommended upstream form): DELETE the name allow-list and make + `supports_op` authoritative - have the `auto_fgdn` resolution probe + `ggml_backend_dev_supports_op` on a representative node that carries the + discriminated `src[]` slots. Then routing falls out of the normal scheduler + fallback and no backend name is ever hard-coded. This also fixes 0030's stated + weakness that the upstream `auto_fgdn` check only inspects GATED_DELTA_NET + nodes and covered the discriminated SSM_CONV only incidentally. + +-------------------------------------------------------------------------------- +## 3. 
Bit-exactness per backend (the md5 gate question) + +Feasible on ALL THREE, and not actually constraining, because of how the gate is +scoped: + +- The series md5 gate is a CUDA-vs-CPU comparison; each GPU backend ALREADY has + its own f32 reduction order (Metal `simd_sum`, Vulkan subgroup reduce, SYCL + `warp_reduce_sum`) that differs from CUDA's and from CPU's. There is no + cross-backend md5 and none is expected. +- The relevant per-backend invariant is: the FUSED variant must equal that + backend's OWN non-fused path. The fusions change only the read source + (gather -> indexed read; the gather is a value-preserving memcpy) and the write + target (appended output -> in-place cache slot). They do NOT touch the + per-column FMA/reduce order. So the fused op is bit-identical to the + non-fused op on the same backend BY CONSTRUCTION. +- Two arithmetic details each port MUST preserve exactly: (a) the conv + ascending-tap order plus the `+0.0f` that matches plain `ssm_conv`'s + `sumf += b` with b==0; (b) the existing GDN per-column subgroup reduce (do not + re-order it). Get those right and `test-backend-ops` (backendX-vs-CPU, already + registered for SSM_CONV / SSM_CONV_UPDATE / SSM_CONV_UPDATE_IDS / + GATED_DELTA_NET) is the per-backend gate. + +-------------------------------------------------------------------------------- +## 4. Upstream path and ranked recommendation + +### Ops-first, then one PR per backend (NOT one big PR) + +Recommended sequence: + +1. PR #1 - OPS (already essentially done, upstreamable as-is): the `ggml.h`/ + `ggml.c` builders, the CPU reference kernels, the CUDA kernels, the + `test-backend-ops` cases, and the capability-driven gate (the clean + `supports_op`-authoritative version of 0030). This is independently mergeable + and mirrors how llama.cpp lands new ops (CPU + CUDA first; GDN itself landed + that way). +2. PR #2 - Metal kernels + wiring. +3. PR #3 - SYCL kernels + wiring. +4. PR #4 - Vulkan kernels + wiring. 
+ +Do NOT bundle the backends: each needs its own hardware to validate +`test-backend-ops`, reviewers are backend-specialized, and a regression in one +must not block the others. + +### Value x effort ranking (which backend first) + +| backend | user base / value | author effort | bit-exact difficulty | net rank | +|---------|----------------------------|---------------|----------------------|----------| +| Metal | HIGH (Apple Silicon = largest non-CUDA LocalAI base; unified memory makes the no-copy / no-gather plumbing wins map directly) | MEDIUM | LOWEST (fixed 32 simdgroup) | **1st** | +| SYCL | LOW-MED (Intel GPU) | LOWEST (near-verbatim CUDA mirror) | LOW | **2nd** | +| Vulkan | HIGHEST breadth (AMD + Intel + cross-vendor) | HIGHEST (shaders-gen + variant matrix + subgroup variance + descriptor growth) | MEDIUM (per-vendor subgroup validation) | **3rd** | + +Recommendation: **Metal first.** It banks the biggest user-facing decode win at +medium effort, the base GDN + conv kernels already exist, and Apple's fixed +simdgroup width makes bit-exactness the simplest. **SYCL second** as a cheap, +nearly mechanical follow-on (the port is a line-for-line CUDA mirror, so it is +low-cost insurance even though the Intel-GPU audience is smaller). **Vulkan last** +as the high-effort / high-breadth capstone - it reaches the widest hardware +(AMD + Intel + anything with a Vulkan driver), but the shader-gen pipeline, the +existing variant matrix, the subgroup-width variance, and the per-vendor +validation burden make it the right capstone once the pattern is proven on +Metal + SYCL. + +A reasonable cheaper variant: ship Metal + SYCL together right after the ops PR +(both are register-snapshot ports with no shader-gen step) and treat Vulkan as a +separate later effort. + +-------------------------------------------------------------------------------- +## 5. 
Summary + +- GDN-compute and plain SSM_CONV kernels ALREADY EXIST on Metal, Vulkan and SYCL + (the README's "no Vulkan kernel" line is stale). The Qwen3.6 hybrids run on all + three today via the non-fused path; Layer-2 is about the decode SPEEDUP. +- Per backend the NEW work is: redirect the GDN state write (OP A) + add the ids + read (OP B) to the existing GDN kernel, write ONE new conv-update kernel + (OP C) + its ids variant (OP D), add two tiny gather kernels, and tighten + supports_op + the op-handler branch + (Vulkan) the pipeline/push-constant/ + descriptor wiring. The builders, CPU refs, model graph and tests are shared and + already done. +- Bit-exactness is feasible everywhere and per-backend by construction (the + fusions redirect addresses, not the f32 reduction order); `test-backend-ops` + (backendX-vs-CPU) is the gate. +- Sequence: ops-first PR (incl. the capability-driven replacement for 0030's + name allow-list), then Metal, then SYCL, then Vulkan. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md new file mode 100644 index 000000000000..1cdd31ab9fe1 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md @@ -0,0 +1,417 @@ +# vLLM Parity - Final State (Qwen3.6 NVFP4 on GB10) + +> 2026-06-30 update: this document records the earlier final-state verdict. The +> investigation has since been reopened; see `GB10_PARITY_REOPEN_SPEC.md`, +> `GB10_PARITY_PHASE0_RESULTS.md`, and the active `docs/superpowers/plans/` +> Phase 6/Phase 7 files for the current measured state and follow-up scope. + +> **Status: CLOSED.** This is the standing record of the exhaustive GB10 (DGX +> Spark, sm_121) parity investigation for `llama-cpp-localai-paged` against vLLM +> on the Qwen3.6 hybrid gated-DeltaNet NVFP4 models. 
It exists so the +> investigation is **never re-litigated**: every lever attempted, its verdict, +> its key number, and the structural floors that bound the result are recorded +> below with the artifact each number came from. The one-line conclusion: +> **prefill is genuinely capped at 36-43% of vLLM (FP4-MMQ optimality + GDN +> O(C^2) intra-chunk complexity; prefill is not CUDA-graph-replayed, so these are +> real floors, not profiling artifacts); decode-serving is near-parity at ~86% of +> vLLM's true GPU-steady decode (the long-standing ~56% headline was a +> measurement / operating-point artifact, corrected below), with the residual +> ~14% being vLLM's mature fused-Marlin + Triton-elementwise kernels that are not +> cheaply replicable on GB10.** + +Companion docs (design/rationale, not re-summarized here): the patch-series +[`README.md`](../README.md) (section 5 dev-notes), `VLLM_PARITY_LEVER_MAP.md`, +`PREFILL_GEMM_SCOPE.md`, `PREFILL_GEMM_RESULTS.md`, `DECODE_SERVING_SCOPE.md`, +`TENSORCORE_GDN_SCOPE.md`, `TENSORCORE_GDN_BUILD_PLAN.md`, `PAGED_BITEXACT_NOTE.md`. + +Source key (every number below cites one of these): +- **CDEF** = the definitive same-session both-engine run `dgx:~/bench/COMBINED_DEFINITIVE.txt` (2026-06-29, GIT_HEAD `a7d439e`, h2h_cli3 OpenAI `/v1/completions`, fresh-nonce prompts, ignore_eos, ptok128 gen128; paged `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1`, GDN M5 on, S1 on, S3 off; vLLM 0.23.0 gpu-util 0.85 max-model-len 4096 max-num-seqs 256 tp1). +- **README** = the static `llama-batched-bench` table in [`README.md`](../README.md) section 4 (npp128/ntg128; patched vs stock-`9d5d882d` vs vLLM-prior). +- **PGR** = `PREFILL_GEMM_RESULTS.md`. **LMAP** = `VLLM_PARITY_LEVER_MAP.md` (profile-validated section). **DSS** = `DECODE_SERVING_SCOPE.md`. **MG** = `dgx:~/bench/marlin_gate/`. **GDNAB** = `dgx:~/bench/gdn_p1_ab/`. **0034/0035** = patch headers in `patches/paged/`. 
+- **HNP** = the clean, uncontended, **graph-node-traced** both-engine high-N decode profile (2026-06-30): `dgx:~/highN_prof2/*.nsys-rep` (paged, npl=256) + `dgx:~/highN_vllm/*.nsys-rep` (vLLM), captured with `nsys --cuda-graph-trace=node` and decomposed by the **difference method** (per-token cost = ntg=64 profile minus ntg=16 profile). **This supersedes every earlier decode decomposition** (LMAP included): those were taken without `--cuda-graph-trace=node`, which collapses each graph replay into one opaque launch and made the per-kernel decode attribution an artifact (see 2c). +- "estimated" marks any figure not pinned to one of the above. + +--- + +## 1. The benchmark (paged vs vLLM vs stock) + +Two models: the MoE **Qwen3.6-35B-A3B-NVFP4** (decision model, 256 experts top-8, +30 GDN + 10 full-attn layers + a dense shared expert per layer) and the dense +**Qwen3.6-27B-NVFP4** (48 GDN + 16 full-attn). All numbers GB10 / CUDA 13 / +sm_121. The current backend pin is `0ed235ea2c17a19fc8238668653946721ed136fd`; +the CDEF benchmark artifact itself records the dev-tree commit that produced +those binaries. + +### 1a. Prefill (S_PP, prefill tokens/s) + +Paged = static `llama-batched-bench` PP block; vLLM = server prefill-phase rate +at the same prompt length. Source: **CDEF**. + +| Model | shape | paged S_PP | vLLM S_PP | paged % of vLLM | +|---|---|---:|---:|---:| +| MoE 35B-A3B | PP=512, B=32 | 2309.6 | 6418.9 | **36.0%** | +| MoE 35B-A3B | PP=2048, B=32 | 2401.9 | 6748.5 | **35.6%** | +| Dense 27B | PP=512, B=32 | 960.3 | 2277.3 | **42.2%** | +| Dense 27B | PP=2048, B=32 | 1010.2 | 2360.1 | **42.8%** | + +Prefill is the largest absolute gap. 
The profile-validated decomposition (LMAP, +nsys both-engine, MoE decision model) attributes it as: paged **395.9 us/tok** vs +vLLM **197.0 us/tok** (total gap ~198.9 us/tok), split GDN **+59.2** (~30%), +MoE-GEMM **+56.5** (~28%), ew/layout/glue **+21.4** (~11%), act-quant **+15.2** +(~8%), bf16-proj **+13.7** (~7%), gate **+12.4** (~6%), norms **+11.1** (~6%), +dispatch **+5.9** (~3%). + +### 1b. Decode / serving (per-seq + aggregate decode t/s), staggered serving + +Source: **CDEF** NPL runs (continuous serving via h2h_cli3). `decode_agg` = +aggregate decode t/s; `perseq` = decode tok/s/seq; PEAK_GB = peak process VRAM. + +**MoE Qwen3.6-35B-A3B-NVFP4:** + +| N | paged decode_agg | vLLM decode_agg | paged perseq | vLLM perseq | perseq % of vLLM | paged TTFT_mean ms | vLLM TTFT_mean ms | paged PEAK_GB | vLLM PEAK_GB | +|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:| +| 8 | 208.1 | 297.1 | 25.68 | 36.68 | **70.0%** | 747.9 | 204.2 | 50.03 | 112.42 | +| 32 | 379.1 | 575.7 | 11.40 | 17.49 | **65.2%** | 2377.9 | 640.8 | 52.13 | 112.20 | +| 128 | 611.9 | 958.2 | 4.14 | 6.97 | **59.4%** | 7058.3 | 1965.4 | 60.57 | 112.51 | +| 256 | 717.8 | 1177.4| 2.29 | 4.12 | **55.6%** | 13533.6 | 3937.3 | 70.18 | 112.55 | + +**Dense Qwen3.6-27B-NVFP4:** + +| N | paged decode_agg | vLLM decode_agg | paged perseq | vLLM perseq | perseq % of vLLM | paged TTFT_mean ms | vLLM TTFT_mean ms | paged PEAK_GB | vLLM PEAK_GB | +|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:| +| 8 | 84.0 | 72.1 | 10.42 | 8.93 | **116.7%** | 1914.7 | 493.1 | 77.97 | 109.63 | +| 32 | 196.5 | 214.7 | 5.83 | 6.56 | **88.9%** | 7023.3 | 1735.4 | 83.04 | 109.65 | +| 128 | 343.8 | 431.8 | 2.18 | 3.10 | **70.3%** | 19468.9 | 5455.0 | 101.93 | 109.67 | +| 256 | 380.3 | 532.5 | 1.13 | 1.82 | **62.1%** | 36306.8 | 10824.1 | 114.63 | 109.67 | + +End-to-end aggregate `agg_tps` (incl. 
prefill contention), **CDEF**: MoE paged +179.7/301.4/425.6/459.9 vs vLLM 278.5/515.6/798.3/915.4 at N=8/32/128/256; dense +paged 72.6/141.4/205.8/213.3 vs vLLM 69.4/193.3/346.6/394.7. + +**Reading the table.** Dense decode is **ahead of vLLM at low concurrency +(116.7% at N=8)**. The high-N percentages here (perseq ~56%, decode_agg ~61% at +N=256) are **server-window** numbers and **understate true engine parity**: they +divide the paged serving rate by vLLM's *prefill-overlap-inflated* server rate. +The corrected, graph-node-traced decomposition (section 2c, **HNP**) shows paged +decode at **~86% of vLLM's true GPU-steady decode**, with the remaining +server-window gap being an S3-recoverable serving graph-reuse overhead (2d). The +earlier "this is just the bandwidth floor / vLLM pays equally" reading was a +**profiling artifact** and is corrected in 2c. + +**PEAK_GB is the structural memory advantage.** vLLM's PEAK_GB is a **fixed +~109-112.5 GB reservation** (the `--gpu-memory-utilization 0.85` block-manager +pre-allocation of the ~128 GB unified LPDDR5x) and does **not** vary with N. The +paged backend allocates KV on demand, so its peak **grows with load** but stays +far below vLLM at low/mid concurrency: MoE N=8 uses **50.0 vs 112.4 GB (~2.2x +less)**, and even at N=256 MoE is 70.2 vs 112.6 GB. This is the headline of +section 5 (memory advantage / higher max concurrency per GPU) and is real, +bit-exact, and not an operating-point trick. + +### 1c. 
Patched vs true-stock (static batched-bench, the patch-series multiplier) + +Stock `9d5d882d` was not in the same-session CDEF run; the patched-vs-stock +multiplier is the static `llama-batched-bench` table (**README**, npp128/ntg128, +decode t/s): + +| | N=8 | N=32 | N=64 | N=128 | max x over stock | +|---|---:|---:|---:|---:|---:| +| Dense patched / stock | 85.3 / 68.3 | 211.9 / 119.9 | 305.2 / 142.8 | 382.1 / 155.1 | **2.46x** | +| MoE patched / stock | 230.3 / 186.7 | 466.4 / 267.4 | 622.4 / 320.5 | 784.3 / 347.2 | **2.26x** | + +In that **static** regime the patched decode kernel is **at vLLM parity** +(dense 121/100/99/91% of vLLM-prior across widths; MoE 90/93/91/89%). The serving +table in 1b is the harder continuous regime; the gap between the two regimes is +the subject of section 2 (serving) and was fully closed on the host side. + +--- + +## 2. Complete lever map (every attempt, verdict, key number) + +Bit-exactness convention (per `PAGED_BITEXACT_NOTE.md`): the gate is **per-path**. +Dense greedy md5 `5951a5b4`; paged-MoE greedy md5 `8cb0ce23` (a benign +FP-accumulation-order reorder vs non-paged `07db32c2`, KL-validated). "BE" = greedy +md5 byte-identical; "KL-benign" = new FP path, gated by KL-divergence within band. + +### 2a. PREFILL - weight GEMM track (verdict: FP4-MMQ is optimal on GB10) + +Four kernels were built or ported to beat MMQ at large-M MoE prefill. **All +rejected; FP4-MMQ stays the shipped path.** The decisive surprise (LMAP, both-engine +nsys): **on sm_121 vLLM itself does not run native FP4** - it runs **Marlin W4A16** +(FP4 dequant to bf16 in-register + bf16 GEMM) for experts and FP8 projections, +capped at bf16-tensor-core peak (~half FP4 peak). So MMQ's native FP4 path is +already structurally competitive on this exact silicon. 
+ +| Lever | What | Verdict | Key number | Source | +|---|---|---|---|---| +| **0033** dequant -> bf16 cuBLAS | route large-M NVFP4 dense GEMM off MMQ to dequant->bf16 nvjet/cuBLAS | **REJECTED** (regression) | dense S_PP **-49% / -42% / -29%** at M=512/1024/2048; bit-exact md5 identical, KL-better | PGR | +| dense-cuBLAS reroute (full sweep) | the same reroute across the dense + MoE prefill sweep | **REJECTED** | **-31% to -62%** band (estimated; the artifact-pinned dense subset is -29% to -49%, PGR) | LMAP / recorded verdict | +| **0034** native FP4-MMA W4A4 | Blackwell `mxf4nvf4` OMMA large-M kernel, PoC verbatim | **REJECTED in-backend** | PoC `~103 TFLOP/s` (57.7% of FP4 peak, beats cuBLAS-bf16, NMSE=0), but the standalone PoC win **did not hold in-backend** | 0034 header / LMAP | +| **0035** W4A16-Marlin grouped MoE | FP4->bf16 in-register dequant + bf16 `mma.sync`, zero act-quant tax (vLLM's exact sm_121 shape) | **REJECTED** (perf regression) | correct + bit-exact-gated: `test-backend-ops MUL_MAT_ID` 81/81; KL **benign and better** (marlin KLD **0.131** < MMQ **0.137**, same-top-p 84.6% vs 84.3%); md5 short identical, long one benign flip - but **-39%** S_PP vs MMQ (estimated/recorded; MG holds only the correctness+KL gate) | 0035 header, MG | +| offline-repack Marlin / vLLM-verbatim Marlin | repack weights offline to Marlin layout; port vLLM's Marlin kernel verbatim | **REJECTED** | verbatim-Marlin: **correct but -39%**; offline-repack: workflow built (shared the GPU lock, `combined_definitive.sh:29`), same bf16-peak ceiling, no win | recorded verdict / combined_definitive.sh | + +**Why the whole track loses (the structural reason):** bf16 tensor-core peak on +GB10 is **~half FP4 peak** (PGR s3), so any dequant->bf16 kernel caps at ~half the +throughput the native FP4-MMQ read reaches; and the dequant write is an +un-amortized weight-sized memory pass (~8x the FP4-read byte traffic, PGR). 
The +W4A16 angle was the most promising because it *also* erases the ~8% act-quant tax +vLLM never pays - but the bf16-peak ceiling still made it a net regression. **MMQ +is optimal; the GEMM bucket is not winnable on GB10 with the available kernels.** + +### 2b. PREFILL - GDN chunked-scan track (verdict: M5 tf32 C=16 is the shipped winner) + +The gated-DeltaNet chunked scan is the **#1 single prefill-gap contributor** +(+59.2 us/tok, ~30% of the gap; LMAP). vLLM's FLA `chunk_gated_delta_rule` runs the +same math at **36.5 us/tok vs paged 95.7 = 2.62x** (LMAP), pushing intra-chunk Gram +products through tensor cores. The series chased that headroom. + +| Lever | What | Verdict | Key number | Source | +|---|---|---|---|---| +| **0031** scalar-serial chunked scan | FLA-style chunk gated-delta-rule, scalar/serial form (`GDN_TC=0`) | superseded | math-correct (`test-backend-ops` 91/91, <=1e-7 NMSE) but **~761 vs ~971 t/s = ~22% slower** at the GB10-forced C=16 | README s5 | +| **0047 / M5** tf32 tensor-core scan | full form-T solve + state-update on tf32 `m16n8k8` mma, f32-only re-port | **SHIPPED (default-on under paged)** | MoE prefill S_PP **+3.5% @npp512 (3x A/B), +17.7% @npp2048**; decode unchanged; bit-exact-benign (`GATED_DELTA_NET` 46-94/94, md5 == canonical) | README s3/s5 | +| bf16 CONFIG-C (M8) | bf16 `Kc/Qc` + 2 C*C scratch, C->64 + 2 blk/SM | **REJECTED** (not in f32-only series) | the run that confirmed the geometry (CDEF GIT_HEAD), then dropped | CDEF / README s5 | +| bf16-C16 | bf16 Gram at C=16 | rejected | no win over tf32-M5; bf16 mantissa unsafe on the state-coupled products | GDN build-plan s4 | +| BV block-occupancy A/B (tf32) | raise blocks/SM to test if occupancy is the bound | **REJECTED** (occupancy is NOT the bound; latency is wave-hidden) | two arms statistically equal: **1844 vs 1814 S_PP (-1.04%, within noise)** | GDNAB armA/armB | +| bf16-C64 | bf16 Gram at the larger C=64 chunk | **REJECTED** | **-18.75%** - the O(C^2) intra-chunk 
triangular-solve + serial recurrence dominates, so growing C hurts | recorded verdict / GDN build-plan | +| Phase 10 C32 slab M5 | C=32 with two `dv_tile=64` slabs, default-off `GDN_C32_SLAB=1` | **REJECTED** | md5-clean after tail-row zeroing, but S_PP regressed: MoE 2048 **2430.32 -> 2054.86**, dense 2048 **1019.25 -> 903.73** | phase10 gates/ab | +| Phase 11 QS-early M5 | move `QS = Qc * S0` earlier, default-off `GDN_M5_QS_EARLY=1` | **REJECTED** | md5-clean, but S_PP regressed slightly: MoE 2048 **2441.54 -> 2420.26**, dense 2048 **1021.06 -> 1015.77** | phase11 gates/ab | +| Phase 12 shared-A/Ai cost model | f32 Ai scratch shared across two C32 value slabs | **GO to one prototype** | BT32 f32 scratch at npp2048,npl32: MoE 256 MiB / 768 MiB Ai traffic; dense 384 MiB / 1152 MiB Ai traffic | phase12 cost model | +| Phase 13 Global-Ai32 | precompute f32 Ai once, consume from two C32 `dv_tile=64` slabs | **REJECTED** | md5-clean, but S_PP regressed: MoE 2048 **2425.10 -> 2097.76**, dense 2048 **1016.14 -> 918.19** | phase13 gates/ab | + +**Why the bottleneck is not occupancy/dtype:** the cost is the **O(C^2) +intra-chunk triangular solve + the serial inter-chunk recurrence dependency**, not +grid occupancy (BV: -1.04%, latency is wave-hidden) and not Gram dtype (bf16-C64: +-18.75%). GB10's 99 KB +dynamic-smem cap forces **C=16** (the 128x128 f32 state alone is 64 KB of the +all-shared layout), and at this head dim the only win is tensor cores on the +intra-chunk products, not chunking or wider chunks. M5 tf32 at C=16 is exactly +that and is the shipped winner; it does not fully close the 2.62x because vLLM's +mature FLA blocked-solve is a more complete tensor-core implementation. + +Post-record caveat closed: Phase 13 tested the one permitted +`GDN_GLOBAL_AI32=1` prototype. It was correctness-clean but slower, so GDN kernel +work on GB10 should stop rather than moving to f16 Ai or additional local +reorders. + +### 2c. 
DECODE / serving (verdict: near-parity at ~86% of vLLM's true GPU-steady decode; the earlier "BW-floored / vLLM pays equally" was a profiling artifact) + +**Methodology correction - why every earlier decode decomposition was wrong.** +Decode runs as a **replayed CUDA graph**. `nsys` *without* `--cuda-graph-trace=node` +collapses each graph replay into a **single opaque launch**, so the per-kernel +attribution in every prior decode profile (the "paged 159 us/tok, GPU ~16% busy, +host-bound, 5.4x more GPU-efficient per token" picture, and the conclusion that the +high-N gap was a pure bandwidth floor vLLM pays equally) was an **artifact of graph +collapse, not real per-token cost**. The correct method, used for the numbers below +(**HNP**, clean uncontended node, 2026-06-30), is `nsys --cuda-graph-trace=node` +plus the **difference method**: per-token cost = the ntg=64 profile minus the +ntg=16 profile, isolating per-token-linear work from fixed per-step overhead. Under +this method **paged decode at npl=256 is 99% GPU-busy (GPU-idle only 1.4%), NOT +host-bound** - the opposite of the collapsed-graph reading. This supersedes the +LMAP decode decomposition. + +**The real per-token decomposition (paged, npl=256, HNP)** - GPU-steady ~1082 +us/tok (924 t/s): + +| Bucket | us/tok | % of decode | Note | +|---|---:|---:|---| +| GDN recurrent scan | 553 | **51%** | **LINEAR in batch** - the dominant cost; shared BW floor (below) | +| NVFP4 expert GEMM | 254 | 23% | amortizes with batch | +| bf16 projections | 73 | 7% | | +| elementwise | 57 | 5% | | +| SSM conv | 31 | 3% | | +| rest | small | - | | +| GPU-idle | - | **1.4%** | not host-bound | + +**The gap reconciled (the numbers must sum).** The headline N=256 figures (perseq +~56%, decode_agg ~61%, section 1b) were paged-**server** **718** over vLLM-**server** +**1177**. 
But the vLLM server number is **inflated ~8 pts**: vLLM's true GPU-steady +decode is **1078 t/s**, and its chunked-prefill overlap inflates the +server-measured decode window. The reconciled chain: + +| Measurement | t/s | % of vLLM-server (1177) | +|---|---:|---:| +| vLLM server (CDEF) | 1177 | 100% | +| vLLM **true GPU-steady** decode | 1078 | 92% | +| llama **GPU-steady** decode | 924 | 78.5% (**= 86% of vLLM's true 1078**) | +| llama server (CDEF) | 718 | ~60.7% (61%) | + +So **vs vLLM's true GPU-steady decode, paged is ~86%, not ~56%.** The ~56% headline +conflated two distinct things: vLLM's prefill-overlap-inflated server window, and +the paged serving graph-reuse overhead. The **~17 pt** drop from llama GPU-steady +(78.5%) to llama server (60.7%) is exactly that **serving graph-reuse overhead**, +which is **S3-recoverable** (2d). + +**GDN is a shared BW floor where paged is ahead.** The GDN recurrent scan moves +**~32 GB/step of f32 recurrent-state traffic**; paged runs it at **83% of the +273 GB/s LPDDR5x peak vs vLLM's 79%**. Both engines' high-N sublinearity (only +**1.17-1.18x throughput for a 2x batch**) comes from this **shared** floor - it is +not a paged-specific loss, and paged is the faster of the two on it. + +**The residual ~14 pt GPU-steady gap is real but not cheaply closable.** vLLM's +GPU-steady 1078 vs paged 924 decomposes into two buckets: the **MoE expert path +(~+11 ms)** - vLLM's fused Marlin persistent-tiling vs ggml's separate act-quant + +MMQ - and **elementwise (~+10 ms)** - vLLM fuses it into one Triton kernel. Both +fusions were attempted and rejected (table below). Closing the residual needs +vLLM's mature Marlin tiling (our own ggml Marlin port already lost **-19.6%**) plus +multi-stream overlap (hard inside a single-stream CUDA graph): **low-EV, +multi-week, GB10-uncertain**. 
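
As a rough cross-check of the shared-floor claim, the quoted GDN traffic and bandwidth fraction can be compared against the GDN bucket from the per-token decomposition. This sketch uses only the rounded figures above and rests on two assumptions that the text does not state: GB means 1e9 bytes, and one decode step at npl=256 emits one token per sequence, so per-token cost times 256 gives per-step time.

```python
# Cross-check: does the quoted GDN bandwidth figure match the profiled bucket?
# Inputs are the document's rounded figures; the unit conventions are assumed.

PEAK_BW_GBPS = 273.0       # LPDDR5x peak quoted above
PAGED_BW_FRACTION = 0.83   # paged GDN scan runs at 83% of peak
STATE_TRAFFIC_GB = 32.0    # ~32 GB of f32 recurrent-state traffic per step
GDN_US_PER_TOK = 553.0     # GDN bucket from the HNP decomposition
NPL = 256                  # sequences per step, one token each (assumed)


def step_ms_from_bandwidth(traffic_gb, peak_gbps, fraction):
    """Time to move `traffic_gb` at `fraction` of `peak_gbps`, in ms."""
    return traffic_gb / (peak_gbps * fraction) * 1e3


def step_ms_from_profile(us_per_tok, tokens_per_step):
    """Per-step time implied by a per-token cost, in ms."""
    return us_per_tok * tokens_per_step / 1e3


bw_ms = step_ms_from_bandwidth(STATE_TRAFFIC_GB, PEAK_BW_GBPS, PAGED_BW_FRACTION)
prof_ms = step_ms_from_profile(GDN_US_PER_TOK, NPL)
print(f"bandwidth-implied GDN step: {bw_ms:.1f} ms")
print(f"profile-implied GDN step:   {prof_ms:.1f} ms")
```

Under those assumptions both estimates land near 141 ms per step. That is consistent with the GDN bucket being bandwidth-limited, not launch- or occupancy-limited, though it is a sanity check on rounded numbers, not a measurement.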
+ +**Decode / fusion levers (verdicts).** + +| Lever | What | Verdict | Key number | Source | +|---|---|---|---|---| +| act-quant folded into ggml MMQ | erase the act-quant pass by quantizing the y-operand inside the MoE expert MMQ kernel (vLLM's fused-Marlin single-pass shape) | **REJECTED** (regression) | **-79.4%**: ggml MMQ re-quantizes the y-operand **once per weight-row-tile x stream-k split**, with no tensor cores for the inline quant - structural, ggml MMQ lacks vLLM's persistent single-pass tiling | HNP / recorded verdict | +| norm + quant + silu fusion | fold the elementwise path into one launch (vLLM's Triton kernel) | **REJECTED** (architecturally infeasible) | `ggml_cuda_can_fuse` cannot express it: FP4 quant is a **mul_mat-internal prologue, not a cgraph node**; the norm is already fused (0042/0044); silu is separated from the norm by **2 GEMMs + the router** | recorded verdict | +| Q8_0 / FP8 projection | quantize the bf16 GDN/attn projections (premise: vLLM uses FP8 here) | **REJECTED** (regime error, not premise error) | vLLM **does** use FP8 projections (confirmed from `hf_quant_config.json` `MIXED_PRECISION`), but at N=128/256 projections are only **~12% of the decode stream**, so this closes **<=6%, not the gap** | HNP / hf_quant_config.json | +| NVFP4 the bf16 GDN/attn projections | drop projections to NVFP4 (more aggressive than FP8) | **REJECTED** | **KL-fail, ~+6% PPL**; vLLM keeps the SAME bf16/FP8 projections, never NVFP4 | LMAP | +| W4A16-Marlin MoE decode | Marlin grouped expert GEMM on the decode path | **REJECTED** | BW-floored wash, **~5% slower** kernel | LMAP | +| bf16-tau per-head SSM (0026) | per-head bf16 tau on the SSM decode | **DROPPED** | flat **780.6 vs 780.0 t/s** once the fusion patches landed | README s5 | +| D3 FA-split / D4 GDN-width-adaptive | the older "off critical path" decode levers | **SUPERSEDED reasoning** | originally rejected via the now-debunked "5.4x faster / host-bound" reading; under HNP the GDN scan 
**is** the critical path (51%), but it is the shared BW floor where paged already leads (83% vs 79%), so neither is a win | HNP | + +**Dense decode is AHEAD at low N (116.7% @ N=8, CDEF)** because the GPU is +underutilized there and the paged path's per-token efficiency wins; this is the one +operating point where paged is unambiguously faster than vLLM. + +### 2d. SERVING / engine (verdict: host loop and scheduler closed; spec-decode orthogonal) + +| Lever | What | Verdict | Key number | Source | +|---|---|---|---|---| +| **0040 / S1** paged decode-graph reuse | correct `can_reuse` keyed on bucketed block-table dims | **SHIPPED (default-on)** | serving graph reuse **0% -> 72.2%** (with S3); static **0% -> 95.5%** | README, DSS | +| **0041 / S3** decode-shape-stable scheduling (`LLAMA_PAGED_DECODE_STABLE`) | keep prefill out of decode steps for reuse-stable shapes | **SHIPPED default-OFF** (opt-in throughput-max knob) | recovers the **~17 pt serving graph-reuse overhead** (llama server 60.7% -> toward GPU-steady 78.5%, 2c) at a TTFT cost; default-on regressed real serving: **2.5x worse TTFT** (60s vs 24s @N=256), **20-29% lower** end-to-end throughput, hence opt-in | README, DSS, HNP | +| **0043 / D1** full-step MoE decode CUDA graph | graph the whole decode step incl. 
grouped-MMQ MoE dispatch | **SHIPPED (default-on)** | +2.6% (npl128) to +5-13% (npl32); the D1 premise "host-sync on MoE-routing readback" was **REFUTED** (sync count identical graphs on/off; 99% GPU-busy static) | README s5 | +| S2 double-buffer set_inputs | overlap host input build with GPU | **DROPPED** | `set_inputs` is **~0.05 ms/step** - nothing to recover (the rebuild was the cost) | DSS | +| whole-step graph / host loop | the host scheduling loop as the serving residual | **CLOSED (~0-1%)** | baseline reuse 0% (agg 757.6) **statistically equal** to S1+S3 reuse 72% (agg 763.3); `hostproc` only ~4-8% of the per-step wall = **measured dead** | DSS | +| padded / fixed-slot decode | pad decode width to `--parallel` for ~100% reuse | **REJECTED (built, GPU-tested)** | inert (md5 bit-exact) but **regresses at every concurrency**; N=8 burst 28.16 -> 6.05 tok/s/seq (~4.6x slower); serving decode is **GPU-compute-bound**, dummy-row compute > reuse recovered | DSS | +| speculative decode (MTP) | draft + verify; greedy is bit-exact | **REJECTED for current GB10 serving** | Phase 14 passed safety, but Phase 15 direct serving A/B regressed at every tested concurrency (n128 decode agg 662.4 -> 138.5 tok/s) despite high acceptance; Phase 16 profile supports graph-reuse loss as root cause (`graphs reused` 62 -> 1 in the small nsys run). Not a parity lever unless a future graph/batch-shape fix changes this result | LMAP | + +The serving regime was the one place the static-bench parity did not carry over +(paged ~3.7 vs vLLM ~5.9 tok/s/seq, -39%, DSS). S1 made the decode step reusable +and the host loop was driven to ~0-1% of the wall. The graph-node-traced HNP +profile (2c) then resolves the remaining serving gap into two parts: the **~17 pt +serving graph-reuse overhead** (S3-recoverable via this knob) and the **~14 pt +GPU-steady kernel gap** vs vLLM's true 1078 t/s (vLLM's fused-Marlin MoE + Triton +elementwise, 2c). 
Both are real; neither is the "pure LPDDR5x floor, vLLM pays +equally" story the collapsed-graph profile implied. + +--- + +## 3. Structural floors (not closable on GB10) + +These are the hardware/algorithm ceilings the investigation hit. They are why +parity is unreachable on this part, and they are the levers' "why" in one place. + +1. **LPDDR5x bandwidth (~273 GB/s) bounds the GDN recurrent scan - a *shared* + floor where paged leads.** The GDN scan is the dominant decode bucket (553 + us/tok, 51%, LINEAR in batch; HNP) and moves ~32 GB/step of f32 recurrent + state; paged runs it at **83% of the 273 GB/s peak vs vLLM's 79%**, and both + engines' high-N sublinearity (1.17-1.18x for a 2x batch) is this same floor. + This is **not** the explanation for the high-N server-window gap: the + graph-node-traced HNP profile (2c) shows paged decode **99% GPU-busy at ~86% of + vLLM's true GPU-steady decode**, with the server-window ~56% being a + prefill-overlap measurement artifact (~8 pt) plus an S3-recoverable graph-reuse + overhead (~17 pt), not a bandwidth floor vLLM pays equally. The residual ~14 pt + GPU-steady gap is kernel maturity (point 4 below + 2c), not bandwidth. On + datacenter HBM (B200: ~8 TB/s) this GDN floor lifts ~30x. + +2. **FP4-MMQ optimality at GB10's tensor-core ratios.** Native FP4-MMQ at M<=128 is + at the FP4 weight-BW floor (decode) and beats every dequant->bf16 alternative at + large M (prefill), because bf16 TC peak is ~half FP4 peak on sm_121 and the + dequant pass is an un-amortized memory pass (PGR). vLLM itself is on a **bf16 + Marlin fallback** here (no tcgen05/CUTLASS-grouped FP4 on consumer Blackwell, + CUTLASS #3096), so there is no faster GEMM to port. + +3. **GDN O(C^2) intra-chunk solve + serial inter-chunk recurrence.** The chunked + scan's cost is the triangular A-inverse solve (quadratic in chunk size C) plus + the strictly-serial cross-chunk state carry, with C forced to 16 by the 99 KB + smem cap. 
Occupancy (BV: -1%) and dtype (bf16-C64: -18.75%) are not the bound; + only a fuller tensor-core blocked-solve closes the residual 2.62x, and M5 tf32 + captures the tractable part. + +4. **vLLM's mature fused kernels (FLA blocked-solve, fused-Marlin MoE, Triton + elementwise) are tuned for HBM.** They are the source of both the prefill cap + and the residual ~14 pt decode GPU-steady gap (2c): the fused-Marlin + persistent-tiling MoE path (~+11 ms) and the single-kernel Triton elementwise + (~+10 ms). The matching ggml fusions were rejected as infeasible or regressive + (2c): folding act-quant into MMQ regressed -79.4% (no single-pass tiling), and + norm+quant+silu cannot be expressed via `ggml_cuda_can_fuse`. The FLA chunked + GDN, Marlin grouped GEMM, and FULL/PIECEWISE cudagraphs all assume datacenter + bandwidth and TC ratios; they are real wins on B200, which is why closing the + residual is a different-hardware question (mature kernels + multi-stream + overlap), not a missing single-lever optimization. + +--- + +## 4. Shipped wins (all bit-exact / KL-benign) + +What the series actually banks, all gated per-path: + +- **FP4-MMQ MoE/dense GEMM** - native Blackwell FP4-MMA, at the FP4 weight-BW + floor (decode parity) and beating every dequant alternative at prefill. The + reason the whole 2a track stays default-off. +- **M5 tf32 tensor-core chunked GDN prefill (patch 0047)** - default-on under + `LLAMA_KV_PAGED`; MoE prefill **+3.5% @npp512, +17.7% @npp2048**, decode + untouched, bit-exact-benign. +- **0042 fused residual-add + RMSNorm + weight-mul** - one kernel for `h = x + + sub; n = rms_norm(h) * w`; dense S_PP +0.5%, bit-exact. +- **0044 fused gated RMSNorm + SiLU gate-mul (GatedRMSNorm fusion)** - the GDN + output norm `(rms_norm(x)*w)*silu(z)` folded into one launch (672 -> 336 + launches @npp512); S_PP dense +1.1%, MoE +0.9%, `test-backend-ops` 12979/12979. 
+- **0046 GDN-prefill geometry gate** - gates patch 0022's decode occupancy retune + by scan length so it stops regressing dense prefill; recovers **+7.2%** dense + prefill back to stock parity while keeping the decode win, bit-exact. +- **SSM decode fusion stack (0018-0022, 0028)** - in-place state, fused gather, + o_proj MMQ reshape, conv in-place, occupancy retune; the **2.26x/2.46x over + stock** decode multiplier (README). +- **Serving host loop closed (0040 S1, 0043 D1)** - decode-graph reuse and + full-step graph capture; host loop driven to ~0-1% of the serving wall. +- **The memory advantage** - **1.5-3x lower VRAM** than vLLM (NVFP4-resident, no + persistent bf16 dequant copies; CDEF PEAK_GB e.g. MoE N=8 50 vs 112 GB), which + is a legitimate higher-max-concurrency-per-GPU operating point. +- **Low-N decode efficiency** - dense decode **ahead of vLLM (116.7% @ N=8)**. +- **Bit-exact output** - per-path greedy md5 stable (dense `5951a5b4`, paged-MoE + `8cb0ce23`), the sacred gate held through the entire series. + +--- + +## 5. The parity verdict and the path + +**Verdict (revised): PREFILL is genuinely capped on GB10; DECODE-SERVING is near +vLLM parity (~86% of its true GPU-steady decode), with the long-standing ~56% +headline now identified as a measurement / operating-point artifact.** Prefill +sits at **36% (MoE) / 43% (dense)** of vLLM and is a real floor (FP4-MMQ optimality ++ GDN O(C^2) intra-chunk complexity; prefill is **not** CUDA-graph-replayed, so +unlike decode these numbers are not profiling artifacts). The GDN chunked scan is +at its tractable tensor-core win (M5) and the prefill GEMM bucket is FP4-MMQ-optimal +(every alternative rejected; vLLM is itself on a bf16-Marlin fallback here). 
For +decode, the graph-node-traced HNP profile corrects the record: paged decode is +**99% GPU-busy at ~86% of vLLM's true GPU-steady decode (924 vs 1078 t/s)**; the +~56% server-window figure was vLLM's prefill-overlap inflation (~8 pt) plus the +S3-recoverable serving graph-reuse overhead (~17 pt). The residual **~14 pt** +GPU-steady gap is vLLM's mature fused-Marlin MoE (~+11 ms) and Triton elementwise +(~+10 ms) kernels; the matching ggml fusions were rejected (act-quant-into-MMQ +-79.4%, norm+quant+silu infeasible), and closing the residual needs mature Marlin +tiling (our port lost -19.6%) plus multi-stream overlap - low-EV, multi-week, +GB10-uncertain, not a free bit-exact lever. + +**The honest framing:** on GB10 the paged backend is **at or ahead of vLLM at low +concurrency (dense 117% @N=8), uses 1.5-3x less memory, and is bit-exact**, runs +high-N decode at **~86% of vLLM's true GPU-steady decode** (the ~56% server-window +number is a measurement artifact, 2c), and sits at **~36% (MoE) / ~43% (dense) of +vLLM prefill**. The prefill residual is a real FP4-MMQ + GDN-O(C^2) floor; the +~14 pt decode residual is vLLM's mature fused kernels, not engineering debt and not +a cheap lever. + +**The path to parity is different hardware.** A datacenter Blackwell (B200, +~8 TB/s HBM, native tcgen05/CUTLASS FP4, TMEM) lifts the bandwidth floor ~30x and +**restores exactly the vLLM advantages that lose on GB10**: its FLA blocked-solve +GDN, its Marlin/CUTLASS grouped FP4 GEMM, and its HBM-tuned full-cudagraph decode +all assume that bandwidth and those TC ratios. On that hardware the parity question +is re-opened from scratch; on GB10 it is closed. Do not re-litigate the GB10 levers +- re-run the methodology on the new silicon instead. + +--- + +*Recorded per `.agents/vllm-parity-methodology.md` (both-engine ground-truth, +per-lever A/B, record-rejected-levers). 
All GPU numbers from `ssh dgx.casa` +artifacts under `~/bench/`; all in-repo numbers from the docs cited in the source +key. The GPU lock was not touched in producing this document (CPU-only: +artifact-read + write).* diff --git a/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md new file mode 100644 index 000000000000..31cbd9d5f43c --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md @@ -0,0 +1,1896 @@ +# vLLM Parity Lever Map + +> Auto-generated from the parity-exploration workflow. Working artifact (the multi-week path to vLLM parity on prefill + decode, Qwen3.6 NVFP4 / GB10). + +## 1. Prefill gap re-audit + +I have walked the full prefill forward pass against the committed numbers (final_benchmark.csv, PREFILL_GEMM_SCOPE/RESULTS, the 0042 dense nsys profile, the qwen35moe/delta-net graph source). Here is the re-audit. + +--- + +# PREFILL gap re-audit - Qwen3.6 NVFP4 on GB10 + +## Grounding (what the gap actually is) + +From `docs/final_benchmark.csv`, prefill (S_PP, t/s; patched vs vLLM): +- **Dense 27B**: ~922 vs ~1929-2182 → patched is **44-48%** of vLLM. +- **MoE 35B-A3B** (the decision model): ~1510-2177 vs ~5186-6223 → patched is **29-41%** of vLLM. In us/tok at npl64: llama ~471, vLLM ~169 → **gap ~302 us/tok**. + +The GEMM scope's bucket (~232 us/tok llama vs ~68 vLLM) = a **164 us/tok** GEMM difference = **~51-54% of the gap**, and GEMM is **~49% of the llama prefill wall** (232/471). GDN is cited at **~17% of the gap** (vLLM chunked scan ~2.5x cheaper). So GEMM+GDN ≈ **~68% of the gap** by the existing framing - leaving ~30% that the two levers' headline numbers do not name. This audit walks every op to place that residual. 
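The gap arithmetic in the grounding paragraph can be reproduced in a few lines. This is a sketch under stated assumptions: the constants are the rounded us/tok figures quoted in that paragraph, the 17% GDN share is taken as given rather than derived, and `gap_breakdown` and `us_per_token` are hypothetical helper names introduced only for illustration.

```python
# Per-token prefill costs (us/tok) quoted in the grounding paragraph (MoE, npl64).
LLAMA_TOTAL_US = 471.0
VLLM_TOTAL_US = 169.0
LLAMA_GEMM_US = 232.0
VLLM_GEMM_US = 68.0
GDN_SHARE_OF_GAP = 0.17  # cited share, taken as given (not derived here)


def us_per_token(tokens_per_second: float) -> float:
    """Convert a throughput in tokens/s into a per-token cost in microseconds."""
    return 1e6 / tokens_per_second


def gap_breakdown() -> dict:
    """Split the llama-vs-vLLM prefill gap into GEMM, GDN and unattributed parts."""
    total_gap = LLAMA_TOTAL_US - VLLM_TOTAL_US
    gemm_gap = LLAMA_GEMM_US - VLLM_GEMM_US
    gemm_share = gemm_gap / total_gap
    return {
        "total_gap_us": total_gap,
        "gemm_gap_us": gemm_gap,
        "gemm_share_of_gap": gemm_share,
        "gemm_share_of_llama_wall": LLAMA_GEMM_US / LLAMA_TOTAL_US,
        "unattributed_share_of_gap": 1.0 - gemm_share - GDN_SHARE_OF_GAP,
    }


if __name__ == "__main__":
    for key, value in gap_breakdown().items():
        print(f"{key:28s} {value:8.3f}")
```

The unattributed share is the residual the op-by-op walk is meant to place; it moves by a few points depending on which end of the quoted GEMM range is used.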
+ +Important structural facts confirmed from source (`models/qwen35moe.cpp`, `delta-net-base.cpp`, `llama-graph.cpp`): +- MoE = 40 layers (interval-4 → **30 GDN + 10 full-attention**), 256 experts top-8, **plus a dense shared expert on every layer**. Dense = 64 layers (48 GDN + 16 attn). +- **Default prefill GDN is NOT a single kernel.** `fused_gdn_ch`/patch-0031 is default-OFF, so prefill runs `build_delta_net_chunking` - a long graph of `ggml_mul`/`mul_mat`/`solve_tri`/`cumsum`/`tri`/`exp` + many `ggml_cont`/`transpose`/`pad`/`repeat` layout copies + a host-side per-chunk loop. The GDN lever (tensor-core fused kernel) is scoped to replace this **entire** decomposition, so the "11% k_bin_bcast op_mul gating muls" the 0042 patch calls "a separate lever" are in fact **inside the GDN bucket** (a fused GDN kernel subsumes them). + +## Prefill op-share table (MoE decision model; % of the patched/llama prefill wall) + +Estimates triangulated from the committed numbers (232/68 GEMM, 11%/5% from the 0042 dense nsys, the gap arithmetic), not a fresh nsys run. + +| Op (prefill) | ~% of llama wall | vLLM faster? 
why | Covered by GEMM lever | Covered by GDN lever | +|---|---:|---|:---:|:---:| +| Token embed (`get_rows`) | <1% | tie | - | - | +| **NVFP4 weight GEMMs** total | **~49%** | **Yes** - vLLM W4A16-Marlin/cutlass large-M tiles + async pipeline vs MMQ small-tile / new FP4-MMA at 57.7% of peak | **YES** | - | +| ┝ routed-expert grouped GEMM (gate_up+down, `mul_mat_id`) | ~28% | yes (biggest single bucket) | yes | - | +| ┝ shared-expert dense GEMMs (all tokens, ×40) | ~9% | yes | yes | - | +| ┝ GDN in/out projections (wqkv, wqkv_gate, ssm_out) | ~7% | yes | yes | - | +| ┝ attention QKV/O projections (×10) | ~5% | yes | yes | - | +| **GDN chunked decomposition** (30 layers) | **~22%** | **Yes** - vLLM chunked scan ~2.5x cheaper (tensor-core intra-chunk vs llama's f32 graph ops + layout copies + host loop) | - | **YES** | +| ┝ gating/decay muls (`k_bin_bcast op_mul`) | ~11%* | yes | - | yes (fused kernel absorbs) | +| ┝ small f32 mul_mats + `solve_tri` + cumsum/tri/exp | ~7% | yes | - | yes | +| ┝ layout `cont`/`transpose`/`pad`/`repeat` copies | ~4% | yes | - | yes | +| **FlashAttention prefill** (QK^T·softmax·PV, 10 layers) | **~3-6%**† | maybe - L²-growing; bounded at npp=128, larger at serving context | **NO** | **NO** | +| **MoE router + combine/scatter** | **~5-8%** | **Yes** - vLLM fuses gather/weight/scatter into the grouped-GEMM epilogue | **NO** | **NO** | +| ┝ `argsort_top_k`(256→8) + softmax + weight-norm | ~2-3% | yes | no | no | +| ┝ combine: 7× fp32 `add` + weight `mul` (×40) | tested flat in Phase 7 | yes | no | no | +| **Activation quantization** (W4A4 e4m3 pass per GEMM) | **~3-6%** | **Yes - structurally**: vLLM W4A16-Marlin on GB10 has **no** activation-quant step | **NO**‡ | partial | +| Norm + residual tail (attn/post/q/k/ssm/l2/out + adds) | ~4% | small (0042 fused the main one) | - | - | +| RoPE + sigmoid/silu gates + scale | ~2-3% | small | - | - | +| LM head (last-token only in prefill) | <1% | tie | - | - | + +\* 0042 dense profile; in MoE the 
relative share is a bit lower (MoE FFN is heavier). † grows quadratically - under-weighted at the benchmark's npp=128; re-measure at real serving lengths. ‡ the quant pass feeds the GEMM but is a *separate kernel*, not inside the GEMM-lever's mul_mat bucket. + +## Verdict: GEMM + GDN are the two dominant buckets but NOT the whole gap + +They cover ~71% of the prefill wall and the bulk of the gap. Three contributors are **materially uncovered** by either lever: + +### Newly-identified lever 1 - MoE router + combine/scatter (the strongest miss on the decision model) +llama runs the expert routing and recombination as **separate memory-bound ggml ops**: `argsort_top_k` over 256 experts, softmax/normalize, then a fan-in of **7 fp32 `ggml_add` + a weight `ggml_mul`** per MoE layer (`llama-graph.cpp` ~1797-1824), every one of 40 layers. vLLM's fused-MoE (and Marlin grouped) path folds the gather, the router-weight multiply, and the scatter-accumulate into the **GEMM epilogue/prologue** - so this is overhead vLLM essentially does not pay. Est. ~5-8% of the MoE prefill wall, entirely outside GEMM (the `mul_mat_id` is covered; the surrounding argsort/adds/mul are not) and outside GDN. + +Phase 7 challenged the smallest version of this lever: a CUDA-only post-down +weighted-combine fusion that removed the separate router-weight `mul` plus +rank-order add fan-in while preserving md5. It passed `MOE_WEIGHTED_COMBINE` +`7/7`, `MUL_MAT_ID` `806/806`, and canonical MoE/dense md5 gates; Nsight proved +the fused kernel launched (`110` `k_moe_weighted_combine` calls). Serving A/B was +flat (`decode_agg_tps 417.5 disabled -> 417.0 fused`), so the fan-in-only patch +was rejected. The remaining plausible lever is a larger fused-MoE +prologue/epilogue that also removes gather/scatter or moves work into the GEMM +kernel, not another standalone fan-in fusion. 
+ +Phase 8 scopes that remaining lever as profile-gated ragged serving dispatch: +first measure llama.cpp and vLLM at `n=128`, `ptok=128`, `gen=64` and bucket +`mm_ids_helper`, activation quant/gather, grouped MMQ, and scatter/writeback. Do +not implement a fused routed-expert `MUL_MAT_ID` dispatch path unless those rows +are material in live serving and not dominated by GDN or FA. + +### Newly-identified lever 2 - the W4A4 activation-quant pass (a vLLM-asymmetry, not just a kernel-speed gap) +Every NVFP4 GEMM (MMQ today, and the new 0034 FP4-MMA) **quantizes activations to e4m3 (amax/6 + code search) before the matmul** - a distinct, M-proportional kernel. vLLM on **sm_121 falls back to W4A16-Marlin** (the TENSORCORE_GDN_SCOPE confirms this: no tcgen05/cutlass-FP4 on GB10), i.e. **f16 activations, zero activation-quant**. So this pass (~3-6% of prefill) is a structural cost vLLM avoids, and it explains part of why even a peak FP4-MMA GEMM will not fully reach vLLM's prefill. The README's "act-quant FLAT" and "W4A16 rejected" verdicts are **decode/BW-bound findings**; in compute-bound prefill the trade is different and unaudited. **Lever: measure this quant bucket as its own nsys row; consider fusing the activation-quant into the GEMM prologue (cp.async + in-register quant) so it is not a separate global-memory pass.** + +### Flag 3 - FlashAttention prefill (context-dependent, currently under-measured) +The 10-16 full-attention layers' QK^T·softmax·PV is a separate kernel covered by neither lever. It is small at the benchmark's npp=128 but **grows as L²**; at the long contexts the decode-serving work targets it can become a real bucket. The whole prefill ground-truth (232/68) was taken at one ubatch size - **re-profile FA share at the real serving prefill lengths** before assuming it is negligible. 
+ +### Confirmed inside the existing levers (not new) +- The 0042 "11% gating muls" and all the GDN small-matmuls/`solve_tri`/cumsum/layout-conts are **inside the GDN bucket** - the tensor-core GDN kernel subsumes them; they are only "live and uncovered" *today* because patch 0031 is default-off and losing at C=16. +- Shared-expert dense GEMMs, GDN/attention projections = **GEMM lever** (the FP4-MMA 0034 path already routes them). + +## Bottom line +Two prefill levers (GEMM, GDN) are correctly the top-2 and own ~the gap's majority, but they are **not** the whole gap. The op-walk surfaces **MoE router+combine/scatter** and the **W4A4 activation-quant pass** as genuine, currently-untracked prefill contributors on the MoE decision model (~8-14% combined), plus **FA prefill** as a context-dependent risk the npp=128 bench hides. Per the methodology, step 0 is an nsys prefill-only window that explicitly breaks out `argsort/add(combine)`, `quantize_mmq_nvfp4`, and `flash_attn` as separate rows to size these three before funding a kernel. + +Phase63 executed that step-0 discipline after the W4A16 direct-A and MTP +rejections. It stayed profile-first and inference-gated: pre/post canonical md5 +and backend-op gates wrapped same-shape llama.cpp/vLLM prefill profiles at +`npp/PT=512` and `2048`. Result: FA is not a source lever on GB10 right now. +llama.cpp FA was `0.71%` at `npp=512` and `1.18%` at `npp=2048`; the +`npp=2048` cross-engine FA delta was about `1.7 us/tok`. The paged +FlashAttention mask/block-table cleanup remains a correctness/test gap worth +keeping in mind, but Phase63 rejects it as a parity patch. + +Phase64 then attributed the remaining `layout-copy` bucket with default-off +`LLAMA_LAYOUT_TRACE=` in fork commit `fa944bb5f`. The trace showed the +layout bucket is a mix of GDN conv-state materialization, MoE top-k fan-in +gathers, and paged-attention mask/KV reshape/copy paths. 
It did not expose a +single low-conflict projection/layout shortcut; use the Phase64 names before +funding any Phase65 source work. + +Phase65 attributed the activation-quant bucket with default-off +`LLAMA_QUANT_TRACE=` in fork commit `afc2c7030`. The default MoE prefill path +emitted `mmq_dense 4444`, `mmq_moe_dedup_unique 2960`, `mmq_moe_gather 2960`, +and `mmq_moe_flat 1480` trace lines at `npp=512`. The named paths are MoE +gate/up expert quant dedup plus gather, MoE down expert flat quantization, and +shared-expert dense quantization. Do not optimize from counts alone; Phase66 +should time `quantize_mmq_nvfp4` versus `gather_mmq_fp4` with nsys/NVTX first. + +Phase66 ran that timing pass. At MoE `npp=512`, total GPU kernel time was +`7108388986 ns`; `quantize_mmq_nvfp4` was `317205504 ns` (`4.46%`), +`gather_mmq_fp4` was `45374880 ns` (`0.64%`), combined `5.10%`. Reject a +gather/quant shortcut on GB10 for now: the gather is not material and the +combined route is below the `8%` source-funding threshold. + +Phase67 tested the `bf16-proj` conversion half directly. Fork commit +`ea0875d14` adds default-off `LLAMA_BF16_CUBLAS_F32_OUT=1`, letting BF16 cuBLAS +write F32 output instead of writing BF16 then launching a BF16-to-F32 conversion. +It passed MoE/dense md5 and `MUL_MAT 1146/1146`; MoE prefill improved +`2347.41 -> 2402.34` at `npp=512` and `2440.18 -> 2456.54` at `npp=2048`. +Keep it default-off until dense and serving A/B decide whether it is worth a +default policy change. + +Phase68 ran that dense and serving A/B without changing source. Dense prefill +was positive but tiny (`973.13 -> 975.52` at `npp=512`, `1019.88 -> 1021.39` at +`npp=2048`). A small MoE serving window at `N=128`, prompt `128`, generation +`128` also moved in the right direction: aggregate `409.8 -> 415.0`, +decode aggregate `615.3 -> 627.2`, mean TTFT `8574.7 -> 8085.9 ms`, wall +`39.978 -> 39.480 s`. 
Decision: keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off +but worth carrying as an opt-in shortcut candidate. Do not default it on until +the fork commit is mirrored into the LocalAI patch series and a broader serving +snapshot passes pre/post md5 and op gates. + +Phase70 ran that broader serving snapshot. Gates stayed green, but the broader +window rejected default-on: at `N=8`, opt-in aggregate and decode fell to +`0.8896x` and `0.8998x` of default, and mean TTFT worsened to `1.1247x`. +At `N=32` and `N=128`, opt-in slightly widened the vLLM decode gap +(`0.6864x` vs `0.6882x`, and `0.6839x` vs `0.6921x`). Keep +`LLAMA_BF16_CUBLAS_F32_OUT=1` default-off only and move to another lever. + +Phase71 revalidated the current shipped GDN tensor-core default before adding +more GDN source work. Artifact: +`/home/mudler/bench/phase71_gdn_tc_revalidation/20260701_153425`. Canonical +MoE/dense md5 gates matched for default, sequential-disabled, serial-chunked, +and forced M5 modes; `GATED_DELTA_NET` passed `46/46` for each mode, and +default passed `MUL_MAT 1146/1146` plus `MUL_MAT_ID 806/806`. Current default +beat sequential-disabled by `+5.24%`/`+2.61%` S_PP at `npp=512/2048`, beat +serial-chunked by `+29.43%`/`+42.54%`, and forced `GDN_TC=4 GDN_CHUNK_MIN=64` +was within noise of default (`+0.42%`/`-0.10%`). Decision: keep shipped M5 and +do not reopen smaller GDN C32/QS/global-Ai32/kernel-reorder work on GB10. 
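The Phase66 reject decision recorded earlier in this section is a simple share-versus-threshold check, which can be re-derived from the quoted nanosecond totals. A minimal sketch, assuming the `8%` source-funding threshold is compared against the combined share; `share_pct` and `meets_funding_threshold` are hypothetical names, not part of any profiling tool.

```python
# Kernel times (ns) copied from the Phase66 paragraph (MoE, npp=512).
TOTAL_GPU_NS = 7_108_388_986
KERNEL_NS = {
    "quantize_mmq_nvfp4": 317_205_504,
    "gather_mmq_fp4": 45_374_880,
}
FUNDING_THRESHOLD_PCT = 8.0  # threshold quoted in the text; comparison rule is assumed


def share_pct(kernel_ns: int, total_ns: int = TOTAL_GPU_NS) -> float:
    """Return a kernel's share of total GPU kernel time, in percent."""
    return 100.0 * kernel_ns / total_ns


def meets_funding_threshold(kernels: dict, threshold_pct: float = FUNDING_THRESHOLD_PCT) -> bool:
    """True if the combined share of the given kernels reaches the threshold."""
    combined = sum(share_pct(ns) for ns in kernels.values())
    return combined >= threshold_pct


if __name__ == "__main__":
    for name, ns in KERNEL_NS.items():
        print(f"{name:20s} {share_pct(ns):5.2f}%")
    print("fund source work:", meets_funding_threshold(KERNEL_NS))
```

The same check applies to any other bucket pulled from an nsys kernel summary, provided the total is the GPU kernel time of the same profiling window.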
+ +Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch`, and the graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/models/{qwen35moe.cpp,delta-net-base.cpp}` + `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/llama-graph.cpp` (build_moe_ffn ~1500-1834, build_attn ~2136-2189). + +## 2. Decode-serving compute hypotheses (ranked) + +RANKED DECODE-SERVING GPU-COMPUTE HYPOTHESES (paged llama.cpp vs vLLM, MoE Qwen3.6-35B-A3B-NVFP4 on GB10) + +Grounding facts that constrain the ranking: +- The gap is empirically MoE-specific: dense static is parity-to-ahead, MoE static is 89-93% of vLLM, but MoE *burst* serving is ~66% (n=128: paged 4.53 vs vLLM 6.87 tok/s/seq). So whatever degrades is on a path that hurts MoE far more than dense. +- It is GPU-compute-bound, NOT host/reuse-bound: padded-shape lever rejected, baseline reuse 0% statistically equal to S1+S3 reuse 72% on aggregate tok/s, hostproc only 4-8% of wall. So the host loop (0040/0041/S2) is closed; the residual lives in per-step kernel time. +- The decode KERNELS tie vLLM at a fixed WIDE lockstep shape (static batched-bench). The serving loss is therefore about how a RAGGED/NARROW/fluctuating live batch (varying decoder count D, ragged KV lengths, ragged token->expert assignment) feeds those same kernels, vs how gracefully vLLM's kernels degrade at the same concurrency. This is exactly the Phase-0 "re-scope" branch in DECODE_SERVING_SCOPE.md ("serving runs a worse effective batch shape into the kernels"). 
+ +Decisive measurement that arbitrates all of these (run first): nsys a clean steady-state serving window (serve_bench staggered ~128 clients through llama-server, LLAMA_KV_PAGED=1 + LLAMA_MOE_FORCE_GRAPHS=1, -fa on -ngl 99) AND the same nsys on vLLM at the same concurrency (both-engine rule). Decompose per-step GPU-kernel-time into buckets {MoE-expert-GEMM (MUL_MAT_ID), full-attn FA, GDN recurrence, bf16 projections, activation-quant, sampling/logits} and compare serving-narrow vs static-wide vs vLLM. The bucket whose per-useful-token time grows MOST going static->serving (relative to vLLM's same bucket) is the gap. Avoid the known window artifact; measure a steady span. Reference doc: backend/cpp/llama-cpp-localai-paged/docs/DECODE_SERVING_SCOPE.md. + +--- + +H1 (TOP) - MoE expert GEMM collapses to per-expert GEMV at ragged/narrow serving width, plus risk of the host-sync sorted per-expert fallback. +- Mechanism: top-8 of 256 experts. Tokens/expert ~= D*8/256. Static npl128 -> ~4 tok/expert; serving burst-tail D->8 -> ~0.25 tok/expert, so most active experts get 0-1 tokens. The grouped MMQ id-GEMM's per-expert M collapses to 1 -> pure GEMV that reads the full FP4 expert weight (memory-bound, weight bytes unamortized) and re-loads per-expert scales. This is the "256 tiny-expert weight bandwidth" README s5 names as the residual. Separately, patch 0025 only keeps CUDA graphs on for the should_use_mmq grouped path; any serving step where MUL_MAT_ID ne[2]>8 (mmvq_mmid_max) AND should_use_mmq returns false falls to the per-expert host-loop fallback that cudaStreamSynchronizes per expert (the [TAG_MUL_MAT_ID_CUDA_GRAPHS] disable) - catastrophic, and serving's varying per-step shapes can trip it unevenly. +- Why slower than vLLM: vLLM runs ONE fused MoE GEMM with sorted_token_ids/expert_ids computed on-GPU (fused_moe / Marlin-MoE), a single persistent launch that keeps the grouped GEMM dense and amortizes launch + scale loads; it degrades gracefully at small M. 
llama issues a grouped MMQ that, at ragged narrow width, is many near-empty expert tiles each re-reading scales, and can drop to a host-synced loop. +- nsys metric to confirm: (a) MUL_MAT_ID kernel-time as % of per-step GPU wall, static-wide vs serving-narrow vs vLLM; (b) the tokens-per-expert (M) distribution per step - look for M->1 GEMV collapse and achieved FLOP/s vs M; (c) count cudaStreamSynchronize / per-expert cudaMemcpy *between* MUL_MAT_ID launches per step (host-sync fallback firing); (d) vLLM single fused-MoE kernel duration at same concurrency. +- Candidate fix: a fused grouped-NVFP4 MoE decode GEMM with on-GPU token sorting (device-computed sorted token offsets + expert ids) so all active experts share one persistent launch and scales amortize - i.e. port vLLM's fused-MoE dispatch shape onto the FP4-MMA MMQ id-path; as a floor, extend 0025 to GUARANTEE the grouped should_use_mmq path for every serving shape so the host-sync loop never fires. Bit-exact-gateable (graph-replay/grouped path re-issues identical kernels). + +H2 - Paged full-attention decode kernel: ragged-KV load imbalance, no tensor cores, indirect block-table reads. +- Mechanism: the full-attn layers (10 on the MoE model, 16 on dense) run the paged block-table FA decode, pinned by the 0010/0011 dispatch guard to vec/tile and NEVER the mma/wmma tensor-core FA (a present block table routes only to vec/tile; tile loads half2, F16 cache only). Static bench: all sequences one KV length -> balanced. Serving: KV lengths are ragged (each request at a different position), so per-sequence attention work is imbalanced across the grid and the step waits on the longest-context tail; there is no KV-dimension split. Every K/V access is an indirect physical-cell load via the block table (gather-like), less coalesced than a contiguous read. 
+- Why slower than vLLM: vLLM PagedAttention v2 uses a split-K / partitioned reduction designed for ragged long contexts (flash-decoding style) that balances work and lifts occupancy on the tail, and keeps the contiguous-within-block layout. llama's vec/tile paged read has no KV split and leaves tensor cores idle on the full-attn layers. +- nsys metric to confirm: FA-decode (vec/tile) kernel duration vs KV-length VARIANCE across the live batch (does it scale with max-KV/tail rather than mean-KV?); tensor-core-active-% during FA layers (expect ~0); achieved memory-BW of the FA kernel under ragged KV; vLLM paged-attn kernel time + util at same concurrency. +- Candidate fix: a KV-split (flash-decoding / split-K) paged FA decode so long sequences are partitioned across blocks for balance + occupancy; longer term a tensor-core paged FA for the full-attn layers (mma.sync down-translation, same approach as the GDN tensor-core scope). At minimum a per-sequence work-balanced launch. + +H3 - GDN/SSM recurrence decode kernel under-occupied at narrow/variable serving width. +- Mechanism: patch 0022 tuned the recurrence (NUM_WARPS=16, COLS_PER_WARP=8, grid.z = S_v/(NW*CPW)) for the WIDE B=128 lockstep batch; its DRAM-latency coverage / MLP needs ~128 independent sequence-states in flight, and it is bandwidth-bound (re-streams the 128x128 f32 state per sequence per step at 84.6% of peak BW *at B=128*). In serving D fluctuates and collapses in the burst tail; at low D the kernel is grid-starved (few independent states), achieved-BW falls below the tuned point and per-token state traffic rises - the same grid-starvation failure mode the chunked-prefill kernel hit at low n_seqs. Plus the serial-SSM host loop (README s2d/s5 structural floor) is amortized over fewer tokens. +- Why slower than vLLM: vLLM's fused_recurrent_gated_delta_rule + its scheduler keep the recurrence fed at small batch; llama's fixed B=128-tuned launch params under-saturate when D is small. 
+- nsys metric to confirm: gated_delta_net kernel achieved-BW (GB/s) and occupancy as a function of live D in serving vs the static 84.6%@B128 baseline; recurrence kernel time/token vs D; grid occupancy at the burst tail. +- Candidate fix: width-adaptive recurrence launch params - auto-select NUM_WARPS/COLS_PER_WARP (already env GDN_NW/GDN_CPW) by live D so the grid stays saturated at narrow width; bit-exact-safe (0022's column assignment is provably independent of visit order). Longer term the chunked/register-resident state scan cuts state traffic. + +H4 - Continuous-batch ragged-shape overhead: every kernel sized to the batch union/max; bf16 projections become GEMV at narrow D (umbrella + the "bf16-projection bandwidth" half of README's stated residual). +- Mechanism: ragged positions/lengths/expert-assignments mean each per-step kernel is launched for the max/union over the live batch, so useful-token efficiency < lockstep. This is the shared root of H1-H3 but is worth isolating because it also covers the q/k/v/gate/o projections (deliberately kept bf16, per README s5) which at narrow D become GEMV-like memory-bound weight reads - the "bf16-projection bandwidth" residual vLLM also pays but amortizes over a steadier batch. +- Why slower than vLLM: vLLM's scheduler holds a steadier/denser decode batch (padded bucketed decode + chunked-prefill interleave) so its projection/attn GEMMs run at higher effective M; llama's batch width fluctuates more. +- nsys metric to confirm: GPU-busy% in a steady serving window vs static (expect lower in serving) and (sum useful-token FLOPs)/(kernel-time) serving vs static; bf16 projection GEMM achieved FLOP/s vs M (GEMV collapse at small D). +- Candidate fix: largely subsumed by fixing H1-H3 at the kernel level. 
Note: holding D high via admission was effectively probed by the padded-shape lever and REJECTED for throughput (the completion-driven shrink is itself a per-survivor win); so do NOT re-pursue width-padding - the payoff is in the per-kernel fixes. + +H5 - Per-step sampling + logits handling across D independent sequences (low, cheap to exclude). +- Mechanism: each live sequence has its own sampler chain run after logits land; at narrow D this fixed per-step cost (+ any D2H logits copy) is amortized over fewer tokens. vLLM batches sampling on-GPU across the whole decode batch. +- nsys metric to confirm: sampling/logits-copy time as % of per-step wall serving vs static; D2H logits cudaMemcpy size+time; count of per-sequence sampler launches. +- Candidate fix: single on-GPU batched sampler over [D, vocab], no per-sequence D2H. Likely small on the greedy/temp0 path; profile mainly to exclude. + +H6 - NVFP4 activation (re)quantize fixed per-step cost at small M (lowest; likely a wash vs vLLM, no bit-exact lever). +- Mechanism: W4A4 quantizes activations to FP4 each step; 0023 de-dups the per-expert re-quant but the base quantize + scale compute is a per-row pass amortized over fewer tokens at narrow D. README lever-3 records the act-quant tax is removable only by W4A16 (rejected) or a kernel rewrite. +- nsys metric to confirm: quantize-kernel time as % of step vs M, serving vs static, vs vLLM (which also FP4-quantizes activations -> expected wash). +- Candidate fix: none bit-exact; include only so the profile can rule it out. + +--- + +Priority for the profiling workflow: profile the MoE model (the gap model) first; bucket per-step GPU-kernel-time and A/B serving-narrow vs static-wide vs vLLM. Expect the ranking to be dominated by H1 (MoE expert GEMM at collapsed M + any host-sync fallback) and H2 (ragged-KV FA on vec/tile), with H3 (recurrence occupancy at narrow D) third; H4 is the shared root, H5/H6 are exclusion checks. 
The single most decisive metric is the per-step time share of MUL_MAT_ID vs FA vs gated_delta_net in serving-narrow relative to vLLM's equivalents at the same concurrency. + +## 3. vLLM pipeline deep-dive + +--- + +# vLLM vs `llama-cpp-localai-paged`: kernel + scheduler deep-dive (Qwen3-style hybrid SSM+attn MoE, NVFP4, consumer Blackwell sm_121) + +## The single most important grounding fact (changes how every GEMM row reads) + +On **consumer Blackwell (sm_120/sm_121: DGX Spark/GB10, RTX 5090, RTX PRO 6000)** the native CUTLASS NVFP4 grouped-GEMM path is **broken** (TMA-WS grouped-GEMM init failure, CUTLASS #3096) and there is **no `tcgen05`/TMEM**. So vLLM on *this exact hardware* does **not** run a native FP4-MMA grouped GEMM - it **falls back to the Marlin BF16 kernel that dequantizes FP4->BF16 in-register**, capped at bf16-tensor-core peak (~half FP4 peak). Native FP4 (W4A4/tcgen05) and the best FlashInfer/TRT-LLM kernels are gated to **data-center Blackwell sm_100a**. This means several "vLLM advantages" assumed for B200 do **not** hold on GB10, and our native FP4-MMA path (the just-verified 103 TFLOP/s = 57.7% of FP4 peak GEMM) is potentially *ahead of* vLLM's Marlin-bf16 fallback on this part - the opposite of the usual framing. + +## Comparison table + +| # | Component | vLLM (this model class, sm_121 reality) | Ours (`llama-cpp-localai-paged`) | Regime | Verdict / gap | +|---|---|---|---|---|---| +| 1 | **Dense weight GEMM - decode** (M≤128, BW-bound) | Marlin FP4→bf16 in-register dequant (W4A4 broken→fallback); reads 4-bit weights | Native FP4-MMA MMQ (FP4 wt × Q8_1 int8 act), M≤128 tile | decode | **Parity** - both at FP4 weight-BW floor. Ours ~96-97% of vLLM, ahead at low concurrency | +| 2 | **Dense weight GEMM - prefill** (large-M, compute-bound) | Marlin grouped/dense, async cp.async pipeline, big tiles, ~bf16 peak | MMQ small-tile, 1 CTA/SM.
**New native FP4-MMA large-M kernel @103 TFLOP/s being integrated** (beats cuBLAS-bf16, bit-exact) | prefill | dequant→bf16-cuBLAS lever (0033) was **rejected** (MMQ beat it 29-49%); the native FP4-MMA kernel is the real fix and could **beat** vLLM's bf16-Marlin here | +| 3 | **MoE expert GEMM - decode** | Marlin FP4→bf16 grouped, indirect addressing | Grouped MMQ (`mul_mat_id`), sorted expert layout, native FP4-MMA | decode | **Parity** - both BW-floor. Recurrence/GEMM are *our wins*; residual = bf16-projection BW + host loop | +| 4 | **MoE expert GEMM - prefill** | Marlin grouped GEMM, fused, big tiles | MMQ small-tile grouped (1 CTA/SM) | prefill | **GAP (#1 prefill bottleneck per docs).** Native FP4-MMA grouped kernel is the planned fix; today MMQ is small-tile-bound | +| 5 | **MoE routing / gather / scatter / epilogue** | Triton persistent fused-MoE: indirect token addressing, **fused gate+up + SwiGLU epilogue**, once-quantize, scatter+weighted-combine fused | Sorted per-expert layout; **NVFP4 act-quant de-dup (0023)** mirrors once-quantize; SwiGLU is **separate ops** (no fused epilogue) | both | Partial parity. **No fused gate+up+SwiGLU epilogue** (extra IO passes); fan-in-only weighted-combine fusion was Phase 7 tested-flat | +| 6 | **GDN / linear-attn - decode** | FLA Triton `fused_recurrent_gated_delta_rule` + `fused_sigmoid_gating_delta_rule_update` (sequential, per-step state) | Fused sequential recurrence: in-place state write-back (0018), fused state gather (0019), o_proj MMVQ→MMQ (0020), occupancy retune (0022), conv-tap gather fusion (0028) | decode | **Parity-to-win** - recurrence runs at **102.6% of vLLM bandwidth**, 84.6% of GB10 peak BW. 
Our strongest area | +| 7 | **GDN / linear-attn - prefill** | FLA `chunk_gated_delta_rule`: intra-chunk products on **tensor cores** (UT-transform), ~2.5× cheaper | Tuned **sequential** scan (default); chunked parallel-scan (0031) is **opt-in + ~22% slower** (serial f32 reductions, no TC, C=16 forced by 99KB smem) | prefill | **GAP (#2 prefill bottleneck).** No tensor-core chunked GDN. Scoped (TENSORCORE_GDN_SCOPE, mma.sync only); **Gram products de-risked at 6.7-9.3× over sequential**, kernel not yet built | +| 8 | **Causal conv1d (short conv)** | FLA `causal_conv1d_fn`/`_update` Triton | `ggml_ssm_conv_update_inplace` (0021): 5-op chain → 1 op, in-place ring | both | Parity | +| 9 | **Full-attention - decode** (16 of 64 layers) | FlashInfer / TRT-LLM paged decode (tensor-core, cascade wrapper, FP8-KV capable) | llama.cpp FA `ggml_flash_attn_ext` with **block-table paged read** (src[5]); routed to **vec/tile** kernels | decode | Parity at decode width (vec/tile is right for small batch) | +| 10 | **Full-attention - prefill** (large-M) | FlashInfer/TRT-LLM tensor-core prefill FA | **Forced to vec/tile** (block-table only grafted into vec/tile; mma/wmma FA ignores it, dispatch-guarded off) | prefill | **GAP (secondary).** Paged prefill full-attn gets **no tensor-core FA**. 
Docs rank it below MoE-GEMM/GDN, so not the dominant prefill term | +| 11 | **Paged KV manager (full-attn)** | vLLM block manager + hybrid KV cache manager (co-sizes attn/linear blocks to equal physical bytes, anti-fragmentation) + auto prefix caching | `PagedKVManager` (FreeBlockQueue/BlockPool/COW), cross-request prefix sharing, burst-reclaim (0024) | both | **Parity** on the attn side; we lack vLLM's *unified* hybrid co-sizing (we manage SSM state separately - see #12) | +| 12 | **Hybrid SSM-state cache mgmt** | Unified hybrid manager pages linear-attn state alongside attn KV | SSM recurrent + conv state in fixed per-seq slots, updated **in-place** (not paged; O(1)/seq) | both | Different approach, not a perf gap (recurrent state doesn't need paging); we lack unified fragmentation accounting | +| 13 | **Sampler** | **GPU FlashInfer sorting-free sampler** (Dual-Pivot rejection sampling, single kernel, no logits sort, ~0 overhead); RejectionSampler for spec-decode | llama.cpp **host-side** sampler chain (CPU partial-sort for top-k/p) | serving | **GAP - NO EQUIVALENT.** Host sampler + D2H logits adds to the per-step host loop at high concurrency (greedy md5 bench hides it) | +| 14 | **Scheduler / continuous batching / chunked prefill** | V1: mixed prefill+decode step, **chunked prefill default-on**, decode-prioritized `max_num_batched_tokens` budget, auto-chunk | `update_slots()` unified step, **decode-first dynamic budget** (0016, `max(n_ubatch,T−D)`), prefill budget (0013), prefix-share (0008) | serving | **Parity** - we match the chunked-prefill + decode-first token-budget design | +| 15 | **CUDA graphs - decode** | **FULL cudagraph**: padded/bucketed decode shapes → 1 persistent captured graph per bucket → steady decode = single `cudaGraphLaunch`, zero host rebuild | S1+S3 (0040/0041) graph **reuse** keyed on bucketed block-table dims + decode-shape-stable scheduling → serving reuse 0%→**72.2%** | serving | **Partial.** We reuse, not full-capture. 
**Padded/fixed-slot decode (→~100% like vLLM) was built + GPU-tested + REJECTED** - serving decode here is GPU-compute-bound, so dummy-row compute > reuse recovered | +| 16 | **CUDA graphs - prefill** | PIECEWISE cudagraph (default FULL_AND_PIECEWISE) | ggml graph rebuild per prefill step (paged data-ptr churn) | prefill | Gap, low value (prefill is compute-bound; launch overhead amortized over large M) | +| 17 | **Speculative decoding / MTP** | **MTP head + EAGLE-style spec-decode** supported for this model class (Qwen3-Next ships an MTP module) | Opt-in `draft-mtp` path exists and passes rollback safety, but current serving is rejected on GB10 | decode | **GAP - IMPLEMENTED BUT NOT A WIN.** Phase 15/19/62 showed high acceptance with severe serving regression from target-verify/output-row graph cost. Do not enable MTP or tune `n_max` blindly. | +| 18 | **KV-cache dtype** | FP8 KV cache + FP8 attention (halves KV BW) | F16 paged KV | both | Minor gap; partly offset by our overall 1.5-3× lower memory (NVFP4 weights). FP8-KV would cut KV BW further | + +## Gaps where we have NO equivalent (ranked by value) + +1. **Speculative decoding via the MTP head (#17).** Qwen3-Next/3.6 ships a Multi-Token-Prediction module; vLLM exploits it for spec-decode. Phase 9 proved the current fork is no longer at "nothing": Qwen3.5/3.6 `draft-mtp` code exists, the DGX MoE GGUF contains `nextn` tensors, and a short opt-in smoke passes after disabling backend draft sampling for MTP. Phase 14 passed rollback safety, but Phase 15, Phase 19, and Phase 62 rejected current serving MTP on GB10 because high acceptance did not overcome target-verify/output-row graph cost. Do not enable MTP by default. + +2. **Tensor-core chunked GDN prefill (#7).** vLLM's FLA `chunk_gated_delta_rule` pushes intra-chunk Gram products through tensor cores (~2.5× cheaper prefill). Our 0031 chunked kernel is opt-in and 22% *slower* (serial f32 reductions). 
Scoped (mma.sync-only on sm_121, no wgmma/tcgen05), Gram products de-risked at 6.7-9.3×, kernel not built. One of the two named prefill bottlenecks. + +3. **Large-M native FP4-MMA grouped MoE GEMM (#4).** The #1 prefill bottleneck. vLLM uses Marlin-bf16 grouped (capped at bf16 peak on sm_121); our MMQ is small-tile/1-CTA-bound. The new native FP4-MMA GEMM (103 TFLOP/s, beats cuBLAS-bf16) is the integration that closes this - and because vLLM is bf16-Marlin here, a working native FP4 grouped kernel could *exceed* vLLM on this exact hardware. + +4. **GPU fused sorting-free sampler (#13).** vLLM samples on-device (FlashInfer Dual-Pivot rejection, no logits sort); llama.cpp samples on host. Adds to the serving host loop at 128-way concurrency for top-k/p workloads. No GPU-sampler equivalent in the series. + +5. **Fused MoE SwiGLU epilogue (#5).** vLLM fuses gate+up+SwiGLU into the grouped-GEMM epilogue (fewer IO passes). We have the act-quant de-dup (0023) but run SwiGLU as separate ops. Prefill-relevant, decode-minor. + +6. **Tensor-core FA for the paged prefill full-attn path (#10).** Paged forces vec/tile (mma FA ignores the block table). Secondary - docs rank it below gaps 2 and 3 of this list (table rows #7 and #4) in the prefill budget. + +7. **FP8 KV cache / FP8 attention (#18).** Minor; partly offset by our NVFP4 memory lead. + +## Where we are at or ahead of vLLM (not gaps) + +- **GDN decode recurrence (#6):** 102.6% of vLLM bandwidth - our fusion series (0018-0022, 0028) is the strongest area. +- **Decode weight GEMMs dense+MoE (#1, #3):** at the FP4 weight-BW floor = parity; dense ahead at low concurrency. The residual MoE serving gap (paged at ~66% of vLLM at n=128 burst) is a **GPU-compute** gap (vLLM's MoE decode kernel+scheduler ~1.3× on aggregate), **not** a host-loop gap that a graph-reuse/padding lever can close (padded-shape lever proved this, rejected 2026-06-28). +- **Memory:** 1.5-3× lower than vLLM (NVFP4-resident, no persistent bf16 dequant copies).
+- **Scheduler design (#14):** chunked-prefill + decode-first budget matches vLLM's V1 model. + +## Net assessment + +Our **decode kernels are at parity-to-ahead** (GDN recurrence, both FP4 GEMMs at BW floor) - confirmed in the kernel regime. The two real, *named-in-docs* **prefill** gaps (MoE grouped GEMM #4, tensor-core chunked GDN #7) are being actively closed with the native FP4-MMA GEMM + the de-risked tensor-core Gram products; on consumer Blackwell specifically these can match-or-beat vLLM because vLLM is itself on a **bf16-Marlin fallback**, not native FP4. The remaining structural gap with no equivalent in the series is the **GPU fused sampler** (serving host-loop, secondary). MTP is no longer absent, but current GB10 serving MTP is rejected until a target-verify/output-row graph-cost design exists. The serving-decode residual is GPU-compute-bound (not host/graph-reuse), so vLLM's edge there is its faster MoE decode kernel + scheduler, not something a host-side lever recovers. + +--- + +Files read (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/README.md`, `docs/DECODE_SERVING_SCOPE.md`, `docs/PREFILL_GEMM_SCOPE.md`, `docs/PREFILL_GEMM_RESULTS.md`, `docs/TENSORCORE_GDN_SCOPE.md` (same dir). 
+ +Sources: +- [vLLM Now Supports Qwen3-Next (FLA Triton kernels, hybrid KV manager, MTP)](https://blog.vllm.ai/2025/09/11/qwen3-next.html) +- [CUTLASS #3096 - SM120 NVFP4 MoE grouped GEMM broken, FlashInfer/Marlin fallback](https://github.com/NVIDIA/cutlass/issues/3096) +- [vLLM Quantization Kernels (NVFP4 W4A16/W4A4, Marlin, Machete)](https://deepwiki.com/bytedance-iaas/vllm/11.4-quantization-kernels) +- [SM120 NVFP4 MoE perf report - Marlin bf16 fallback on consumer Blackwell](https://discuss.vllm.ai/t/sm120-rtx-pro-6000-nvfp4-moe-performance-report-qwen3-5-397b/2536) +- [vLLM Attention Backends - FlashInfer/TRT-LLM default on Blackwell](https://docs.vllm.ai/en/latest/design/attention_backends/) +- [vLLM FLA fused_recurrent_gated_delta_rule](https://docs.vllm.ai/en/latest/api/vllm/model_executor/layers/fla/ops/fused_recurrent/) +- [vLLM Fused MoE Kernel Features](https://docs.vllm.ai/en/latest/design/moe_kernel_features/) +- [vLLM scheduling - chunked prefill, decode-first budget, FULL_AND_PIECEWISE cudagraph](https://docs.vllm.ai/en/stable/configuration/optimization/) +- [FlashInfer sorting-free GPU sampling (Dual-Pivot rejection)](https://flashinfer.ai/2025/03/10/sampling.html) +- [vLLM #11394 - FlashInfer sampling kernel in V1](https://github.com/vllm-project/vllm/pull/11394) +- [vLLM #42960 - batch-invariant GDN_ATTN for Qwen3-Next/Qwen3.6](https://github.com/vllm-project/vllm/issues/42960) + +## 4. Novel levers + +I've grounded myself in the four scope docs, the README patch table + benchmarks (final_benchmark.csv), the methodology doc, and the 0034 FP4-MMA / 0042 fused-residual patch headers. Verified state: prefill is the biggest gap (dense ~920 vs vLLM ~2000 t/s ≈ 44-46%; MoE ~2177 vs ~5300-6223 ≈ 35-41%); decode kernel at parity; serving decode ~65% and measured GPU-compute-bound (host/graph-reuse + padded-shape proved neutral-or-worse). 
Already-explored/rejected: dequant→bf16 cuBLAS (0033, rejected), bf16-tau (dropped), NVFP4 projections (KL-fail), W4A16-Marlin (rejected), graph coverage (flat), act-quant fusion on decode (flat), padded-shape decode (rejected). Below are levers that go beyond those. + +--- + +# Candidate-lever brainstorm: closing the vLLM gap (paged Qwen3.6 NVFP4, GB10 sm_121a) + +Organized by where the verified gap actually is. For each: mechanism / expected gain / gate (bit-exact vs KL) / risk / effort-reward. "Profile-gated" = run Phase-0 nsys before building, per the methodology. + +## A. PREFILL (the largest gap, 35-46% of vLLM) — highest reward bucket + +### A1. Graph-safe ragged grouped FP4-MMA MoE kernel (remove the per-expert host-sync loop) +- **Mechanism:** 0034 lands the native FP4-MMA dense kernel but routes MoE prefill through the *per-expert host-sync loop* (a `cudaStreamSynchronize` per expert per layer — e.g. dozens-to-hundreds of syncs/layer). Replace it with ONE ragged/grouped FP4-MMA launch over the existing `expert_bounds`/`ids_dst` sorted layout (variable M per expert, single kernel). This is the follow-up 0034 itself flags. +- **Gain:** HIGH. MoE expert GEMM is named the #1 prefill cost; this both removes the serial host syncs and unlocks kernel overlap + graph capture. The single biggest remaining prefill lever after 0034. +- **Gate:** bit-exact by construction (same FP4 math, same K-order as the per-expert path) → greedy md5. +- **Risk:** medium-high (ragged tiling + boundary handling, graph-safety). +- **Effort/reward: HIGH effort / HIGH reward.** The flagged 0034 follow-up; rank #1 for prefill. + +### A2. Multi-stream expert dispatch (cheap stepping-stone to A1) +- **Mechanism:** before writing the full ragged kernel, run the independent per-expert FP4-MMA GEMMs on N CUDA streams instead of the serial host-sync loop, overlapping their LPDDR5x weight reads + tensor-core work. 
+- **Gain:** medium (partial overlap; recovers some of the serial-sync stall without the kernel rewrite). +- **Gate:** bit-exact (same kernel, reordered launches) → greedy md5. +- **Risk:** medium (stream/event mgmt, not graph-safe — prefill isn't graph-replayed so OK). +- **Effort/reward: LOW-MED effort / MED reward.** Bank this before A1. + +### A3. Fuse MoE router → token-gather/scatter → GEMM (permutation fusion) +- **Mechanism:** vLLM/SGLang fuse routing→permute→grouped-GEMM→unpermute. Here the activation gather (into the sorted-expert layout) and the scatter-back are separate memory passes. Read activations through `ids_dst` in the GEMM prologue and write through the inverse permutation in the epilogue → removes two full activation memory passes per MoE layer. +- **Gain:** medium for prefill (large activation tensor); smaller for decode (0019/0028 already fuse the decode gather). +- **Gate:** bit-exact (index indirection only, same values) → greedy md5. +- **Risk:** medium (epilogue indexing correctness). +- **Effort/reward: MED / MED.** Pairs naturally with A1's kernel. + +### A4. Fused MoE FFN (up_proj → SiLU → down_proj, intermediate register/shared-resident) +- **Mechanism:** keep the per-expert intermediate activation in shared/registers across up→act→down instead of round-tripping it to global. For large-M prefill the intermediate is big → a real BW save; also helps decode. +- **Gain:** medium-high (removes one full intermediate read+write per expert per layer). +- **Gate:** bit-exact if SiLU + accumulation order preserved → greedy md5 (else KL-gate). +- **Risk:** HIGH (fused FP4 FFN kernel is complex; register pressure on sm_121a). +- **Effort/reward: HIGH / MED-HIGH.** Strong but expensive; sequence after A1. 
+- **Phase 7 shortcut rejected:** fusing only SWIGLU into the NVFP4 + down-input quantization while reusing grouped-MMQ passed the focused op gate + (`MOE_SWIGLU_DOWN 7/7`) but changed paged-MoE md5 under opt-in + (`07db32c2...` vs canonical `8cb0ce23...`) and was flat in serving A/B + (`decode_agg_tps 657.1 → 667.4`, `decode_perseq_tps 3.92 → 3.88`). + Do not retry that partial fusion without a KL gate and a stronger profile + bucket. A real A4 remains a different, larger register/shared-resident FFN + kernel. + +### A5. Activation-quant fusion into the 0042 residual/RMSNorm epilogue (prefill) +- **Mechanism:** the README's "act-quant fusion FLAT" verdict was *decode-only*. For prefill the W4A4 activation-quantize pass is a bigger tensor. 0042 already fuses residual-add+RMSNorm+mul; extend its epilogue to emit the FP4-quantized activation the next GEMM consumes, removing a dedicated act-quant read+write. +- **Gain:** low-medium for prefill. +- **Gate:** bit-exact (same `quantize_mmq_nvfp4` math, just fused) → greedy md5. +- **Risk:** medium (epilogue + the FP4 codepath coupling). +- **Effort/reward: MED / LOW-MED.** Cheap-ish add-on once 0034/A1 are in. + +### A6. Stream-K / split-K for the FP4 prefill GEMM (SM occupancy on few-SM GB10) +- **Mechanism:** GB10 has relatively few SMs. For layers whose output grid (⌈M/128⌉×⌈N/128⌉) is smaller than the SM count, SMs idle. Stream-K splits the K dimension across CTAs with a reduction, keeping all SMs busy. +- **Gain:** medium for small-output-grid layers (profile-gated — only if 0034's grid under-fills the GPU). +- **Gate:** bit-exact if the f32-accumulate reduction order is fixed/deterministic; otherwise KL-gate. +- **Risk:** medium (reduction correctness, workspace). +- **Effort/reward: MED / MED.** Complements 0034; profile first. + +### A7. 
Prefill CUDA-graph capture (follow-on to A1) +- **Mechanism:** with fixed prefill chunk size (0013/0016 budgets already exist) and A1 removing the host-sync MoE loop, the whole prefill chunk becomes graph-capturable. +- **Gain:** LOW marginal — prefill kernels are large so launch overhead is amortized; the value is mostly *enabling* it (which A1 already does). Record as low-reward, not a standalone lever. +- **Gate:** bit-exact. +- **Effort/reward: LOW / LOW.** Note, don't prioritize. + +## B. DECODE-SERVING (~65% of vLLM aggregate, measured GPU-compute-bound) + +### B1. Speculative decoding, greedy = bit-exact (SSM-state rollback is the crux) ⭐ novel +- **Mechanism:** draft γ tokens (small draft model, or prompt-lookup/n-gram for zero extra weights), verify in one target forward. At **temp=0 the accepted tokens are argmax-identical to non-spec → the greedy md5 gate PASSES by construction** (lossless). This is the rare throughput-multiplier that's bit-exact-compatible. Especially powerful at low concurrency where paged is farthest below vLLM (n=8 burst: paged 28 vs vLLM 45) and the GPU is underutilized. +- **The non-obvious crux:** hybrid-SSM rollback. KV rollback under paged is easy (truncate blocks). But the gated-DeltaNet recurrent state is updated **in-place** (patch 0018), so a rejected draft requires restoring the 128×128 f32 state per layer to the last accepted position — snapshot-before-speculate (memory+BW cost) or recompute. This SSM-state checkpoint/restore is the real engineering risk and is why naive llama.cpp spec-decode plumbing won't transfer. +- **Gain:** HIGH (2-3x at favorable acceptance/low concurrency). +- **Gate:** **bit-exact for greedy** (md5 holds); distribution-preserving (KL-gate) for temp>0. +- **Risk:** HIGH (SSM snapshot/rollback, draft integration with paged KV + recurrent state, acceptance tuning). 
+- **Effort/reward: HIGH / HIGH.** Biggest novel decode lever; start with zero-draft prompt-lookup to de-risk the rollback plumbing before adding a draft model. + +### B2. FP8 / quantized paged KV cache +- **Mechanism:** decode is BW-bound; quantizing the paged KV (llama.cpp already has q8_0/q4_0 `--cache-type-k/v`) halves the KV-gather BW and **doubles effective KV capacity → higher max concurrency**. Wire the existing quantized-KV FA-vec path through the paged block-table read (0009/0010). Matches a vLLM feature (fp8 KV). +- **Gain:** medium-high for long-context / high-concurrency decode. +- **Gate:** KL-gate (KV quant changes attention numerics; watch long-context recall), per the `8cb0ce23` precedent. +- **Risk:** medium (paged FA-read FP8 path; precision on long context). +- **Effort/reward: MED / MED-HIGH.** + +### B3. Coalesced paged-KV block layout for the in-kernel decode gather +- **Mechanism:** decode is at the LPDDR5x floor, so *effective* BW depends on coalescing. vLLM lays K as `[blocks, kv_heads, head_size/x, block_size, x]` precisely to coalesce the FA read. Re-lay-out the paged blocks so 0009/0010's in-kernel gather issues fully-coalesced vectorized loads matching the FA kernel's access pattern. +- **Gain:** medium (profile-gated: measure the FA-read achieved-BW / sector efficiency first). +- **Gate:** bit-exact (pure memory layout, identical values) → greedy md5. +- **Risk:** medium (touches paged KV manager + FA read). +- **Effort/reward: MED / MED.** Profile before building. + +### B4. Megakernel / persistent decode (single-launch fused decode step) +- **Mechanism:** fuse the per-layer decode ops into one persistent kernel that loops layers internally (à la Mirage/MPK persistent megakernel), eliminating inter-op launch overhead, inter-op global round-trips, and the host loop for the decode step; keep the recurrent state resident across the step. 
+- **Gain:** potentially high for the GPU-compute-bound serving regime (kills launch/scheduling bubbles vLLM avoids). Honest caveat: at 27-35B the activations don't fit SMEM across layers, so the win is mostly launch-overhead + scheduling, less data-residency. +- **Gate:** in principle bit-exact (same ops/order) but extremely hard to guarantee → realistically KL-gate. +- **Risk:** VERY HIGH (essentially re-implements the decode forward as one kernel). +- **Effort/reward: VERY HIGH / HIGH.** The swing-for-the-fences lever; only after cheaper decode levers are exhausted. + +### B5. Pipeline sampling off the decode critical path +- **Mechanism:** the doc names the "serial-SSM host loop / sampling can't start until logits land" as a floor. S2 (double-buffer set_inputs) was dropped because set_inputs is cheap — but the *sampling stall* between steps is different. Overlap step N's sampling + step N+1's input build with the GPU launch, so the GPU never idles waiting on host sampling. +- **Gain:** medium (recovers the inter-step sampling bubble; this is the precise residual S2 didn't target). +- **Gate:** bit-exact (host reordering only) → greedy md5. +- **Risk:** medium (ordering correctness vs the recurrent in-place state). +- **Effort/reward: MED / MED.** + +### B6. Co-batch chunked prefill INTO decode steps (vLLM-style GPU saturation — flips S3) ⭐ reframe +- **Mechanism:** S3 deliberately keeps prefill *out* of decode steps (for graph reuse). But the later measurement proved serving decode is **GPU-compute-bound, not host-bound** — which *removes S3's rationale*. vLLM does the opposite: mixes small prefill chunks into decode steps to fill otherwise-idle GPU at low decode width. Test co-batching a sized prefill chunk with decode to use spare SMs. +- **Gain:** medium at low-to-mid decode width (better GPU utilization). +- **Gate:** bit-exact (same math, scheduling only) → greedy md5. 
+- **Risk:** low-medium (it partially contradicts S3 — A/B them; the GPU-compute-bound finding says S3's reuse benefit is ~nil here, so co-batching likely wins). +- **Effort/reward: LOW-MED / MED.** Cheap A/B with high information value (directly tests the regime conclusion). + +### B7. Adaptive-width bucketed decode graph (doc-sanctioned revisit) +- **Mechanism:** the rejected padded-shape lever used fixed pad-to-`--parallel`; the doc explicitly leaves the door open for *adaptive* width (round up to next small bucket 8/16/32/64). +- **Gain:** LOW on GB10 — the same doc measured serving decode GPU-compute-bound, so graph reuse buys ~nothing here. Record as: revisit ONLY if the host loop is re-confirmed dominant on other hardware. +- **Gate:** bit-exact. +- **Effort/reward: MED / LOW (on GB10).** Note, don't build for GB10. + +## C. CROSS-CUTTING / aggregate-throughput reframes + +### C1. Exploit the 1.5-3x memory advantage for higher max concurrency ⭐ reframe +- **Mechanism:** the benchmark stops at npl=128 where both engines fit. With 1.5-3x lower memory (and synergistic with B2 FP8-KV), the paged backend can serve npl=256+ in the same VRAM where vLLM OOMs. Per-stream tok/s gap is irrelevant if paged sustains 2x the concurrent streams per GPU — aggregate tok/s/GPU can match or beat vLLM. +- **Gain:** HIGH for aggregate throughput-per-GPU at the memory ceiling (a legitimate, honestly-labeled "different operating point," not a per-stream parity claim). +- **Gate:** bit-exact (no numeric change) → greedy md5. +- **Risk:** low (scheduler/admission tuning to actually pack the streams). +- **Effort/reward: LOW / HIGH.** Cheapest high-reward lever — measure aggregate at max-concurrency, pair with B2. 
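
The C1 argument reduces to a break-even ratio: if paged sustains a fraction `r` of vLLM's per-stream rate, it needs at least `1/r` times the concurrent streams to match vLLM's aggregate. A minimal Python sketch of that arithmetic follows. The function names and the example stream counts and per-stream rates are hypothetical, not measurements from this series; only the ~65% serving ratio is taken from the text above. Per-stream rate normally falls as concurrency rises, so `r` has to be measured at the higher concurrency, not assumed from the npl=128 point.

```python
def aggregate_tps(streams: int, per_stream_tps: float) -> float:
    """Aggregate decode throughput if every stream sustains per_stream_tps."""
    return streams * per_stream_tps


def breakeven_stream_ratio(per_stream_ratio: float) -> float:
    """Concurrency multiple needed for aggregate parity.

    per_stream_ratio is (paged per-stream tok/s) / (vLLM per-stream tok/s),
    each measured at the concurrency that engine actually runs at.
    """
    if per_stream_ratio <= 0.0:
        raise ValueError("per_stream_ratio must be positive")
    return 1.0 / per_stream_ratio


if __name__ == "__main__":
    # Hypothetical operating points, for illustration only.
    vllm = aggregate_tps(streams=128, per_stream_tps=10.0)
    paged = aggregate_tps(streams=256, per_stream_tps=6.5)
    print(f"vLLM aggregate:  {vllm:.0f} tok/s")
    print(f"paged aggregate: {paged:.0f} tok/s")
    print(f"break-even stream ratio at r=0.65: {breakeven_stream_ratio(0.65):.2f}x")
```

Under these assumed numbers, doubling the stream count more than covers a 0.65 per-stream ratio, which is why the lever is framed as a different operating point and not a per-stream parity claim.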
+ +--- + +## Ranked summary (effort vs reward) + +| # | Lever | Regime | Gate | Effort | Reward | +|---|-------|--------|------|--------|--------| +| C1 | Higher max-concurrency via memory advantage (+B2) | aggregate | bit-exact | LOW | **HIGH** | +| A1 | Graph-safe ragged grouped FP4-MMA MoE kernel | prefill | bit-exact | HIGH | **HIGH** | +| B1 | Speculative decode (greedy=bit-exact; SSM rollback crux) | decode | bit-exact (greedy) | HIGH | **HIGH** | +| A2 | Multi-stream expert dispatch (→A1) | prefill | bit-exact | LOW-MED | MED | +| B6 | Co-batch chunked-prefill into decode (flips S3) | serving | bit-exact | LOW-MED | MED | +| B2 | FP8/quantized paged KV cache | decode | KL-gate | MED | MED-HIGH | +| A3 | MoE router+gather+GEMM permutation fusion | prefill | bit-exact | MED | MED | +| B3 | Coalesced paged-KV layout for decode gather | decode | bit-exact | MED | MED | +| B5 | Pipeline sampling off decode critical path | serving | bit-exact | MED | MED | +| A4 | Fused MoE FFN (up+SiLU+down resident) | prefill+decode | bit-exact | HIGH | MED-HIGH | +| A6 | Stream-K/split-K FP4 prefill GEMM | prefill | bit-exact/KL | MED | MED | +| A5 | Act-quant fusion into 0042 epilogue (prefill) | prefill | bit-exact | MED | LOW-MED | +| B4 | Megakernel/persistent decode | decode | KL-gate | VERY HIGH | HIGH | +| A7 | Prefill CUDA-graph capture (→ enabled by A1) | prefill | bit-exact | LOW | LOW | +| B7 | Adaptive-width bucketed decode graph | serving | bit-exact | MED | LOW (GB10) | + +**Suggested attack order:** (1) **C1** — near-free aggregate win exploiting the memory advantage, immediately defensible. (2) **A2→A1** — the prefill MoE GEMM is the biggest single gap and 0034 already flags A1. (3) **B6** — cheap A/B that directly tests/exploits the "serving is GPU-compute-bound" conclusion. (4) **B1** — the highest-ceiling decode lever, but gate the SSM-state rollback plumbing first via zero-draft prompt-lookup. (5) **B2/B3/B5** as the BW + bubble cleanup. 
(6) **A4 / B4** as the high-effort structural swings only if the cheaper levers leave a funded gap. + +**Two highest-value non-obvious insights:** (a) speculative decoding is *bit-exact under greedy* (md5 passes by construction) — the only throughput-multiplier compatible with the sacred gate — but its hybrid-SSM in-place-state rollback (patch 0018) is the unsolved crux. (b) the measured "serving decode is GPU-compute-bound" finding **invalidates S3's keep-prefill-out rationale** and argues for the *opposite* (B6 co-batching, vLLM-style), plus reframes the win toward aggregate-per-GPU concurrency (C1) rather than per-stream parity. + +Relevant files: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE,PREFILL_GEMM_SCOPE,PREFILL_GEMM_RESULTS,TENSORCORE_GDN_SCOPE}.md`, `.../README.md` (s4 benchmarks, s5 rejected levers), `.../docs/final_benchmark.csv`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (A1 is its flagged follow-up), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (A5 extends it). + +## 5. Synthesized prioritized lever map + +# Prioritized Lever Map - vLLM Parity, Qwen3.6 NVFP4 on GB10 (sm_121a) + +## Bottom line (where the gap actually is) +- **Prefill is the largest absolute gap**: dense ~44-48% of vLLM, MoE (decision model) ~29-41%. Two buckets own ~71% of the wall (NVFP4 GEMM ~49%, chunked GDN ~22%); the op-walk surfaces **three uncovered residuals** (MoE router/combine, prefill act-quant, FA-at-length). +- **Decode kernels are at parity-to-ahead** (GDN recurrence 102.6% of vLLM BW; both FP4 GEMMs at the BW floor). **Decode-*serving* is the still-open gap** (~66% at n=128 burst), is **MoE-specific** and **GPU-compute-bound** (host-loop/graph-reuse/padded-shape all proved neutral-or-worse, so they are closed). 
+- The two structural levers vLLM has that the series has **no equivalent for**: **MTP speculative decode** and **GPU fused sampler**. On *this* hardware vLLM is itself on a **bf16-Marlin FP4 fallback** (no tcgen05/CUTLASS-grouped), so a working native FP4 path can **match-or-beat** it, not just chase it. + +## Single highest-leverage NEXT action for the still-open decode-serving gap +**Run the both-engine steady-state serving nsys window FIRST (it is the gate before any decode kernel is funded).** Stagger ~128 clients through `llama-server` (`LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 -fa -ngl 99`) and the identical concurrency on vLLM; bucket per-step GPU-kernel time into `{MUL_MAT_ID, FA-vec/tile, gated_delta_net, bf16-projections, act-quant, sampling}` and compare **serving-narrow vs static-wide vs vLLM**. The decisive single metric: the per-useful-token time share of `MUL_MAT_ID` vs `FA` vs `gated_delta_net` in serving relative to vLLM. **Primary hypothesis to confirm/refute: H1** - MoE grouped GEMM collapsing to per-expert GEMV at ragged width, **and** count `cudaStreamSynchronize` *between* `MUL_MAT_ID` launches to catch the per-expert host-sync fallback firing. This one A/B arbitrates D2 vs D3 vs D4 (all HIGH-effort) at once, and the methodology forbids building a kernel before it. 
**Bank D1 (grouped-path guarantee) immediately as near-free insurance against the host-sync cliff regardless of outcome.** + +## Master ranked lever table (pursue list) + +| # | Lever | Gap | Gain → parity | Effort | Risk | Gate | Dependency / sequence | Status | +|---|-------|-----|--------------|--------|------|------|----------------------|--------| +| 0 | **Phase-0 serving nsys (both-engine bucket A/B)** | decode | enabling - sizes/arbitrates H1-H4 | LOW | low | n/a | none - **do first** | NOT DONE | +| 1 | **X1 (C1) Exploit 1.5-3× memory → serve npl=256+ where vLLM OOMs** | aggregate | **HIGH** (different operating point: aggregate tok/s/GPU) | LOW | low | BE | pairs w/ D6; admission tuning | NOT STARTED | +| 2 | **P1 Native FP4-MMA large-M dense GEMM (patch 0034)** | prefill | **HIGH** - GEMM ~49% of wall; can *beat* vLLM bf16-Marlin | HIGH | med | BE (md5) | foundation for P2/P8 | **IN PROGRESS (0034 scaffold landed)** | +| 3 | **D1 Guarantee grouped MMQ path - never host-sync per-expert fallback (extend 0025)** | decode | **HIGH if firing** (removes catastrophic cliff) | LOW | low | BE | gated by #0; bank regardless | NOT STARTED | +| 4 | **P3 Multi-stream expert dispatch (→P2)** | prefill | MED (partial overlap of serial syncs) | LOW-MED | med | BE | stepping-stone, bank before P2 | NOT STARTED | +| 5 | **P2 (A1) Graph-safe ragged grouped FP4-MMA MoE GEMM** | prefill | **HIGH** - the #1 prefill bucket (~28% of wall) | HIGH | med-high | BE (md5) | after P1/P3; **shares kernel arch w/ D2** | **FLAGGED 0034 follow-up** | +| 6 | **D10 (B6) Co-batch chunked-prefill into decode (flips S3)** | serving | MED (fills idle SMs at low D) | LOW-MED | low-med | BE | cheap A/B; tests "GPU-compute-bound" conclusion | NOT STARTED | +| 7 | **P4 Tensor-core chunked GDN prefill kernel (rewrite 0031)** | prefill | **HIGH** - #2 prefill bucket (~22% of wall, ~17% of gap) | HIGH | med-high | BE→KL | Gram products de-risked 6.7-9.3× | **DESIGN SCOPED, kernel NOT built** | +| 8 | 
**D2 (H1) Fused grouped-NVFP4 MoE decode GEMM + on-GPU token sort** | decode | **HIGH** - top decode hypothesis (MoE-specific) | HIGH | high | BE | gated by #0; **co-develop kernel w/ P2** | NOT STARTED | +| 9 | **D5 (B1) Speculative decode via MTP head** | decode | **HIGH** (2-3× at low/mid concurrency) | HIGH | high | BE (greedy) / KL (temp>0) | crux=SSM in-place state rollback (0018); de-risk w/ zero-draft prompt-lookup | NOT STARTED | +| 10 | **D6 (B2) FP8 / quantized paged KV cache** | decode | MED-HIGH (halves KV BW; doubles capacity → enables X1) | MED | med | KL (8cb0ce23 precedent) | wire quantized-KV FA-vec through paged read (0009/0010) | NOT STARTED | +| 11 | **D3 (H2) KV-split / flash-decoding paged FA decode** | decode | MED-HIGH (ragged-KV balance + occupancy) | MED-HIGH | med | BE→KL | gated by #0 (build only if FA bucket grows) | NOT STARTED | +| 12 | **P5 (A3+PREFILL-L1) Fused MoE router+gather+scatter+combine** | prefill | MED (~5-8% MoE wall, uncovered by P2/P4) | MED | med | BE (fp32 reorder; 8cb0ce23) | pairs w/ P2 kernel | NOT STARTED | +| 13 | **D4 (H3) Width-adaptive GDN recurrence launch params** | decode | MED (saturate grid at narrow D) | LOW-MED | low | BE (0022 col-independence) | env GDN_NW/GDN_CPW already exists | NOT STARTED | +| 14 | **D7 (B3) Coalesced paged-KV block layout for decode gather** | decode | MED (effective BW / sector efficiency) | MED | med | BE | profile-gated (#0 FA-read BW) | NOT STARTED | +| 15 | **P6 (A4) Fused MoE FFN (up→SiLU→down resident)** | prefill+decode | MED-HIGH (removes intermediate round-trip) | HIGH | high | BE→KL | after P2 | NOT STARTED | +| 16 | **D9 (B5) Pipeline host sampling off decode critical path** | serving | MED (recovers inter-step sampling bubble) | MED | med | BE | ordering vs in-place recurrent state | NOT STARTED | +| 17 | **D8 (H5/#13) GPU fused sorting-free sampler** | serving | MED (small on greedy; matters at 128-way top-k/p) | MED | med | BE-ish | alt to D9; profile to size | 
NOT STARTED | +| 18 | **P8 (A6) Stream-K / split-K FP4 prefill GEMM** | prefill | MED (small-output-grid layers on few-SM GB10) | MED | med | BE if det. else KL | profile-gated; complements P1 | NOT STARTED | +| 19 | **P7 (A5/PREFILL-L2) Act-quant fusion into 0042 epilogue (prefill)** | prefill | LOW-MED (~3-6% prefill; vLLM avoids it entirely) | MED | med | BE (md5) | extends landed 0042; after P1 | NOT STARTED | +| 20 | **P9 (#10/flag-3) Tensor-core paged prefill FA** | prefill | LOW-MED, **context-dependent (grows L²)** | MED-HIGH | med | BE→KL | re-profile FA share at real serving lengths first | NOT STARTED | +| 21 | **D11 (B4) Megakernel / persistent decode** | decode | HIGH (kills launch/scheduling bubbles) | VERY HIGH | very high | KL | last resort, only if funded gap remains | NOT STARTED | + +Gate key: BE = bit-exact (greedy md5); KL = KL-divergence gate; BE→KL = bit-exact preferred, KL fallback. + +## Drop / closed (do NOT pursue) + +| Lever | Why dropped | +|-------|-------------| +| Padded / fixed-slot decode (pad-to-`--parallel`) | Built, GPU-tested, **REJECTED** - serving decode is GPU-compute-bound; dummy-row compute > reuse recovered | +| B7 Adaptive-width bucketed decode graph | LOW value on GB10 (same GPU-compute-bound finding); revisit only if host-loop re-confirmed dominant on other HW | +| dequant→bf16 cuBLAS prefill (0033) | **REJECTED** - MMQ beat it 29-49%; superseded by native FP4-MMA (P1) | +| W4A16-Marlin / NVFP4 projections (bf16→FP4) | **REJECTED** - KL-fail; vLLM keeps SAME bf16 projections, no advantage to chase | +| bf16-tau (`ssm_bf16_tau`, patch 0026) | **DROPPED** - went flat once the decode fusions landed (780.6 vs 780.0 t/s); the series stays bit-exact end to end | +| Act-quant fusion on **decode** (lever-3) | **FLAT** - decode is BW-bound; the prefill variant (P7) is the live one | +| S2 double-buffer set_inputs | Dropped - set_inputs is cheap (host loop closed by 0040/0041) | +| H6 NVFP4 act-quant decode tax | No bit-exact lever; **exclusion check only** (expected wash vs vLLM, which also FP4-quantizes) | +| P10 (A7) Prefill CUDA-graph capture | LOW/LOW -
prefill launch overhead amortized over large M; merely *enabled* by P2, not a standalone item | +| H4 ragged-shape umbrella | Not a lever - it is the shared *root* of H1-H3; fixed by D2/D3/D4 at the kernel level | +| H5 (as exclusion) / H6 | profile-only rule-outs, not builds (D8 is the actual sampler lever) | + +## Critical-path sequence (two parallel tracks per the multi-agent GPU methodology) + +**Decode-serving track (gated):** #0 serving nsys → bank #3 (D1) → branch on the dominant bucket: if MUL_MAT_ID-GEMV → #8 (D2); if FA → #11 (D3); if recurrence → #13 (D4). In parallel, cheap A/Bs #6 (D10) and #1 (X1). Highest-ceiling greenfield #9 (D5) once SSM-rollback de-risked via zero-draft prompt-lookup. BW cleanup #10 (D6, synergistic with X1). + +**Prefill track (already moving):** #2 (P1, in progress) → #4 (P3) → #5 (P2) - and **co-develop the P2 ragged-grouped kernel with the D2 decode kernel** (one fused-MoE dispatch that degrades gracefully across M = vLLM's single fused_moe shape). In parallel #7 (P4, design ready). Then the residual-coverage adds #12 (P5), #15 (P6), #19 (P7). Profile-gated #18 (P8), #20 (P9). + +**Two highest non-obvious insights to act on:** (a) the P2 prefill kernel and the D2 decode kernel are the **same kernel** (on-GPU token sort + single persistent grouped FP4-MMA launch) at different M - fund them as one effort. (b) the "serving decode is GPU-compute-bound" finding **invalidates S3's keep-prefill-out rationale** - #6 (D10 co-batching, vLLM-style) and #1 (X1 aggregate concurrency) are the cheap wins that follow from it, and are higher-reward-per-effort than any further host-side or graph-reuse work. + +### Phase 9 MTP update + +Phase 9 adds a narrow MTP smoke gate instead of production enablement: + +- DGX asset check confirmed `qwen35moe.nextn_predict_layers` and + `blk.40.nextn.*` tensors in `/home/mudler/bench/q36-35b-a3b-nvfp4.gguf`. 
+- Default `draft-mtp` initially ran but emitted backend-sampler errors because + MTP verification batches can request more than one output row per sequence. +- Patch `0054-fix-speculative-disable-backend-sampling-for-MTP-drafts.patch` + disables backend draft sampling inside `draft-mtp`. +- After the patch, the default `draft-mtp` smoke exits cleanly with + `n_drafted=5`, `n_accept=4`, and `80.000%` acceptance. +- Canonical inference md5 gates stayed stable: + MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense + `5951a5b4d624ce891e22ab5fca9bc439`. + +MTP remains opt-in and, after Phase 15, rejected as a current GB10 serving +throughput lever. It does not supersede the GDN/paged-serving conclusions unless +a future graph/batch-shape fix changes the serving result. + +### Phase 14 MTP rollback update + +Phase 14 closes the safety gap left open by Phase 9, but still does not claim a +throughput/parity win: + +- `test-recurrent-state-rollback` passed on the actual MoE GGUF and logged + `recurrent rollback checkpoint restored successfully`. +- MTP stderr showed bounded recurrent rollback support: + `the context supports bounded partial sequence removal`. +- A partial-rejection run produced `n_drafted=39`, `n_accept=20`, + `accept=51.282%` with no backend sampler multi-output error. +- Canonical inference gates stayed green after the MTP work: + MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense + `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. + +The greedy-equivalence gate uses normalized raw-output prefix comparison rather +than exact transcript md5 because `llama-speculative-simple` emits accepted +token groups and can produce a longer completion than `llama-completion -no-cnv` +for the same `-n`. Across `n=8,16,24,32,48`, no first differing token was found. + +Phase 14 left the serving/API throughput benchmark open; Phase 15 then ran it +and rejected current MTP serving. + +### Phase 15 MTP serving update + +Phase 15 ran the direct `llama-server` serving A/B that Phase 14 enabled.
It +rejects current MTP serving as a parity lever on GB10: + +| arm | n | decode agg t/s | decode per-seq t/s | TTFT mean ms | +|---|---:|---:|---:|---:| +| baseline | 8 | 247.8 | 30.70 | 1181.1 | +| MTP | 8 | 109.8 | 14.26 | 1691.5 | +| baseline | 32 | 406.0 | 12.02 | 2762.2 | +| MTP | 32 | 111.7 | 3.61 | 4545.6 | +| baseline | 128 | 662.4 | 4.31 | 7747.2 | +| MTP | 128 | 138.5 | 0.97 | 20385.7 | + +Artifact: `/home/mudler/bench/phase15_mtp_serving/20260701_042005`. + +MTP did draft and accept tokens (`#gen tokens = 17293`, `#acc tokens = 15493`), +so this is not a no-draft false negative. The likely culprit is graph/batch +shape disruption: baseline logs show heavy graph reuse (`graphs reused = 361` +in the high-concurrency tail), while MTP logs show `graphs reused = 1` and much +higher per-slot eval time. Pre/post canonical inference gates stayed green: +MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. + +Do not keep tuning MTP draft length blindly. A follow-up must first profile +speculative verification batch shapes and CUDA graph reuse with +`nsys --cuda-graph-trace=node`. + +### Phase 16 MTP graph-reuse profile + +Phase 16 ran that profile on a smaller direct serving shape (`n=8`, `ptok=64`, +`gen=64`) with `nsys --cuda-graph-trace=node`. + +Artifact: `/home/mudler/bench/phase16_mtp_graph_profile/20260701_043016`. + +Result: + +- baseline: `decode_agg_tps=230.5`, `graphs reused = 62`, +- MTP: `decode_agg_tps=97.7`, `graphs reused = 1`, +- MTP drafted (`#gen tokens = 460`, `#acc tokens = 346`), +- `nsys stats` showed materially more GPU kernel time in MTP (~`5.89 s`) than + baseline (~`2.59 s`). + +This supports the root-cause hypothesis: current MTP serving disrupts the paged +decode graph-reuse path and increases GPU work. If MTP is reopened, start at +`tools/server/server-context.cpp` speculative verification batch construction +and graph-reuse keys, not draft-length tuning. 
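
The Phase 15 table and the Phase 16 profile can be reduced to a few ratios. A minimal sketch, using only numbers quoted above; treating `#acc tokens / #gen tokens` as the draft acceptance rate is an assumption about how those counters are defined.

```python
# Decode aggregate t/s from the Phase 15 table: n -> (baseline, MTP).
PHASE15_DECODE_AGG = {8: (247.8, 109.8), 32: (406.0, 111.7), 128: (662.4, 138.5)}


def mtp_ratio(baseline_tps: float, mtp_tps: float) -> float:
    """MTP throughput as a fraction of baseline throughput."""
    return mtp_tps / baseline_tps


for n, (baseline, mtp) in PHASE15_DECODE_AGG.items():
    print(f"n={n}: MTP/baseline = {mtp_ratio(baseline, mtp):.1%}")

# Phase 15 counters; interpreting this quotient as an acceptance rate
# is an assumption (see lead-in).
acceptance = 15493 / 17293

# Phase 16 nsys totals: approximate GPU kernel seconds, MTP vs baseline.
kernel_time_ratio = 5.89 / 2.59

print(f"acceptance = {acceptance:.1%}, GPU kernel time = {kernel_time_ratio:.2f}x baseline")
```

High acceptance combined with more than double the GPU kernel time is the pattern that points at verification cost and lost graph reuse rather than at draft quality.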
+ +### Phase 17 MTP graph-shape feasibility + +Phase 17 inspected the source path before any patch. Verdict: no small additive +graph-reuse shortcut is evident. + +Key mechanics: + +- normal decode appends one `output=true` row per generating slot; +- MTP verification appends `K + 1` `output=true` rows per speculative slot, + where `K = spec_draft.size()`; +- total shape is `sum(non_spec * 1) + sum(spec * (1 + K_i)) + prompt rows`; +- `n_tokens`, `n_seq_tokens`, `n_outputs`, KQ mask rows, position length, and + output-id count are hard graph/input dimensions; +- paged-attention block-table bucketing does not stabilize those verification + token/output dimensions. + +Rejected shortcut: fake padding rows. They would be real target decode rows with +KV, position, logits, MTP nextn embedding, sampling-index, and rollback effects, +and they resemble the already rejected fixed-slot dummy-compute experiment. + +Only plausible next step: an instrumentation-only patch around +`server_slot::handle_last_sampled_token()` to count verification shape buckets. +Only after that should an opt-in scheduling experiment group/defer MTP +verification by `1 + spec_draft.size()`. Keep it default-off and kill it if TTFT +or throughput regresses, graph reuse does not recover, or the md5/op gates drift. + +### Phase 18 MTP shape trace + +Phase 18 added that instrumentation-only patch as 0055. Set +`LLAMA_SPEC_SHAPE_TRACE=1` to log normal decode rows and MTP verification +`K + 1` row/output shapes from `server_slot::handle_last_sampled_token()`. +It is default-off and does not change scheduling, graph keys, logits, KV state, +acceptance, or rollback behavior. + +Red/green result: + +- before patch, `LLAMA_SPEC_SHAPE_TRACE=1` emitted no `spec shape:` lines; +- after patch, a tiny MTP request emitted `kind=verify` shapes with `rows=4` + and `rows=3`; +- with the env var unset, the patched server emitted no `spec shape:` lines. 
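
The `rows=4` and `rows=3` verify shapes would correspond to `K = 3` and `K = 2` under the `K + 1` rule from Phase 17. A minimal sketch of that row accounting follows; the function name is hypothetical and the slot mixes are made-up examples, not traced data.

```python
from typing import Sequence


def verify_batch_rows(non_spec_slots: int,
                      draft_lengths: Sequence[int],
                      prompt_rows: int = 0) -> int:
    """Total batch rows for one step: one row per non-speculative slot,
    K_i + 1 rows per speculative slot, plus any prompt rows."""
    if non_spec_slots < 0 or prompt_rows < 0 or any(k < 0 for k in draft_lengths):
        raise ValueError("counts must be non-negative")
    return non_spec_slots + sum(1 + k for k in draft_lengths) + prompt_rows


# Illustrative steps: eight speculative slots, all K=3, then one slot drops to K=2.
all_k3 = verify_batch_rows(0, [3] * 8)
one_k2 = verify_batch_rows(0, [3] * 7 + [2])
plain_decode = verify_batch_rows(8, [])
print(f"all K=3: {all_k3} rows, one K=2: {one_k2} rows, plain decode: {plain_decode} rows")
```

Because the row total feeds hard graph dimensions such as `n_tokens` and `n_outputs`, a single slot changing its draft length changes the shape of the whole step.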
+ +Canonical post-patch inference gates stayed green: + +- MoE `8cb0ce23777bf55f92f63d0292c756b0`; +- dense `5951a5b4d624ce891e22ab5fca9bc439`; +- `MUL_MAT_ID` `806/806`. + +Artifacts: + +- `/home/mudler/bench/phase18_mtp_shape_trace_green` +- `/home/mudler/bench/phase18_mtp_shape_trace_green/gate_after` + +Follow-up scope: before any source behavior change, run a trace-only real +serving entropy measurement. Only if repeatable draft-length buckets appear +should an opt-in group/defer-by-draft-length scheduler be built; kill it on +TTFT/throughput regression, graph-reuse failure, md5/op drift, or MTP +rollback/prefix gate failure. + +### Phase 19 MTP serving shape entropy + +Phase 19 ran the trace-only serving measurement. Artifact: +`/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534`. + +Pre/post canonical gates passed: MoE `8cb0ce23777bf55f92f63d0292c756b0`, +dense `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. + +MTP serving stayed slower: + +| n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms | +|---|---------------------|----------------|----------------|------------------|-------------| +| 8 | 245.0 | 95.7 | 39.1% | 1147.2 | 1633.4 | +| 32 | 409.2 | 110.0 | 26.9% | 2710.0 | 4471.5 | +| 128 | 697.2 | 154.0 | 22.1% | 7601.5 | 20310.4 | + +The shape trace rejects the small scheduler shortcut: + +- per-slot draft length is already stable: `draft=3` is 96.2-96.9% of verify + slots across n8/n32/n128; +- full in-flight steps already mostly use all-`draft=3` vectors; +- remaining aggregate shape churn is active-slot/tail churn plus MTP's real + `K + 1` output-row expansion; +- group/defer-by-draft would not remove the dominant row expansion and would + risk more TTFT loss. + +Decision: do not build a Phase 20 group/defer scheduler on current evidence. +Future MTP work would need a deeper target-verify graph/state design, not +another small server scheduling shortcut. 
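
The "shape entropy" measurement boils down to a histogram over per-slot draft lengths. A minimal sketch, assuming the draft lengths have already been pulled out of the trace (the trace line format is not reproduced here) and using a made-up sample rather than the Phase 19 data:

```python
import math
from collections import Counter
from typing import Iterable, Tuple


def shape_stats(draft_lengths: Iterable[int]) -> Tuple[int, float, float]:
    """Return (dominant draft length, its share, Shannon entropy in bits)."""
    counts = Counter(draft_lengths)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no verify slots recorded")
    dominant, dominant_count = counts.most_common(1)[0]
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return dominant, dominant_count / total, entropy


# Hypothetical sample: 96 slots at draft=3, a few shorter drafts.
sample = [3] * 96 + [2] * 3 + [1] * 1
dominant, share, entropy = shape_stats(sample)
print(f"draft={dominant} share={share:.1%} entropy={entropy:.3f} bits")
```

A dominant share in the mid-90s with entropy well under one bit means there is little per-slot variation left for a group/defer scheduler to remove.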
+ +### Phase 62 MTP verify-cost result + +Phase 62 is recorded in +`docs/superpowers/plans/2026-07-01-mtp-verify-cost-phase62.md`. Artifact: +`/home/mudler/bench/phase62_mtp_verify_cost/20260701_134125`. + +Pre/post default inference gates stayed green: MoE md5 +`8cb0ce23777bf55f92f63d0292c756b0`, dense md5 +`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. + +| n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms | +|---|---------------------|----------------|----------------|------------------|-------------| +| 8 | 248.5 | 104.4 | 42.0% | 1150.4 | 1682.9 | +| 32 | 411.8 | 112.8 | 27.4% | 2607.9 | 4444.7 | +| 128 | 696.5 | 148.1 | 21.3% | 7425.2 | 20155.8 | + +Final MTP stats: `7372/9340 = 0.789` accepted tokens, mean acceptance length +`3.33`, per-position acceptance `(0.877, 0.767, 0.691)`, and +`graphs reused = 1`. Shape trace again showed `draft=3` / `rows=4` dominance +at `95.6%`. + +Decision: reject another MTP implementation phase for now. Phase 62 kept default +inference green with md5/op gates, but MTP remains rejected unless a later design +removes target-verify/output-row graph cost. Do not tune `n_max` blindly. + +### Phase 20 current-stack serving snapshot + +Phase 20 refreshed the MoE serving baseline using the current clean DGX mirror +(`~/llama-phase6-source`, `f2521ab12`) and the same-session vLLM server. Artifact: +`/home/mudler/bench/phase20_current_snapshot/20260701_050621`. + +Pre/post canonical gates passed: MoE `8cb0ce23777bf55f92f63d0292c756b0`, +dense `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+ +| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg | +|---|------------------|-----------------|-------------------|-----------|----------|----------------| +| 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% | +| 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% | +| 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% | + +TTFT/prefill remains the largest user-visible gap: + +| n | paged TTFT ms | vLLM TTFT ms | paged/vLLM TTFT | paged prefill_tps | vLLM prefill_tps | +|---|---------------|--------------|------------------|--------------------|------------------| +| 8 | 783.6 | 271.8 | 2.88x | 1669.9 | 4371.5 | +| 32 | 2630.6 | 783.8 | 3.36x | 1712.8 | 5358.3 | +| 128 | 7678.7 | 2465.7 | 3.11x | 1660.4 | 5242.9 | + +Decision: the latest stack is still below vLLM serving parity on GB10. The next +credible parity path is not another MTP/scheduler shortcut; it is either the +documented datacenter-Blackwell rerun or a larger fused-kernel project outside +the low-conflict GB10 patch stack. + +### Phase 21 current-stack harness + +Phase 21 added `paged-current-serving-snapshot.sh` so Phase 20 can be repeated +without the stale DGX `combined_definitive.sh` assumptions. The script defaults +to `~/llama-phase6-source`, enforces docker/`local-ai-worker`/GPU-idle preflight, +uses the owner-file lock, runs pre/post md5/op gates, runs paged and vLLM in the +same session, and emits ratio rows in `summary.tsv`. + +Verification: + +- local `bash -n` and `--help` passed; +- DGX `DRY_RUN=1` passed and wrote + `/home/mudler/bench/phase21_harness_dryrun/20260701_051757`. + +Use this harness for future current-stack GB10 snapshots before making parity +claims. + +### Phase 24 snapshot hardware report + +Phase 24 extended `paged-current-serving-snapshot.sh` to write `hardware.txt` +after preflight and before any server launch, including in `DRY_RUN=1`. 
The +report records `nvidia-smi -L`, GPU name, driver, memory, compute capability +when available, `hardware_class`, and a parity note for that class. + +DGX dry run passed and wrote +`/home/mudler/bench/phase24_hardware_report_dryrun/20260701_052741`. It +classified the current DGX as `hardware_class=gb10_or_workstation_blackwell` +with `GPU 0: NVIDIA GB10`, driver `580.159.03`, and compute capability `12.1`. + +Use `hardware.txt` when comparing future snapshots. GB10/workstation Blackwell +results do not establish datacenter-Blackwell parity. + +### Phase 25 snapshot gate summary + +Phase 25 extended `paged-current-serving-snapshot.sh` to write +`gate_summary.tsv` after the post gate in full runs. It also added +`--summarize-gates ART` for auditing existing artifacts without launching +servers. + +The Phase 20 artifact was backfilled at +`/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv`. +It records pre/post MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 +`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806` as `ok`. + +Use `hardware.txt` plus `gate_summary.tsv` as the quick audit surface before +accepting any new parity snapshot. + +### Phase 26 audited current-stack snapshot + +Phase 26 ran the full current-stack paged-vs-vLLM MoE serving snapshot with the +Phase 24/25 audit files enabled: +`/home/mudler/bench/phase26_audited_snapshot/20260701_053650`. + +The artifact records `hardware_class=gb10_or_workstation_blackwell` on GPU +`NVIDIA GB10` with driver `580.159.03` and compute capability `12.1`. +`gate_summary.tsv` reports every pre/post gate as `ok`: MoE md5 +`8cb0ce23777bf55f92f63d0292c756b0`, dense md5 +`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. 
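
The gate audit amounts to checking every pre/post row against the canonical values. A minimal sketch of such a check follows; the real `gate_summary.tsv` layout is not shown in this document, so the `stage`/`gate`/`value`/`status` columns and the gate names below are hypothetical, while the expected md5 and op values are the canonical ones quoted above.

```python
import csv
import io
from typing import List

# Canonical gate values quoted throughout this document.
EXPECTED = {
    "moe_md5": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense_md5": "5951a5b4d624ce891e22ab5fca9bc439",
    "mul_mat_id": "806/806",
}


def audit_gates(tsv_text: str) -> List[str]:
    """Return a list of failures; an empty list means every gate passed.

    Assumes a HYPOTHETICAL layout: tab-separated columns
    stage, gate, value, status.
    """
    failures = []
    seen = set()
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        stage, gate = row["stage"], row["gate"]
        seen.add((stage, gate))
        if row["status"] != "ok":
            failures.append(f"{stage}/{gate}: status={row['status']}")
        if gate in EXPECTED and row["value"] != EXPECTED[gate]:
            failures.append(f"{stage}/{gate}: unexpected value {row['value']}")
    # A gate that never ran must not count as a pass.
    for stage in ("pre", "post"):
        for gate in EXPECTED:
            if (stage, gate) not in seen:
                failures.append(f"{stage}/{gate}: missing")
    return failures


# Synthetic all-green summary in the assumed layout.
SAMPLE = "stage\tgate\tvalue\tstatus\n" + "".join(
    f"{stage}\t{gate}\t{value}\tok\n"
    for stage in ("pre", "post")
    for gate, value in EXPECTED.items()
)

print(audit_gates(SAMPLE) or "all gates ok")
```

Flagging missing rows as failures matters for runs where the post gate is skipped or interrupted, since an absent gate would otherwise read as green.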
+ +Audited MoE serving result (`PTOK=128`, `GEN=64`): + +| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg | +|---|------------------|-----------------|-------------------|-----------|----------|----------------| +| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% | +| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% | +| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% | + +Decision: the latest audited clean-stack run still does not reach vLLM serving +parity on GB10. Treat Phase 26 as the current benchmark baseline before funding +new kernel work, and keep md5/op gates as the first check when changing the +patch stack. + +### Phase 27 graph-node-traced current-stack profile + +Phase 27 re-profiled the current clean llama.cpp n128 serving path with +`--cuda-graph-trace=node`, using the same source (`f2521ab12`) and GB10 host. +Artifact: `/home/mudler/bench/phase27_graph_node_serving/20260701_055519`. + +The profile run itself reported `decode_agg_tps=675.5`, close to Phase 26's +n128 paged `673.4`, so the trace is representative for bucket direction. Pre +gates passed, and the post-gate retry passed after Nsight teardown finished: +MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 +`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. + +Graph-node-traced macro buckets: + +| bucket | time ms | share | +|--------|---------|-------| +| GDN | 6706.33 | 33.47% | +| MoE/FFN-GEMM | 5871.92 | 29.31% | +| bf16-proj | 2725.07 | 13.60% | +| layout-copy | 1309.99 | 6.54% | +| act-quant | 697.75 | 3.48% | +| MoE-dispatch | 275.99 | 1.38% | +| FA | 271.03 | 1.35% | + +Fine rows keep the same decision shape as Phase 8: `gdn_core` is `29.59%`, +`mmq_nvfp4` is `28.44%`, while `mm_ids` is `0.61%`, `gather_mmq` is `0.37%`, +and `argsort_topk` is `0.40%`. Do not reopen metadata/helper-only MoE dispatch +work on GB10. 
Any credible source patch must directly reduce GDN, grouped-MMQ, +or projection work and still pass the md5/op gates. + +### Phase 28 NVFP4 MMQ occupancy A/B + +Phase 28 challenged the small grouped-MMQ build knobs before funding structural +kernel work. Artifact: +`/home/mudler/bench/phase28_mmq_occupancy/20260701_040450`. + +`GGML_CUDA_FP4_MINBLOCKS=2` built and passed the canonical safety gates before +and after serving: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 +`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. Same-session +n128 serving A/B rejected it on throughput: baseline `705.1` decode_agg_tps vs +`689.9` with `MINBLOCKS=2` (`0.9784x`). `GGML_CUDA_FP4_MMQ_Y=64` does not +compile against the current NVFP4 writeback invariant +`nwarps*tile_C::I == mmq_y`, so the row-tile knob is not a valid low-conflict +shortcut. + +Decision: do not promote the occupancy knobs and do not add a LocalAI patch. +The grouped-MMQ bucket still requires structural kernel work; launch-bounds and +row-tile build tweaks are closed on GB10. + +### Phase 29 default-off MoE MMQ shape trace + +Patch `0056` adds `LLAMA_MOE_MMQ_SHAPE_TRACE=` as bounded, default-off +instrumentation at the grouped-MMQ host selector. Artifact: +`/home/mudler/bench/phase29_mmq_shape_trace/20260701_042428`. Fork commit: +`20a99518a feat(cuda): trace moe mmq batch shapes`. + +The helper was added test-first (`test-cuda-mmq-shape-trace` failed on the +missing header before implementation, then passed locally and under the DGX CUDA +build). Default-off and trace-enabled gates both passed: MoE md5 +`8cb0ce23777bf55f92f63d0292c756b0`, dense md5 +`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. The +trace-enabled gate with `LLAMA_MOE_MMQ_SHAPE_TRACE=4` emitted exactly four +shape lines. + +Use this only to size the next grouped-MMQ structural kernel. 
It intentionally +does not perform device readback of `expert_bounds`, so it records selector +inputs and estimated density rather than exact per-expert histograms. + +### Phase 30 live serving MMQ shape distribution + +Phase 30 ran n128 serving with `LLAMA_MOE_MMQ_SHAPE_TRACE=4096` on the patched +DGX mirror (`826c97a05`). Artifact: +`/home/mudler/bench/phase30_mmq_shape_serving/20260701_043300`. + +The first 4096 grouped-MMQ calls split into 1200 decode-like calls +(`ncols_max <= 128`) and 2896 prefill-like calls. Decode-like calls used +densities `1-4` and selected only `mmq_x_best` `32/40/48/64` +(`64`: 480, `32`: 360, `40`: 240, `48`: 120). Prefill-like calls were mostly +density `16` and selected `mmq_x_best=128` for 1816 calls. Every traced call had +`stream_k=1`. + +Kernel implication: the next grouped-MMQ structural experiment should target +small-M decode tiles (`ncols_max` 26-111, density 1-4) separately from prefill. +The current stream-k/fixup path is part of the measured shape and cannot be +ignored by a replacement kernel. + +### Phase 31 live serving MMQ launch distribution + +Phase 31 added patch `0057`, extending `LLAMA_MOE_MMQ_SHAPE_TRACE=` with +`[LLAMA_MOE_MMQ_LAUNCH]` lines emitted from `launch_mul_mat_q` after the actual +stream-k launch policy is known. Artifact: +`/home/mudler/bench/phase31_mmq_launch_trace/20260701_064424`. + +The default-off, trace-enabled, and post-serving gates all stayed bit-exact: +MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. 
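
The decode-like / prefill-like split used in the Phase 30 and Phase 31 tallies is a single threshold on `ncols_max`. A minimal sketch of that bucketing, assuming the `ncols_max` values have already been extracted from the trace lines; the helper names and the sample values are made up for illustration.

```python
from collections import Counter
from typing import Iterable

# Threshold quoted in the Phase 30 and Phase 31 tallies.
DECODE_LIKE_MAX_NCOLS = 128


def bucket(ncols_max: int) -> str:
    """Classify one grouped-MMQ call by its widest column count."""
    return "decode-like" if ncols_max <= DECODE_LIKE_MAX_NCOLS else "prefill-like"


def bucket_counts(ncols_values: Iterable[int]) -> Counter:
    """Tally calls per bucket."""
    return Counter(bucket(n) for n in ncols_values)


# Illustrative values only, not traced data.
sample = [26, 111, 128, 129, 510]
print(dict(bucket_counts(sample)))
```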
+ +Live n128 serving with `LLAMA_MOE_MMQ_SHAPE_TRACE=4096` produced: + +| bucket | launch lines | `fixup=1` | `stream_k_blocks == ntiles_dst` | tile efficiency | +|--------|--------------|-----------|----------------------------------|-----------------| +| decode-like (`ncols_max <= 128`) | 4800 | 0 | 4800 | 96-99 | +| prefill-like (`ncols_max > 128`) | 4920 | 0 | 4920 | 99-100 | + +Lever implication: a no-fixup/no-stream-k shortcut is rejected for the measured +n128 serving workload. The launch code is already choosing conventional +stream-k tiling with no fixup; the remaining gap is the small-M grouped-MMQ +kernel shape itself, not launch/fixup overhead. + +### Phase 32 small-M MMQ candidate classifier + +Phase 32 added patch `0058`, a default-off +`LLAMA_MOE_MMQ_SMALL_M_TRACE=` classifier for decode-like low-density MoE +grouped-MMQ calls. Artifact: +`/home/mudler/bench/phase32_small_m_classifier/20260701_070127`. + +The default-off, trace-enabled, and post-serving gates all stayed bit-exact: +MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. + +Live n128 serving with `LLAMA_MOE_MMQ_SMALL_M_TRACE=4096` found 4096 candidate +calls: + +| metric | notable values | +|--------|----------------| +| `mmq_x_best` | `64`: 1800, `48`: 1096, `40`: 360, `32`: 360, `16`: 360, `24`: 120 | +| density | `4`: 1440, `3`: 1336, `1`: 840, `2`: 480 | + +Lever implication: Phase 33 should A/B a default-off small-M tile policy, first +forcing candidate calls to `mmq_x=16` and only then trying `8` if it compiles +and keeps the NVFP4 tile invariants. This matches the vLLM/Marlin lesson that +low-density routed expert rows want smaller M blocks, without porting Marlin, +Triton, TMA, tcgen05, or layout repack machinery. + +### Phase 33 small-M tile policy rejection + +Phase 33 added patch `0059`, default-off `LLAMA_MOE_SMALL_M_TILE=`, and +tested the obvious vLLM-like shortcut on the Phase 32 candidate population. 
+Artifact: `/home/mudler/bench/phase33_small_m_tile_policy/20260701_071136`. + +Default-off, tile16, tile8, and post-serving gates were all bit-exact: MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. + +Same-session n128 serving: + +| mode | decode_agg_tps | ratio | +|------|----------------|-------| +| baseline | 672.1 | 1.000x | +| `LLAMA_MOE_SMALL_M_TILE=16` | 640.3 | 0.953x | +| `LLAMA_MOE_SMALL_M_TILE=8` | 583.2 | 0.868x | + +Lever implication: smaller `mmq_x` alone is rejected for n128 serving. The +remaining grouped-MMQ gap is not solved by emulating Marlin's small `block_size_m` +with the current MMQ kernel; a future attempt must alter the kernel's internal +work partitioning or move to a different bottleneck. + +### Phase 34 MMID route trace + +Phase 34 added patch `0060`, default-off `LLAMA_MOE_MMID_ROUTE_TRACE=`, to +classify the live `MUL_MAT_ID` dispatch route without changing the route. Artifact: +`/home/mudler/bench/phase34_mmid_route_trace/20260701_072737`. + +Default-off, trace-enabled, and post-serving gates were all bit-exact: MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. + +Live n128 serving with `LLAMA_MOE_MMID_ROUTE_TRACE=4096` produced: + +| route | count | host sync | +|-------|-------|-----------| +| grouped `mmq` | 2776 | 0 | +| `mmvq` | 1320 | 0 | +| `mmf` | 0 | 0 | +| fallback | 0 | 0 | + +Top route shapes were `mmq ne2=12` (1096), `mmq ne2=18` (480), and +`mmvq ne2=8` (360). Lever implication: the old D1 concern that current n128 +serving might fall into the per-expert host-sync fallback is refuted for this +stack. The remaining MoE route issue is grouped-MMQ small-M efficiency, not +fallback dispatch avoidance. + +### Phase 35 regular MUL_MAT route trace + +Phase 35 added patch `0061`, default-off `LLAMA_MUL_MAT_ROUTE_TRACE=`, to +classify regular `MUL_MAT` routes for the `bf16-proj` serving bucket. 
Artifact: +`/home/mudler/bench/phase35_mul_mat_route_trace/20260701_074359`. + +Default-off, trace-enabled, and post-serving gates were all bit-exact: MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID` +`806/806`. + +Live n128 serving with `LLAMA_MUL_MAT_ROUTE_TRACE=8192` produced: + +| route | count | +|-------|-------| +| `mat_f` | 2888 | +| `op_cublas` | 2292 | +| `mmq` | 1328 | +| `vec_q` | 1214 | +| `vec_f` | 470 | + +The trace was BF16-heavy (`type=30`: 3965 calls), mostly `mat_f=2485` and +`op_cublas=1330`. Top BF16 shapes were `mat_f ne1=12` (775), +`op_cublas ne1=18` (760), and `mat_f ne1=8` (570); `ne12=ne13=1` throughout the +top shapes, so batched cuBLAS is not the measured target. + +Lever implication: the next projection phase should add cuBLAS/MMF subroute +detail or test a narrow BF16 route policy for the generic `op_cublas` shapes. +Do not spend time on batched cuBLAS for this n128 serving slice. If MTP is enabled +in a future serving configuration, first isolate `mtp_eh_proj` / shared-head +projection with `llama-debug --tensor-filter 'mtp_|h_nextn|nextn|ffn_|attn_'` +before optimizing ordinary decoder projections. + +### Phase 36 cuBLAS subroute trace + +Phase 36 added patch `0062`, default-off `LLAMA_CUBLAS_ROUTE_TRACE=`, to +classify the generic cuBLAS `MUL_MAT` subroute without changing branch behavior. +Artifact: `/home/mudler/bench/phase36_cublas_route_trace/20260701_081228`. + +Default-off, trace-enabled, and post-serving gates were all bit-exact: MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID` +`806/806`. + +Live n128 serving with `LLAMA_CUBLAS_ROUTE_TRACE=8192` produced: + +| cuBLAS route | count | +|--------------|------:| +| `bf16_tc` | 5681 | +| `sgemm` | 2511 | + +Top SGEMM shapes were `type=0 row_diff=256/1 src1_ncols=510 ne00=2048 +ne10=2048`. 
Lever implication: the measured `op_cublas` bucket is BF16 +tensor-core plus F32 SGEMM, not NVFP4 cuBLAS and not batched cuBLAS. The next +projection phase should explain whether the F32 SGEMM shapes are expected glue +tensors or a missed BF16 route, with md5/op gates before any route policy A/B. + +### Phase 37 cuBLAS tensor-name trace + +Phase 37 added patch `0063`, extending `LLAMA_CUBLAS_ROUTE_TRACE=` with +`src0`, `src1`, and `dst` names. Artifact: +`/home/mudler/bench/phase37_cublas_name_trace/20260701_083227`. + +Default-off, trace-enabled, and post-serving gates stayed bit-exact: MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID` +`806/806`. + +Live n128 serving with trace cap 4096 found `bf16_tc=2884`, `sgemm=1212`. +The `sgemm type=0` entries are the MoE gate logits and shared-expert gate +projections: `blk.N.ffn_gate_inp.weight -> ffn_moe_logits-N` and +`blk.N.ffn_gate_inp_shexp.weight -> shared_expert_gate-N`. Attention and SSM +projections in the sample are already `bf16_tc`. + +Lever implication: do not blindly force the `sgemm` bucket to BF16. First inspect +why `ffn_gate_inp*` loads as F32 and whether a dtype or graph-route change is +precision-safe. If attempted, use md5/op gates plus KL validation. + +### Phase 38 gate projection policy + +Phase 38 re-ran the current Phase37 build safety gate before changing policy: +artifact `/home/mudler/bench/phase38_gate_baseline/20260701_084410`, MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID` +`806/806`. + +Source check: llama.cpp's Qwen35MoE graph uses `ffn_gate_inp.weight` for +`ffn_moe_logits` and `ffn_gate_inp_shexp.weight` for `shared_expert_gate`. 
vLLM +Qwen3-Next also constructs those gates with `quant_config=None`; the relevant +vLLM idea is not reduced precision, but concatenating router and shared-expert +gate weights in the fused-MoE runner when shared-expert fusion is active. + +Lever implication: keep `ffn_gate_inp*` as inference-critical F32 policy. A +future low-conflict experiment may test a default-off fused F32 gate projection +that computes both logits in one matmul and splits the output, but it must pass +MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates before benchmarking. If md5 +changes, run the KL gate first and reject on any KL regression. + +### Phase 39 gate fusion feasibility + +Phase 39 rejected the tempting low-conflict implementation of the Phase38 idea: +do not build a graph-time `ggml_concat()` of `ffn_gate_inp.weight` and +`ffn_gate_inp_shexp.weight` just to issue one combined gate matmul. Phase37 +proved the named `sgemm` bucket is the two gate projections, but Phase27's +graph-node serving profile already has `concat_layout=459.84ms` (`2.29%`, +`2250` instances) in a `20.0372s` kernel window. Adding another concat path for +weights would likely trade one small SGEMM shortcut for more layout-copy work. + +The follow-up remains valid only in the persistent-weight form: create a +load-time F32 combined gate tensor, run one matmul, and view/split the output +into `ffn_moe_logits` and `shared_expert_gate`. That is a model-loader/weight +layout feature, not a graph shortcut. It must stay default-off until MoE/dense +md5, `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates pass. + +### Phase 40 max-concurrency C1 check + +Phase 40 tested whether paged KV's memory advantage creates a higher-concurrency +GB10 serving point that closes the vLLM gap. Artifact: +`/home/mudler/bench/phase40_max_concurrency/20260701_090012`. The run used +`PARALLEL=256`, `CTX=262144`, `PTOK=128`, `GEN=64`, `NPL="128 192 256"`, and +`OPS=MUL_MAT,MUL_MAT_ID`. 
+ +Pre/post gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID` +`806/806`. + +Result: + +| n | paged decode / vLLM | paged per-seq / vLLM | paged agg / vLLM | paged TTFT / vLLM | +|---|---------------------|----------------------|------------------|-------------------| +| 128 | `0.6630` | `0.5908` | `0.4986` | `3.1682` | +| 192 | `0.5737` | `0.5123` | `0.4562` | `3.0216` | +| 256 | `0.6354` | `0.5359` | `0.4721` | `2.9401` | + +Decision: C1 does not close GB10 parity at `PTOK=128`, `GEN=64`, and `n<=256`. +Paged safely serves `n=256`, but vLLM also fits and remains faster. Do not use +the memory-footprint advantage as a parity claim at this tested point; any +future C1 retry must push beyond it and keep md5 plus `MUL_MAT`/`MUL_MAT_ID` +gates. + +### Phase 41 low-concurrency serving check + +Phase 41 measured the low-concurrency serving regime where any remaining +host/scheduler gap should be most visible. Artifact: +`/home/mudler/bench/phase41_low_concurrency/20260701_091437`. The run used +`PARALLEL=32`, `CTX=32768`, `PTOK=128`, `GEN=64`, `NPL="1 8 32"`, and +`OPS=MUL_MAT,MUL_MAT_ID`. + +Pre/post gates stayed green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID` +`806/806`. + +Result: + +| n | paged decode / vLLM | paged per-seq / vLLM | paged agg / vLLM | paged TTFT / vLLM | +|---|---------------------|----------------------|------------------|-------------------| +| 1 | `0.7493` | `0.7501` | `0.7496` | `1.3830` | +| 8 | `0.7518` | `0.7398` | `0.6334` | `3.1425` | +| 32 | `0.6649` | `0.6397` | `0.5282` | `3.4014` | + +Decision: low-concurrency remains a gap, but Phase41 does not reopen +D1/full-step graph capture. 
Patch `0043` already ships grouped-MMQ full-step +decode graph capture default-on, Phase34 found `host_sync=0/4096`, and S3 is +intentionally default-off because it hurts TTFT/end-to-end throughput. Treat +D1 as closed on the current GB10 path unless a fresh route trace proves a +host-sync fallback or graph-disable condition has returned. TTFT evidence keeps +prefill GDN/MoE work in scope for serving quality. + +### Phase 42 target reconciliation + +Phase 42 challenged the current target list with three read-only subagent +reviews: + +- D1/full-step graph capture: closed on the current GB10 path. `0040` S1 is + default-on graph reuse, `0041` S3 is opt-in only, and `0043` D1 is default-on + grouped-MMQ full-step CUDA graph capture. +- GDN prefill: the shipped GB10 wins are `0046`/`0047`; later C32 slab, + QS-early, and Global-Ai32 variants were correctness-clean but slower. Do not + add another low-conflict GDN reorder on GB10. +- W4A16 / prefill GEMM: `0033`/`0034`/`0035` remain default-off; `0048`-`0050` + improved forced W4A16 only marginally and still did not beat default MMQ. Do + not add another small W4A16 body/metadata tweak. + +Phase 43 then checked the Phase39 load-time combined-gate candidate against the +actual Qwen35MoE model-loader path and rejected it as a small shortcut. +`ffn_gate_inp.weight` and +`ffn_gate_inp_shexp.weight` are separate GGUF tensors consumed by separate graph +matmuls; `create_tensor(...)` only materializes tensors from GGUF metadata, and +`create_tensor_as_view(...)` can view existing tensors but cannot create a new +persistent concatenated derived weight. A correct load-time combined gate would +need a general derived-weight allocation/materialization path across mmap, +offload, split buffers, and MTP blocks. Do not implement a Qwen-only loader hack, +and do not fall back to graph-time `ggml_concat()`. + +The resulting GB10 state after Phase43: no remaining low-conflict shortcut patch +is justified by the current evidence. 
Future work needs either a larger funded +kernel/loader design or a hardware-pivot benchmark, with the canonical +MoE/dense md5, `MUL_MAT`, `MUL_MAT_ID`, and KL-if-md5-changes gates. + +### Phase 44 hardware-pivot harness readiness + +Phase 44 makes `paged-current-serving-snapshot.sh` usable for hardware-pivot +comparisons without editing the script for each vLLM deployment shape. It adds +environment overrides for `VLLM_GPU_MEMORY_UTILIZATION`, `VLLM_MAX_MODEL_LEN`, +`VLLM_MAX_NUM_SEQS`, `VLLM_TENSOR_PARALLEL_SIZE`, and whitespace-split +`VLLM_EXTRA_ARGS`, then prints the resolved values in `DRY_RUN=1` output. + +This is deliberately a harness-only phase. It does not change inference code, +does not regenerate the llama.cpp patch series, and does not produce a new +throughput result. Its purpose is to keep the audited methodology portable: +future non-GB10 snapshots can carry the same `hardware.txt`, pre/post md5, +`MUL_MAT`/`MUL_MAT_ID`, and KL-if-md5-changes gates while using hardware-specific +vLLM serving limits. + +### Phase 45 inference gate guard + +Phase 45 ran the canonical paged inference safety gate after the Phase44 harness +change. Artifact: +`/home/mudler/bench/phase45_inference_gate_guard/20260701_094320`. + +Results stayed green on the DGX phase36 build: MoE md5 +`8cb0ce23777bf55f92f63d0292c756b0`, dense md5 +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and +`MUL_MAT_ID` `806/806`. This confirms the current build still satisfies the +inference-safety gates before any later hardware-pivot or larger kernel work. + +### Phase 46 served-model-name harness readiness + +Phase 46 removes the hardcoded `q36` served model name from +`paged-current-serving-snapshot.sh`. The new `SERVED_MODEL_NAME` environment +variable defaults to `q36` and is used consistently for vLLM +`--served-model-name`, the vLLM `/v1/models` readiness check, and h2h `--model` +requests on both arms. 
+ +DGX dry-run artifact: +`/home/mudler/bench/phase46_served_model_name_dryrun/20260701_094849`. +Preflight was clean and the dry run printed +`SERVED_MODEL_NAME=dense-q36` before any server launch. This is another +harness-only portability step for dense or hardware-pivot snapshots; it does not +change inference code or produce a new throughput result. + +### Phase 47 dense serving snapshot attempt + +Phase 47 attempted a dense audited serving snapshot with +`MODEL=$HOME/bench/q36-27b-nvfp4.gguf`, +`VLLM_MODEL=$HOME/bench/q36-27b-nvfp4-vllm`, and +`SERVED_MODEL_NAME=dense-q36`. Dry-run artifact: +`/home/mudler/bench/phase47_dense_serving_dryrun/20260701_095141`. + +The full attempt at +`/home/mudler/bench/phase47_dense_serving/20260701_095151` is incomplete and is +not a parity result. Pre-gates passed and the paged dense arm completed through +`n=128`, but vLLM dense startup exceeded the old fixed readiness budget before +any vLLM result JSONs were produced. Use this artifact only as the root-cause +input for Phase48. + +### Phase 48 serving harness readiness hardening + +Phase 48 fixes the harness issue exposed by Phase47. It adds +`LLAMA_READY_ATTEMPTS` and `VLLM_READY_ATTEMPTS`, bounds each readiness probe +with `curl --max-time 2`, and replaces direct server waits with bounded cleanup +that escalates from `SIGTERM` to `SIGKILL`. + +DGX dry-run artifact: +`/home/mudler/bench/phase48_readiness_harness_dryrun/20260701_100533`. The dry +run printed `VLLM_READY_ATTEMPTS=700` with clean preflight. Retry dense serving +snapshots with this hardening before interpreting dense paged-vs-vLLM ratios. + +### Phase 47 dense serving snapshot retry + +After Phase48, the dense snapshot completed at +`/home/mudler/bench/phase47_dense_serving_retry/20260701_100811` with pre/post +gates green: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID` +`806/806`. 
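The same four-part gate (MoE md5, dense md5, `MUL_MAT`, `MUL_MAT_ID`) brackets nearly every phase in this log, so it may help to state it once as code. The following is a minimal Python sketch, assuming the gate outputs have already been collected as strings. `CANONICAL`, `parse_pass_count`, and `gates_green` are hypothetical names for illustration; they are not part of the bench harness, which runs the real gates from its own scripts.

```python
# Hypothetical helper: none of these names exist in the LocalAI harness.
# The md5 values and op counts are the canonical gate values quoted in this log.
CANONICAL = {
    "moe_md5": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense_md5": "5951a5b4d624ce891e22ab5fca9bc439",
    "MUL_MAT": 1146,
    "MUL_MAT_ID": 806,
}


def parse_pass_count(text):
    """Split a 'passed/total' string such as '1146/1146' into two ints."""
    passed, total = text.strip().split("/")
    return int(passed), int(total)


def gates_green(moe_md5, dense_md5, mul_mat, mul_mat_id):
    """Return (ok, failures) for one pre-run or post-run gate sample."""
    failures = []
    if moe_md5 != CANONICAL["moe_md5"]:
        failures.append("moe_md5")
    if dense_md5 != CANONICAL["dense_md5"]:
        failures.append("dense_md5")
    for op, text in (("MUL_MAT", mul_mat), ("MUL_MAT_ID", mul_mat_id)):
        passed, total = parse_pass_count(text)
        # Every op test must pass, and the suite size must match the canonical count.
        if passed != total or total != CANONICAL[op]:
            failures.append(op)
    return (not failures, failures)
```

A red md5 entry in `failures` is not by itself a rejection in this log's policy: Phase38 requires running the KL gate first when md5 changes, and rejecting on any KL regression.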
+ +Dense paged-vs-vLLM ratios: + +| n | paged decode / vLLM | paged per-seq / vLLM | paged agg / vLLM | paged TTFT / vLLM | +|---|---------------------|----------------------|------------------|-------------------| +| 1 | `1.3434` | `1.3488` | `1.3021` | `1.8746` | +| 8 | `1.1560` | `1.1493` | `0.9142` | `4.0467` | +| 32 | `0.9036` | `0.8382` | `0.6168` | `3.6450` | +| 128 | `0.7912` | `0.6436` | `0.5071` | `3.2011` | + +Decision: dense low-N decode remains a real paged strength, but dense serving +still does not close GB10 parity because TTFT and high-concurrency aggregate +throughput remain substantially behind vLLM. + +### Phase 50 dense true decode profile + +Phase50 profiles dense `npl=128`, `npp=128` decode with graph nodes expanded and +uses the difference method (`ntg=64 - ntg=16`) instead of the Phase47 h2h +serving window. Artifact: +`/home/mudler/bench/phase50_dense_true_decode/20260701_103120`. + +Pre/post inference gates stayed green on the profiled `build-cuda` binary set: +MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and +`MUL_MAT_ID` `806/806`. A `build-phase36` pre-gate also passed, but +`build-phase36` did not contain `llama-batched-bench`, so `build-cuda` is the +profiled/gated build for this phase. + +Results: + +| engine | ntg16 wall s | ntg64 wall s | delta tokens | delta wall s | true decode t/s | +|--------|--------------|--------------|--------------|--------------|-----------------| +| paged | `5.754` | `21.768` | `6144` | `16.014` | `383.66` | +| vLLM | `13.041` | `27.165` | `6144` | `14.124` | `435.00` | +| ratio | | | | | `0.8820` | + +Decision: Phase47's dense high-N serving loss is not just a kernel-speed gap. +True dense decode is still behind vLLM by about `12%`, but the Phase47 h2h +decode ratio at `n=128` was `0.7912` and aggregate serving was only `0.5071`. +The remaining difference points at scheduler/admission, prefill overlap, and +TTFT accounting. 
Next implementation target should be an opt-in +batch-composition/admission trace in `server_context::pre_decode()` before any +new GDN/GEMM shortcut. + +### Phase 51 serving admission trace + +Phase51 adds that trace in the llama.cpp fork. Fork commit: +`c6cb8460e feat(server): trace serving admission batches`. + +The change is default-off behind `LLAMA_SERVING_TRACE=1` and does not change +inference decisions. It records aggregate scheduler-shape counters from +`server_context_impl::pre_decode()`: decode tokens, prompt tokens admitted, +waiting prompt slots, started/continued prompt slots, decode-only steps, +`n_batch`, `n_ubatch`, `prefill_budget_step`, and `prefill_cap_per_slot`. + +Verification: + +- Red test first: `test-server-admission-trace` failed before + `server-admission-trace.h` existed. +- Local fork: unit test and `llama-server` build passed. +- DGX artifact: + `/home/mudler/bench/phase51_serving_admission_trace/20260701_110130` +- DGX patched `build-cuda` CTest passed. +- DGX patched `build-cuda` inference gates stayed green: MoE + `8cb0ce23777bf55f92f63d0292c756b0`, dense + `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and + `MUL_MAT_ID` `806/806`. + +Mirror status: pending explicit approval to push the fork branch, then +regenerate the LocalAI patch series from the pushed fork commit. + +### Phase 52 dense admission trace + +Phase52 used the Phase51 trace on DGX to measure dense `n=128`, `ptok=128`, +`gen=64` llama-server admission. Artifact: +`/home/mudler/bench/phase52_dense_admission_trace/20260701_111017`. + +The traced build was bracketed by canonical gates, all green before and after: +MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and +`MUL_MAT_ID` `806/806`. 
+ +Clean trace: + +| h2h wall s | decode agg t/s | TTFT mean ms | steps | decode-only steps | decode tokens | prompt tokens | max waiting prompt slots | +|------------|-----------------|--------------|-------|-------------------|---------------|---------------|--------------------------| +| `58.921` | `360.5` | `23171.5` | `76` | `0` | `8064` | `22785` | `35` | + +Decision: the default scheduler never emitted pure decode steps for this +high-N dense run. Prompt tokens matched h2h exactly, and prompt admission used +the stock path (`prefill_budget_step=0`, `prefill_cap_per_slot=0`). This +supports the Phase50 conclusion that the remaining high-N serving gap is +scheduler/admission and TTFT shaped. Next lever should be a default-off +admission-policy A/B or per-step histogram trace, not immediate kernel work. + +### Phase 53 admission budget sweep + +Phase53 tested the already-existing default-off budget knobs: +`LLAMA_MAX_BATCH_TOKENS=1536/1024` with `LLAMA_PREFILL_CAP=512`, using the same +dense `n=128`, `ptok=128`, `gen=64` traced serving shape. Artifact: +`/home/mudler/bench/phase53_dense_admission_budget_sweep/20260701_111915`. + +Pre/post md5 and op gates stayed green: MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and +`MUL_MAT_ID` `806/806`. + +| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | wall s | max waiting prompt slots | +|---------|---------|-----------------|-------------|--------------|--------|--------------------------| +| default Phase52 | `139.0` | `360.5` | `629.5` | `23171.5` | `58.921` | `35` | +| `T=1536 cap=512` | `134.4` | `376.7` | `607.0` | `22263.7` | `60.968` | `26` | +| `T=1024 cap=512` | `130.0` | `392.4` | `565.2` | `23234.3` | `63.003` | `16` | + +Decision: simple budget shrinkage is rejected as a parity lever. 
It improves +the h2h decode-agg metric by starving/slimming prompt admission, but aggregate +throughput and prefill throughput fall, and TTFT does not materially improve. +Next scheduler work should collect per-step histograms or test a targeted +first-token admission policy. + +### Phase 54 admission histogram trace + +Phase54 extended the Phase51 default-off trace with prompt-token, +decode-token, and waiting-slot histograms. Fork stack: +`c6cb8460e feat(server): trace serving admission batches` and +`bd7b2e952 feat(server): add admission trace histograms`. + +Artifact: +`/home/mudler/bench/phase54_admission_hist_trace/20260701_113201`. + +Pre/post md5 and op gates stayed green on the temporary DGX patch stack: +MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and +`MUL_MAT_ID` `806/806`. + +The Phase52-aligned dense run used `n=128`, `ptok=168`, `gen=64`, producing +`prompt_tok_total=22913`, `agg_tps=138.1`, `decode_agg_tps=360.2`, +`prefill_tps=626.7`, `ttft_mean_ms=23393.2`, and `wall_s=59.303`. + +Trace: + +```text +steps=76 decode_only_steps=0 decode_tokens=8064 prompt_tokens=22913 waiting_prompt_slots=267 max_waiting_prompt_slots=34 prompt_hist=0:63,1-64:1,513+:12 decode_hist=0:3,1-63:10,64-127:10,128-255:53 waiting_hist=0:63,1-7:1,8-15:2,16-31:9,32-63:1 +``` + +Decision: the scheduler does not spend every step over-admitting prompt work. +Most steps have no waiting prompts and no prompt tokens, while prompt admission +is concentrated into a small number of large chunks. This rejects global +budget-shrinkage as the next path and points to a targeted first-token +admission or prompt-front-loading A/B, gated by the same md5 and backend-op +checks. + +### Phase 55 TTFT prefill-first scheduler A/B + +Phase55 implemented the targeted first-token admission A/B proposed by +Phase54. 
It is default-off behind `LLAMA_TTFT_PREFILL_FIRST=1`; while any prompt +is still waiting for first-token admission, it defers token 2+ decode rows from +already-started streams. This does not lower `LLAMA_MAX_BATCH_TOKENS` and does +not change default scheduling. + +Fork commit: +`8a97629a4 feat(server): add TTFT prefill-first scheduler mode`. + +Artifact: +`/home/mudler/bench/phase55_ttft_prefill_first/20260701_114929`. + +Pre/post/after-A-B md5 and op gates stayed green on the temporary DGX patch +stack: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID` +`806/806`. + +Dense `n=128`, `ptok=168`, `gen=64` A/B: + +| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | +|---------|---------|-----------------|-------------|--------------|-------------|--------| +| default | `138.2` | `361.3` | `626.0` | `23231.9` | `36599.5` | `59.272` | +| `LLAMA_TTFT_PREFILL_FIRST=1` | `142.9` | `336.9` | `694.2` | `21520.8` | `33008.2` | `57.323` | + +Trace comparison: + +- Default: `ttft_deferred_decode_slots=0`, + `prompt_hist=0:63,1-64:1,513+:12`, + `decode_hist=0:3,1-63:10,64-127:10,128-255:53`. +- Opt-in: `ttft_deferred_decode_slots=660`, + `prompt_hist=0:63,1-64:1,257-512:1,513+:11`, + `decode_hist=0:13,128-255:63`. + +Decision: keep the policy as a promising default-off scheduler A/B. It improved +dense aggregate throughput by `+3.4%`, mean TTFT by `-7.4%`, max TTFT by +`-9.8%`, and wall time by `-3.3%`. The h2h decode-agg drop is expected because +the policy shifts early compute from token 2+ decode to first-token prompt +admission. Before any default-on discussion, test MoE serving and at least one +additional concurrency point. + +### Phase 56 TTFT prefill-first validation + +Phase56 made no code changes. It reapplied the Phase55 stack temporarily on DGX +and tested the opt-in policy on MoE `n=128` and dense `n=32`. 
Artifact: +`/home/mudler/bench/phase56_ttft_prefill_first_validation/20260701_115852`. + +Pre/post md5 and op gates stayed green: MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID` +`806/806`. + +MoE `n=128`, `ptok=128`, `gen=64`: + +| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | +|---------|---------|-----------------|-------------|--------------|-------------|--------| +| default | `341.1` | `651.2` | `1555.9` | `7168.1` | `11435.5` | `24.015` | +| `LLAMA_TTFT_PREFILL_FIRST=1` | `339.9` | `623.8` | `1622.7` | `7615.3` | `10964.4` | `24.098` | + +Dense `n=32`, `ptok=168`, `gen=64`: + +| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | +|---------|---------|-----------------|-------------|--------------|-------------|--------| +| default | `104.3` | `197.1` | `617.2` | `7687.7` | `9234.4` | `19.627` | +| `LLAMA_TTFT_PREFILL_FIRST=1` | `106.7` | `193.5` | `662.1` | `7284.3` | `8609.1` | `19.194` | + +Decision: keep `LLAMA_TTFT_PREFILL_FIRST=1` opt-in only. It helps dense +serving at `n=128` and `n=32`, but MoE `n=128` regresses mean TTFT by `+6.2%` +and aggregate throughput by `-0.4%`. Do not promote it as a broad default. +Future scheduler work should either narrow the policy to dense/non-MoE shapes or +make the defer condition more selective for MoE. + +### Phase 57 capped TTFT defer sweep + +Phase57 added `LLAMA_TTFT_PREFILL_FIRST_MAX_DEFER` as an optional per-step cap +on the Phase55 policy. Unset or `0` keeps the Phase55 unlimited behavior. +Artifact: `/home/mudler/bench/phase57_ttft_cap_sweep/20260701_120830`. + +Pre/post md5 and op gates stayed green: MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID` +`806/806`. 
+ +MoE `n=128`, `ptok=128`, `gen=64`: + +| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | +|---------|---------|-----------------|-------------|--------------|-------------|--------| +| default | `337.1` | `652.0` | `1516.1` | `7425.5` | `11735.7` | `24.299` | +| cap16 | `330.2` | `611.5` | `1559.6` | `7589.4` | `11407.9` | `24.802` | +| cap32 | `335.3` | `624.6` | `1572.4` | `6994.0` | `11315.5` | `24.429` | +| cap64 | `327.1` | `589.6` | `1596.9` | `7533.2` | `11141.5` | `25.025` | + +Dense `n=128`, `ptok=168`, `gen=64`: + +| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | +|---------|---------|-----------------|-------------|--------------|-------------|--------| +| default | `141.4` | `360.6` | `650.8` | `22423.5` | `35209.6` | `57.925` | +| cap32 | `139.7` | `340.1` | `663.1` | `20346.5` | `34556.0` | `58.645` | +| cap64 | `136.3` | `333.4` | `645.2` | `22461.1` | `35511.7` | `60.081` | + +Decision: reject capped defer as a parity lever. cap32 is the only interesting +MoE point, but it trades lower mean TTFT for lower aggregate throughput and +higher wall time. Dense caps also lose aggregate. Keep the cap as an opt-in A/B +knob only. + +### Phase 58 waiting-threshold TTFT defer + +Phase58 added `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING`, so TTFT prefill-first +only activates when the number of prompt-waiting slots is at or above a +threshold. Artifact: +`/home/mudler/bench/phase58_ttft_waiting_sweep/20260701_122052`. + +Pre/post md5 and op gates stayed green: MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID` +`806/806`. 
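Taken together, the Phase55, Phase57, and Phase58 knobs describe a small per-step decision. The sketch below is an illustrative Python model of that decision, not the fork's C++ scheduler code. The function and parameter names are hypothetical, and it assumes that the cap counts deferred decode rows per step and that an unset `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING` behaves like `0`.

```python
def decode_rows_to_defer(waiting_prompts, decode_rows, prefill_first=False,
                         max_defer=0, min_waiting=0):
    """Illustrative model: how many token-2+ decode rows to defer this step.

    waiting_prompts: prompts still waiting for first-token admission
    decode_rows:     decode rows ready from already-started streams
    prefill_first:   models LLAMA_TTFT_PREFILL_FIRST=1
    max_defer:       models LLAMA_TTFT_PREFILL_FIRST_MAX_DEFER (0 = unlimited)
    min_waiting:     models LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING
                     (assumed: 0 = no threshold)
    """
    if not prefill_first or waiting_prompts == 0:
        return 0  # default scheduling: nothing is deferred
    if waiting_prompts < min_waiting:
        return 0  # Phase58: too few waiting prompts to activate the policy
    if max_defer > 0:
        return min(decode_rows, max_defer)  # Phase57: per-step cap
    return decode_rows  # Phase55: defer every ready decode row
```

Under this model, the `min32+cap32` variant corresponds to `decode_rows_to_defer(w, d, True, max_defer=32, min_waiting=32)`, which defers nothing until at least 32 prompts are waiting and then defers at most 32 rows per step.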
+ +MoE `n=128`, `ptok=128`, `gen=64`: + +| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | +|---------|---------|-----------------|-------------|--------------|-------------|--------| +| default | `339.0` | `648.4` | `1542.9` | `7743.1` | `11532.5` | `24.167` | +| min24 | `339.9` | `619.3` | `1637.0` | `7326.6` | `10868.8` | `24.095` | +| min32 | `341.9` | `635.0` | `1609.6` | `7420.1` | `11054.6` | `23.950` | +| min32+cap32 | `331.2` | `631.8` | `1512.1` | `7829.2` | `11767.1` | `24.733` | + +Dense `n=128`, `ptok=168`, `gen=64`: + +| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | +|---------|---------|-----------------|-------------|--------------|-------------|--------| +| default | `140.3` | `362.7` | `639.8` | `21407.3` | `35811.6` | `58.399` | +| min24 | `140.4` | `347.6` | `658.7` | `22078.2` | `34783.3` | `58.353` | +| min32 | `139.7` | `350.2` | `650.1` | `21221.5` | `35246.3` | `58.642` | + +Decision: `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` is the best selective +scheduler A/B so far: MoE `n=128` improved aggregate, TTFT, and wall in the same +window, while dense `n=128` was roughly neutral but slightly worse on aggregate +and wall. Keep it opt-in until repeated and compared against matching vLLM h2h. + +### Phase 59 MoE min32 repeat and vLLM H2H + +Phase59 repeated the Phase58 MoE min32 point, then ran matching vLLM serving. +Artifact: +`/home/mudler/bench/phase59_moe_min32_repeat_vllm/20260701_123147`. + +Pre/post llama md5 and op gates stayed green: MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID` +`806/806`. 
+ +MoE `n=128`, `ptok=128`, `gen=64`: + +| engine / variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | +|------------------|---------|-----------------|-------------|--------------|-------------|--------| +| llama default | `336.6` | `646.7` | `1525.1` | `7798.5` | `11666.8` | `24.334` | +| llama min32 | `336.9` | `632.0` | `1567.1` | `7167.8` | `11353.4` | `24.316` | +| vLLM | `601.3` | `938.8` | `3648.7` | `2968.1` | `4871.6` | `13.563` | + +Decision: min32 repeated as a real llama.cpp scheduler QoS improvement +(`-8.1%` mean TTFT with flat aggregate and wall), but it is not a vLLM parity +lever. Llama min32 is still `0.560x` vLLM aggregate, `0.430x` vLLM prefill, +`0.673x` vLLM decode aggregate, and `2.415x` slower on mean TTFT. Keep the +scheduler knob opt-in and return parity work to the prefill / MoE compute gap. + +Phase72 broadened that min32 result to the Phase70 serving shape. Artifact: +`/home/mudler/bench/phase72_ttft_min32_serving/20260701_160730`. Gates stayed +green, but min32 regressed every tested concurrency: aggregate ratios +`0.9302`/`0.9414`/`0.9699`, decode ratios `0.9442`/`0.9570`/`0.9775`, and TTFT +ratios `1.0379`/`1.0977`/`1.0300` at `n=8/32/128`. Keep min32 opt-in only and +do not default it on GB10. + +Phase73 made the post-Phase72 next-step decision. It ran no new benchmark and +changed no llama.cpp source. Grouped-MMQ/W4A16 GB10 source work is closed: +Phase61 direct-A was the last structurally distinct W4A16 shortcut and failed +its keep gate, and Phase66 quantize plus gather was only `5.10%`. GDN backend +source work is also gated: Phase71 kept M5 as shipped, and the remaining GDN +gap is a FLA/CuteDSL-class C=64 blocked-solve/register-state implementation, +not another local reorder. The next parity evidence should be a datacenter +Blackwell same-session rerun, or a standalone GDN blocked-solve PoC before any +backend GDN source work. 
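The Phase59 parity ratios can be re-derived from that phase's table. This is a short Python check that uses only the rounded table values quoted above; the dictionary layout and variable names are illustrative. Because the inputs are already rounded, a re-derived ratio can differ from the quoted one in the last digit.

```python
# Values copied from the Phase59 MoE n=128 table.
llama_default = {"ttft_mean_ms": 7798.5}
llama_min32 = {"agg_tps": 336.9, "decode_agg_tps": 632.0,
               "prefill_tps": 1567.1, "ttft_mean_ms": 7167.8}
vllm = {"agg_tps": 601.3, "decode_agg_tps": 938.8,
        "prefill_tps": 3648.7, "ttft_mean_ms": 2968.1}

# Throughput ratios: llama / vLLM, higher is better (1.0 would be parity).
ratios = {key: llama_min32[key] / vllm[key]
          for key in ("agg_tps", "decode_agg_tps", "prefill_tps")}

# TTFT ratio: llama / vLLM, lower is better.
ttft_ratio = llama_min32["ttft_mean_ms"] / vllm["ttft_mean_ms"]

# min32 versus the llama default from the same session, as a percent change.
ttft_delta_pct = (llama_min32["ttft_mean_ms"]
                  / llama_default["ttft_mean_ms"] - 1.0) * 100.0

for name, value in ratios.items():
    print(f"{name}: {value:.3f}x of vLLM")
print(f"ttft_mean: {ttft_ratio:.3f}x of vLLM")
print(f"min32 vs default ttft_mean: {ttft_delta_pct:+.1f}%")
```

Keeping the two comparisons separate mirrors the decision text: the min32 knob is judged against the llama default for QoS, and against vLLM for parity.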
+ +### Phase 60 current W4A16 prefill profile + +Phase60 re-profiled the current W4A16 grouped MoE prefill path after the +Phase1-5 W4A16 work. Artifact: +`/home/mudler/bench/phase60_w4a16_current_profile/20260701_104915`. + +Pre/post md5 and op gates stayed green: MoE +`8cb0ce23777bf55f92f63d0292c756b0`, dense +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and `MUL_MAT_ID` +`806/806`. + +MoE prefill A/B (`npl=32`, `ntg=4`) still rejects W4A16 as an incremental +parity path: + +| path | npp512 S_PP | npp2048 S_PP | +|------|-------------|--------------| +| default FP4-MMQ | `2327.69` | `2423.20` | +| forced W4A16 | `1451.00` | `1482.76` | + +At `npp=512`, default MMQ spends `2.712s` (`39.2%`) in its main +`mul_mat_q` bucket. Forced W4A16 spends `4.142s` (`42.5%`) in +`w4a16_grouped_kernel<32,128,1,4,2>`, plus `1.094s` (`11.2%`) in +`k_get_rows_float` sorted activation gathers and `0.517s` (`5.3%`) +in `w4a16_cast_act_f32_bf16`. + +Decision: do not add another W4A16 micro-patch. Cast elimination alone cannot +close a `37-39%` S_PP loss, and the dominant loss is the grouped kernel body +plus sorted activation movement. Future W4A16 parity work must be a larger +design that changes those structures, not another metadata/body shortcut. + +### Phase 61 W4A16 direct activation kill-gate + +Phase61 implemented the larger direct-activation experiment behind +`LLAMA_W4A16_DIRECT_A=1`, consuming original `src1` and `ids_to_sorted` directly +instead of materializing `src1_sorted` and then casting it to bf16. The correct +source addressing matched `get_rows_cuda`: `ids_to_sorted` is a flat source-row +index addressed with `nb11`. The initial token/slot decode failed `b=1` op +tests; the flat-row fix passed forced direct-A `MUL_MAT_ID` `806/806`. 
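The flat-row addressing described above can be illustrated outside CUDA. This is a hedged Python sketch, not the rejected kernel: it assumes contiguous f32 rows, models `nb11` as the byte stride between consecutive `src1` rows, and uses hypothetical helper names and sizes. It shows that reading row `ids_to_sorted[i]` at byte offset `id * nb11` yields the same rows as first materializing a sorted copy.

```python
import struct
from array import array

NE10 = 4         # elements per src1 row (illustrative size)
NB11 = NE10 * 4  # assumed byte stride between consecutive f32 rows


def materialize_sorted(src1_bytes, ids_to_sorted):
    """Baseline: copy rows into a src1_sorted buffer, then read them back."""
    out = bytearray()
    for row_id in ids_to_sorted:
        out += src1_bytes[row_id * NB11:(row_id + 1) * NB11]
    return [struct.unpack_from(f"{NE10}f", out, i * NB11)
            for i in range(len(ids_to_sorted))]


def gather_direct(src1_bytes, ids_to_sorted):
    """Direct-A style: each id is a flat source-row index scaled by nb11."""
    return [struct.unpack_from(f"{NE10}f", src1_bytes, row_id * NB11)
            for row_id in ids_to_sorted]


# Six source rows holding 0.0 .. 23.0, and an illustrative gather order.
src1 = array("f", [float(v) for v in range(6 * NE10)]).tobytes()
ids = [5, 0, 3, 3]
```

The sketch only models the addressing. It says nothing about the kernel-body cost that led to the Phase61 rejection.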
+ +Artifacts: + +- default gates: `/home/mudler/bench/phase61_direct_default_gates/20260701_132057` +- A/B: `/home/mudler/bench/phase61_direct_ab/20260701_132237` + +Gates: + +- default MoE md5 `8cb0ce23777bf55f92f63d0292c756b0` +- default dense md5 `5951a5b4d624ce891e22ab5fca9bc439` +- `MUL_MAT` `1146/1146` +- `MUL_MAT_ID` `806/806` +- forced W4A16 and direct-A MoE transcripts both + `07db32c2bcb78d17a43ed18bc22705cd` + +MoE prefill A/B (`npl=32`, `ntg=4`): + +| path | npp512 S_PP | npp2048 S_PP | +|------|-------------|--------------| +| default FP4-MMQ | `2325.45` | `2423.18` | +| forced W4A16 | `1471.05` | `1502.46` | +| forced W4A16 direct-A | `1566.30` | `1605.82` | + +Decision: reject. Direct-A improved forced W4A16 by only `+6.5%` / `+6.9%` and +remained `0.67x` / `0.66x` of default FP4-MMQ, below the `+12%` and `0.75x` +keep gates. The direct kernel diff was saved to +`/tmp/phase61-w4a16-direct-a-rejected.diff` and not committed. W4A16 body +tuning is no longer the next GB10 parity lever. + +Relevant files (all absolute): `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{DECODE_SERVING_SCOPE.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,TENSORCORE_GDN_SCOPE.md,final_benchmark.csv}`, `.../README.md`, `.../patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch` (P1/P2), `.../patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch` (P7), `.../patches/paged/0031` (P4), `0025` (D1), `0018/0022` (D4/D5), `0009/0010` (D3/D6/D7); graph source `/home/mudler/_git/LocalAI/backend/cpp/llama-cpp-paged-dev/src/{models/qwen35moe.cpp,models/delta-net-base.cpp,llama-graph.cpp}`. + +### Phase 10 GDN C32 slab update + +Phase 10 tested the tempting low-conflict shortcut for #101: keep the current +M5 tensor-core GDN form, raise the chunk to `C=32`, and split the value +dimension into two `dv_tile=64` slabs to stay within shared memory. 
+ +Result: + +- The shortcut cannot be a launcher-only change. C32 requires staging + `U=T*RHS` because the existing M5 apply path relies on one 16-row tile being + held in registers before overwriting `Ud`. +- A default-off `GDN_C32_SLAB=1` candidate was built and md5-gated. +- The first candidate exposed a dense-only transcript failure on tail chunks; + root cause was copying uninitialized staged rows for `t >= Cc` back into + `Ud`. Zeroing those rows restored both canonical md5 gates: + MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense + `5951a5b4d624ce891e22ab5fca9bc439`. +- Performance regressed after correctness was fixed: + MoE 2048 S_PP `2430.32 -> 2054.86`; dense 2048 S_PP `1019.25 -> 903.73`. + +Decision: + +- **REJECT** the two-slab C32 M5 variant. +- Do not add it to the LocalAI patch stack. +- The likely blocker is duplicated A/T recomputation per value slab; future GDN + work must share that work across slabs or move to a different FLA-style + chunked design rather than repeating this env-gated shortcut. + +Artifacts: + +- `/home/mudler/bench/phase10_gdn_c32_slab/gates/` +- `/home/mudler/bench/phase10_gdn_c32_slab/ab/` +- `/home/mudler/bench/phase10_gdn_c32_slab/rejected/c32_slab_tailfix_rejected.diff` + +### Phase 11 GDN M5 QS-early update + +Phase 11 tested the smallest possible C=16 follow-up after the C32 slab +rejection: move the `QS = Qc * S0` state-boundary product earlier in the M5 +chunk loop behind `GDN_M5_QS_EARLY=1`. + +Result: + +- The candidate built on DGX and stayed md5-exact: + MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense + `5951a5b4d624ce891e22ab5fca9bc439`. +- It regressed S_PP slightly in both families: + MoE 2048 `2441.54 -> 2420.26`, dense 2048 `1021.06 -> 1015.77`. + +Decision: + +- **REJECT** QS-early. +- Do not add it to the LocalAI patch stack. +- A scheduling-only move that still performs the same QS MMA does not close the + GDN gap. 
The next GDN scope should be a real shared-A/Ai blocked-solve or + global-scratch design, not another local reorder. + +Artifacts: + +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/rejected/qs_early_rejected.diff` + +### Phase 12 GDN shared-A/Ai cost-model update + +Phase 12 scoped the next non-shortcut GDN path: compute f32 Ai once per +`(sequence, head, chunk)` and reuse it across two `dv_tile=64` value slabs. + +Cost model: + +- C16 full-width M5 uses `93,376 B` dynamic smem. +- C32 full-width would need `127,360 B`, which does not fit GB10. +- C32 slab64 fits at `94,592 B`, but Phase 10 showed it loses when A/T is + recomputed per slab. +- For `BT=32`, f32 Ai scratch at `npp=2048,npl=32` is: + - MoE H=32: `256 MiB`, with `768 MiB` total Ai write/read traffic. + - Dense H=48: `384 MiB`, with `1152 MiB` total Ai write/read traffic. + +Decision: + +- **GO** to a default-off Phase 13 prototype, not a shipped patch. +- Scope: `GDN_GLOBAL_AI32=1`, `BT=32`, f32 Ai, two `dv_tile=64` slabs. +- Reject if same-session A/B is flat/slower. If rejected, stop GDN kernel work + on GB10 rather than iterating into f16 Ai or more local reorders. + +Docs: + +- `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md` +- `docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md` +- `docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md` + +### Phase 13 GDN Global-Ai32 update + +Phase 13 implemented the Phase 12 prototype behind `GDN_GLOBAL_AI32=1`: +precompute f32 Ai once per chunk/head, then consume it from two C32 +`dv_tile=64` value slabs. + +Result: + +- Correctness passed: + MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense + `5951a5b4d624ce891e22ab5fca9bc439`. +- Performance regressed: + - MoE 2048 S_PP `2425.10 -> 2097.76`. + - Dense 2048 S_PP `1016.14 -> 918.19`. + +Decision: + +- **REJECT** Global-Ai32. 
+- Do not add `0055`. +- Stop GDN kernel work on GB10. The shortcut space is exhausted by Phase 10, + Phase 11, and Phase 13 evidence; further GDN parity work needs a different + hardware regime or a larger FLA/CuteDSL-class implementation outside this + low-conflict LocalAI patch stack. + +Artifacts: + +- `/home/mudler/bench/phase13_gdn_global_ai32/gates/` +- `/home/mudler/bench/phase13_gdn_global_ai32/ab/` +- `/home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff` + +### Phase 8 ragged MoE dispatch closure + +The remaining Phase 8 source shortcut was closed without production CUDA edits. +The live ragged serving profile showed helper metadata buckets too small to clear +the `+5%` serving A/B gate (`mm_ids=0.66%`, `gather_mmq=0.42%`). Patch `0023` +already handles the broadcast-activation NVFP4 path by quantizing unique tokens +once and gathering FP4 blocks, so a metadata-only `LLAMA_MOE_FUSED_DISPATCH` +hook would add conflict surface without attacking the dominant buckets. + +Safety rerun: + +- `MUL_MAT_ID_RAGGED_MOE`: `6/6` on CUDA0. +- Full `MUL_MAT_ID`: `806/806` on CUDA0. +- MoE transcript md5: `8cb0ce23777bf55f92f63d0292c756b0`. +- Dense transcript md5: `5951a5b4d624ce891e22ab5fca9bc439`. + +Decision: + +- Keep test patch `0053`. +- Do not add a Phase 8 production patch unless it directly reduces + `mmq_nvfp4` or activation movement without D2H id readback, new + synchronizations, or md5 drift. + +--- + +# PROFILE-VALIDATED PATH (both-engine nsys, adversarially verified Sun Jun 28 11:55:12 PM UTC 2026) + +## Prefill gap decomposition (paged 396 vs vLLM 197 us/tok) +All 4 runs ran on DGX (GB10) via ssh dgx.casa; GPU lock held+released, GPU restored idle. Model = decision MoE Qwen3.6-35B-A3B-NVFP4 (paged GGUF vs q36-35b-a3b-nvfp4-vllm). Buckets = % of GPU-kernel wall (nsys cuda_gpu_kern_sum), and per-prefill-token us. 
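The per-token figures in this section follow from two divisions: GPU-kernel seconds over prefill tokens, and a bucket's percentage applied to that total. A minimal sketch of the arithmetic, with hypothetical helper names, using the kernel times and token counts quoted in the paragraphs that follow:

```cpp
#include <cassert>
#include <cmath>

// Microseconds of GPU-kernel time per prefill token.
inline double us_per_token(double kernel_seconds, double tokens) {
    return kernel_seconds * 1e6 / tokens;
}

// Per-token cost of one bucket, given its share of the kernel wall in percent.
inline double bucket_us_per_token(double total_us_per_token, double bucket_percent) {
    return total_us_per_token * bucket_percent / 100.0;
}
```

For example, `us_per_token(6.485, 16384)` lands within rounding of the paged `395.9 us/tok`, and applying the `24.2%` GDN share to it gives roughly the `95.7 us/tok` quoted for paged GDN. Differences in the last digit come from the rounded inputs.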
+ +PAGED MoE PREFILL (npp512 ntg4 npl32, LLAMA_KV_PAGED=1 +LLAMA_MOE_FORCE_GRAPHS=1): S_PP=2417.8 t/s; kernel 6.485s/16384 tok = 395.9 us/tok. MoE-expert-GEMM(MMQ nvfp4) 26.5% | GDN 24.2% (gdn_core 17.2, gdn_gather 3.3, gdn_conv 2.7, l2norm 1.0) | layout-copy 9.8 (convert_dtype 6.3, concat 2.9) | ew-mul 8.7 | bf16-proj 8.6 | act-quant(quantize_mmq_nvfp4) 4.7 | ew-add 4.6 | silu/sigmoid-gate 4.3 | norms 3.6 | MoE-DISPATCH(argsort 0.4+mm_ids 1.1+gather_mmq 0.7) 2.2 | get_rows 1.0 | FA 0.6 | softmax 0.05 | scatter 0.06. + +vLLM MoE PREFILL (32x512, 5 reps): S_PP=4925.8 t/s; kernel 16.138s/81920 tok = 197.0 us/tok. SURPRISE: on sm_121 vLLM runs experts as Marlin W4A16 (FP4->bf16 dequant + bf16 GEMM), NOT fused-FP4 cutlass; projections are FP8 (sm89_xmma_e4m3). ew-glue(torch elementwise) 31.7% | MoE-expert-GEMM(Marlin) 24.6 | GDN(FLA chunk_* + causal_conv) 18.5 | bf16/fp8-proj 10.4 | reduce(cumsum/softmax) 5.2 | gate 2.3 | act-quant(scaled_fp8) 1.7 | layernorm 1.7 | MoE-DISPATCH(gather/align/count_sort/argsort) 1.4 | FA 1.1. + +Per-token gap decomposition (paged-vLLM, of 198.9 us/tok total): GDN +59.2 (~30%), MoE-GEMM +56.5 (~28%), ew/layout/glue net +21.4 (~11%), act-quant +15.2 (~8%), bf16-proj +13.7 (~7%), gate +12.4 (~6%), norms +11.1 (~6%), dispatch +5.9 (~3%). + +## Decode picture (host-bound, not kernel/graph-reuse) +3 decode profiles. KEY: paged decode KERNELS are 5.4x more GPU-efficient than vLLM's, but paged static decode is HOST-BOUND (GPU ~16% busy); vLLM is GPU-bound (99% busy) on a slow recurrent GDN. They tie at static-wide-128 (paged 782 vs vLLM ~819 t/s pure decode) via opposite regimes. + +PAGED DECODE-SERVING (staggered 128 clients, llama-server, steady 22s window, 83.5% GPU-busy): MoE/FFN-GEMM 40.7% (mmq 34.2 + gemv_moe 4.6 + gemv 1.4) | bf16-proj 22.8 (mul_mat_f 11.1 + nvjet 9.1 + cutlass 2.5) | GDN 21.2 (gdn_core 19.9) | act-quant 2.8 | layout 2.1 | get_rows 2.0 | ew-mul 2.0 | FA 1.6 | norms 1.2 | MoE-DISPATCH 1.1 | scatter 0.2 | softmax 0.1. 
+ +PAGED STATIC npl=128 lockstep (PP128+TG256, ~16% GPU-busy, HOST-BOUND): kernel 7.83s/49152 tok=159 us/tok, S_TG=782 t/s. MoE-GEMM 37.5 | GDN 21.6 | layout 9.6 | bf16-proj 9.2 | ew-mul 5.5 | act-quant 4.1 | ew-add 3.4 | norms 2.5 | dispatch 1.8 | FA 0.55. cudaStreamSynchronize=43.4s (84% of API/87% of wall) vs 7.83s GPU kernel => GPU idle ~84%. + +PAGED STATIC npl=1 (batch-1): kernel 0.20s, MEMOPS 0.44s (68% of kern+mem), cudaStreamSync 66.7% => latency/BW-bound, GPU ~4% busy. + +vLLM 128-wide offline (PT128 GEN256, 99% GPU-busy): kernel 42.56s/49152 tok=866 us/tok. GDN 45.2% (fused_recurrent_gated_delta decode 42.8!) | MoE-GEMM(Marlin) 36.2 | bf16/fp8-proj 6.6 | ew-glue 6.3 | FA 2.1 | reduce 1.4 | dispatch 0.7. + +Per-token decode (paged static-128 | vLLM | ratio): MoE-GEMM 59.7|313.5 paged 5.3x faster; GDN 34.3|391.7 paged 11.4x faster; bf16-proj 14.7|57.2; total 159|866 paged 5.4x less GPU. + +H1 verdict (false): the stated mechanism - 'MUL_MAT_ID per-useful-token time growing static->serving from grouped-GEMV collapse' - is REFUTED at the kernel level. The grouped path engages correctly: at width-1 the MoE expert path is GEMV (mul_mat_vec_q), and at width>=~16 it switches to grouped MMQ (mul_mat_q nvfp4) - npl=128 is 37% MMQ/~0 GEMV, serving is 34% MMQ + 6% gemv_moe. It does NOT collapse to per-token GEMV. What IS confirmed (the real H1 mechanism) is HOST-SIDE SERIALIZATION: cudaStreamSynchronize dominates the static-decode wall - npl=1 66.7% of API time (~89% of wall), npl=128 84.3% of API time (43.4s sync vs 7.83s GPU kernel => GPU ~84% idle); the serving window logged 40,902 cudaStreamSynchronize. The grouped MMQ also runs at ragged small-M tiles (mmq_x = 16/24/32/40/48/64/80/96) because tokens-per-expert is tiny -> low tensor-core utilization (small-M MMQ, not a GEMV collapse). Mechanistically the device->host sync to read MoE routing before launching per-expert GEMMs is the serializer (task D1/#104 'no host-sync MoE path'). 
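The host-bound reading above rests on two simple ratios, sketched here with hypothetical helper names. The wall time is not reported directly, so the usage note derives it from the statement that the `43.4s` of `cudaStreamSynchronize` is about 87% of wall; treat that as an estimate.

```cpp
#include <cassert>
#include <cmath>

// Fraction of wall time the GPU spends executing kernels.
inline double gpu_busy_fraction(double kernel_seconds, double wall_seconds) {
    return kernel_seconds / wall_seconds;
}

// How many times less per-token GPU time `ours_us` is than `theirs_us`.
inline double per_token_ratio(double ours_us, double theirs_us) {
    return theirs_us / ours_us;
}
```

With `wall = 43.4 / 0.87`, `gpu_busy_fraction(7.83, wall)` is about 0.16, matching the "~16% busy" figure, and `per_token_ratio(159, 866)` is about 5.4, matching the total per-token ratio.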
+ +THE BIG DECODE PICTURE (most important finding): paged and vLLM have OPPOSITE decode profiles. Paged decode kernels are 5.4x more GPU-efficient (159 vs 866 us/tok) but paged static decode is host-bound (GPU ~16% busy, serial SSM+sampling+MoE-dispatch host loop); vLLM is GPU-bound (99% busy) on a recurrent GDN kernel that is 11x slower per token, but it saturates the GPU via CUDA graphs. They tie at static-wide-128 (782 vs ~819 t/s). At SERVING the paged GPU rises to 83.5% busy because overlapping request streams hide the host stalls - so the serving lever for paged is NOT faster decode kernels (they're already fast/idle) but (a) removing host serialization / graphing the whole step incl MoE dispatch, and (b) chunked-prefill: paged's 2x-slower prefill steals serving cycles during continuous batching (the gen-80-128 serving config was ~55% prefill work; the nsys'd run2 gen-256-512 ~25%). vLLM bf16/fp8 projections are a bigger paged decode bucket than expected (22.8% serving) because batch-1/small-batch bf16 proj uses mul_mat_f (11.1%) + nvjet (9.1%). + +Methodology/scope: profiled with nsys --trace=cuda + cuda_gpu_kern_sum; no NVTX in either engine so buckets are by kernel-name regex (bucketer at dgx:/home/mudler/bench/bucket2.py; reports at dgx:/home/mudler/bench/profgap/). Shared elementwise (k_bin_bcast add/mul, torch elementwise) straddle resid/MoE-fanin/GDN-glue and are bucketed by dominant use with that caveat; vLLM's torch_ew (31.7% prefill) is GDN-glue+MoE-combine+resid and is genuinely ambiguous. The dense Qwen3.6-27B-NVFP4 was NOT separately profiled (time budget; the MoE decision-model contains both MoE experts AND the same GDN/attention stack, fully answering A/B/C); GDN findings generalize to dense. vLLM decode here is offline 128-wide (continuous-batched), not staggered-server, so the cross-engine serving ratio is taken from prior h2h benches (~55-80% of vLLM at npl 64-128), not a fresh staggered vLLM run. 
Cross-engine 'gap' numbers are GPU-kernel-time per token (apples for GPU-bound prefill; for decode the host-bound vs GPU-bound asymmetry means wall-throughput parity hides a 5.4x GPU-efficiency paged advantage). + +## Decision +### moe_prefill_lever +BETTER GROUPED GEMM KERNEL (D2/#105), NOT P5 dispatch fusion. The profile settles this empirically: explicit MoE dispatch (argsort+softmax+get_rows+set_rows+mm_ids+gather_mmq) is only 8.6 us/tok (~2-3% of the paged prefill wall; +5.9 us/tok = ~3% of the gap). P5 is REJECTED as a standalone lever - and the premise it rests on ("vLLM fuses dispatch into the GEMM epilogue") is FALSE on GB10: vLLM runs Marlin W4A16 with its OWN separate dispatch kernels (count_and_sort_expert_tokens/moe_align/vectorized_gather/moe_sum, 2.7 us/tok). Dispatch is cheap in both engines; epilogue-fusing it buys ~3% at most. + +The real lever is the grouped GEMM: paged grouped-MMQ MUL_MAT_ID is 105 us/tok vs vLLM Marlin 48.5 us/tok = 2.16x slower, ~28% of the prefill gap (+56.5 us/tok). It does NOT collapse to GEMV - the grouped path engages correctly; it loses because ragged small-M-per-expert tiles (mmq_x 16-96) under-utilize tensor cores. + +Is it winnable given MMQ already beat our native kernel? YES in principle, but ONLY via a kernel approach we have NOT yet tried correctly. Both prior attempts failed for identifiable reasons: 0033 did dequant as a SEPARATE global-memory pass then cuBLAS (lost to fused FP4 MMQ 29-49%); 0034 native FP4-MMA W4A4 PoC did NOT hold in-backend. vLLM proves the winning shape on THIS EXACT silicon (sm_121, Marlin bf16 fallback - no native FP4) is IN-REGISTER FP4->bf16 dequant feeding bf16 mma.sync with cp.async pipelining + large/grouped tiles, and W4A16 means ZERO activation-quant. That second point is load-bearing: act-quant (quantize_mmq_nvfp4) is +15.2 us/tok = ~8% of the gap that vLLM STRUCTURALLY does not pay because it is W4A16. 
So a Marlin-style W4A16 grouped MoE-prefill GEMM is a combined ~36% prefill lever (GEMM 28% + act-quant 8%), and it is a DIFFERENT kernel from both rejects (not a separate-pass dequant, not native FP4-MMA). The README's "W4A16 rejected" verdict was DECODE-only (BW-bound, wash); prefill is compute-bound and the act-quant pass is M-proportional, so W4A16 for prefill is unaudited and the most promising structural fix. GATE: must beat MMQ in a SEPARATELY-BUILT in-backend A/B at the real ragged-small-M MoE-prefill shapes (NOT a standalone PoC - the exact lesson from rejecting native FP4-MMA); bit-exact via KL-gate for the bf16-dequant reduction-order change (paged-MoE 8cb0ce23 precedent). + +### gdn_build_go +True + +### gdn_rationale +GO on #101, with a Phase-1 in-backend kill-gate. The profile makes the regime check the scope doc demanded (TENSORCORE_GDN_SCOPE Phase 0) pass cleanly: (1) GDN is the #1 SINGLE contributor to the prefill gap at +59.2 us/tok (~30% of the gap), edging out MoE-GEMM (+56.5). (2) The cost is MATH-predominant, not layout/host: gdn_core (the hand-written FP32 chunked-scan, NOT tensor-core) is 17.2% of the wall; GDN-attributable layout (gdn_gather 3.3 + head-concat 2.9 + a convert_dtype slice) is only ~6-7% (~1/4). So tensor cores attack the dominant 3/4, and the 1/4 layout folds into the same fused kernel. (3) The headroom is MEASURED on identical silicon: vLLM's FLA chunked GDN runs the SAME math at 36.5 us/tok vs paged 95.7 = 2.62x, confirming the scope's "mma absorbs the O(C^2) intra-chunk flops so the Cx state-BW cut becomes a net win" mechanism. (4) Bonus dual payoff: it also chips the decode serial-SSM residual and, via continuous batching, the serving-decode lever (prefill steals ~25-55% of serving cycles). + +CONDITION (empirical guard, not PoC-optimism): 0031's chunking math was correct yet came back 22% SLOWER in-backend, and we JUST rejected native FP4-MMA because its standalone PoC win did not hold in-backend. 
So GO funds Phase 1 ONLY (two Gram products on mma.cuh tf32 tiles at fixed C=16/1-block-SM); it must move S_PP in a SEPARATELY-BUILT in-backend A/B vs the sequential scan. If Phase 1 is flat, the occupancy/register wall is the blocker, not the reductions - NO-GO the multi-week Phase 2/3 build. Precision gate is the KL-gate (tf32 default, 3xtf32 ladder), greedy md5 stability, plus the adversarial g in [-20,-1e-4] decay op case; ship opt-in default-off until a separately-built A/B beats sequential. + +### top_decode_lever +D1/#104 - the no-host-sync MoE decode path + full-step CUDA-graph capture (graph the WHOLE decode step INCLUDING MoE dispatch), targeting the device->host MoE-routing readback. Ranked decisively by the profile, NOT by raw GPU-bucket size: the dominant decode cost is not a GPU kernel at all - it is cudaStreamSynchronize, 84% of the static-decode wall (43.4s sync vs 7.83s GPU kernel; npl=1 66.7%, npl=128 84.3% of API time; 40,902 syncs in the serving window). Root cause = the device->host sync to read MoE routing before launching per-expert GEMMs. Paged decode KERNELS are already 5.4x more GPU-efficient than vLLM's and the GPU sits 84% idle in static decode, so D1 is the only decode lever that attacks the actual bottleneck. + +D2/D3/D4 for DECODE are all REJECTED by the methodology's "a faster kernel off the critical path benches flat" rule: D2 fused MoE decode GEMM - paged MoE-GEMM is already 5.3x faster/token than vLLM (59.7 vs 313.5 us/tok); making it faster just adds idle. D3 FA-split - FA is 1.6% of decode-serving wall / 0.55% static (H2 refuted; the hybrid is mostly GDN with few full-attn layers); not a lever. D4 GDN-width-adaptive - paged GDN decode is already 11.4x faster/token than vLLM (34 vs 392); H3 confirmed (flat across width, no amortization) but the recurrence is NOT the bottleneck, host serialization is - an occupancy retune yields ~nothing until the host loop is gone. 
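The "faster kernel off the critical path benches flat" rule can be bounded with an Amdahl-style estimate. The sketch below is a simple serial model, not a measurement: it assumes the host portion of the wall does not shrink or overlap when kernels get faster, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Amdahl-style bound: throughput gain when only the GPU-kernel share of the
// wall becomes `kernel_speedup` times faster and nothing else changes.
inline double throughput_gain(double gpu_busy_fraction, double kernel_speedup) {
    return 1.0 / ((1.0 - gpu_busy_fraction) + gpu_busy_fraction / kernel_speedup);
}
```

At the static-decode figure of roughly 16% GPU-busy, doubling every decode kernel's speed yields at most about a 1.09x gain under this model, and even infinitely fast kernels cap out below 1.2x, which is why the host serialization is ranked ahead of any decode-kernel work.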
+ +Honest scope on D1's payoff: at HIGH-concurrency serving the paged GPU is already 83.5% busy because overlapping request streams hide the host stalls, so D1's win concentrates at LOW-concurrency / latency / batch-1 (GPU 4-16% busy), where it is large. The complementary serving-throughput lever is FIXING PREFILL (GDN #101 + MoE GEMM D2/#105): paged's 2x-slower prefill steals serving cycles under continuous batching (~25-55% of the serving step is prefill work) - so the prefill levers ARE also serving-decode levers. GATE: separately-built in-backend A/B (compiled-in, so a runtime flag does NOT isolate it) showing higher static/low-concurrency decode t/s with no high-concurrency-serving regression; bit-exact greedy md5 (graph replay re-issues identical kernels). + +### next_3_levers + +Post-Phase71 supersession: this ranked list is historical. `0047` already +ships the M5 tensor-core GDN path default-on under paged KV, Phase71 +revalidated it against sequential-disabled and serial-chunked baselines, and +Phase10/11/13 rejected the smaller follow-up GDN reorders. Phase41/43 closed +D1 on the current GB10 path unless a fresh route trace proves a host-sync +fallback returned. Phase60/61/66 rejected another small W4A16/direct-A or +quant/gather pass. Phase72 rejected min32 as a broad serving default, and +Phase73 set the active queue to datacenter-Blackwell rerun readiness or a +standalone GDN blocked-solve PoC before source work. Treat the list below as +pre-Phase60 planning context, not an active queue. + +Ranked, each with its pass-gate: + +1) #101 TENSOR-CORE mma CHUNKED GDN PREFILL KERNEL (prefill, GO). #1 prefill-gap contributor (+59 us/tok, ~30%), ~3/4 math (tensor cores help) with 2.62x measured headroom on identical silicon, 1/4 layout folds in; also helps serving decode. 
GATE: Phase-0 regime already satisfied by this profile; Phase-1 two-Gram-product PoC must move S_PP in a SEPARATELY-BUILT in-backend A/B vs sequential (flat => NO-GO the multi-week build); then KL-gate (tf32/3xtf32) + greedy md5 + adversarial-decay op test; ship opt-in default-off until A/B beats sequential. + +2) D1/#104 NO-HOST-SYNC MoE DECODE PATH + FULL-STEP CUDA-GRAPH CAPTURE (decode). Attacks the cudaStreamSynchronize that is 84% of the static-decode wall (the MoE-routing device->host readback). Lowest effort, bit-exact, highest-confidence decode win (concentrated at low-concurrency/latency). GATE: separately-built in-backend A/B (not a runtime-flag toggle) - higher static/low-concurrency decode t/s, no high-concurrency-serving regression; bit-exact greedy md5. + +3) D2/#105 MARLIN-STYLE W4A16 GROUPED MoE PREFILL GEMM (prefill). In-register FP4->bf16 dequant + bf16 mma.sync, cp.async, large grouped tiles - captures the 28% MoE-GEMM gap AND the 8% act-quant gap (W4A16 has no activation-quant), = ~36% combined; this is exactly what vLLM does on sm_121. Ranked #3 because of HIGH risk: two prior in-backend GEMM attempts failed (0033 separate-pass dequant, 0034 native FP4-MMA PoC didn't hold). GATE: must beat MMQ in a SEPARATELY-BUILT in-backend A/B at ragged-small-M MoE-prefill shapes (NOT a standalone PoC); bit-exact via KL-gate (bf16-dequant reduction order). + +Explicitly REJECTED/deprioritized (record so they aren't re-run): P5 dispatch fusion (~3%, and the "vLLM fuses dispatch" premise is false on GB10); D2-for-decode, D3 FA-split, D4 GDN-width-adaptive (their kernels are already 5-11x faster than vLLM and GPU-idle -> bench flat); padded/fixed-slot decode (already tested+rejected, commit b028c81e). 
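Every lever above carries a pass-gate. The gate with explicit numeric thresholds in this document is the Phase 61 keep gate (`+12%` over forced W4A16 and `0.75x` of default FP4-MMQ), and it can be sketched as a small predicate. The struct and function names are hypothetical, and requiring both thresholds at once is an assumption about how the gate is applied.

```cpp
#include <cassert>
#include <cmath>

struct KeepGate {
    double min_gain_over_forced;  // e.g. 0.12 for the +12% gate
    double min_ratio_to_default;  // e.g. 0.75 for the 0.75x gate
};

// Relative S_PP gain of the candidate over the forced path it refines.
inline double gain_over(double candidate, double forced) {
    return candidate / forced - 1.0;
}

// Candidate S_PP as a fraction of the default path.
inline double ratio_to(double candidate, double baseline_default) {
    return candidate / baseline_default;
}

// Assumption: a candidate is kept only if it clears both thresholds.
inline bool passes_keep_gate(double candidate, double forced,
                             double baseline_default, KeepGate gate) {
    return gain_over(candidate, forced) >= gate.min_gain_over_forced &&
           ratio_to(candidate, baseline_default) >= gate.min_ratio_to_default;
}
```

Feeding in the Phase 61 `npp512` row (`1566.30`, `1471.05`, `2325.45`) gives a gain of about 6.5% and a ratio of about 0.67, so the predicate returns false, consistent with the recorded rejection.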
+ +### notes +Empirical discipline applied throughout (per the just-rejected native FP4-MMA): every funded lever is gated on a SEPARATELY-BUILT in-backend A/B, never a standalone PoC - 0031 (chunking math correct, -22% in-backend) and 0034 (PoC win, didn't hold) are the two cautionary precedents. Two compiled-in levers (#101, D1) cannot be isolated by a runtime flag, so they need build-vs-build A/B (methodology hard rule). + +Two profile surprises that reshape the directions: (a) vLLM on sm_121 is NOT native FP4 - it runs Marlin W4A16 (FP4->bf16 in-register dequant + bf16 GEMM) for experts and FP8 projections. So the winnable MoE-prefill GEMM is a W4A16-Marlin-style kernel (which also erases our 8% act-quant tax), not another native-FP4 attempt. (b) Decode is a regime asymmetry, not a kernel gap: paged decode kernels are 5.4x more GPU-efficient than vLLM's but paged static decode is HOST-BOUND (GPU 84% idle on cudaStreamSynchronize); vLLM is GPU-bound at 99% on a recurrence 11x slower/token. They tie at static-wide-128. Hence "make decode kernels faster" is the wrong instinct (benches flat); "remove host serialization / graph the full step" (D1) and "fix prefill so it stops stealing serving cycles" (#101, D2) are the decode-serving levers. + +Cross-cutting: the prefill levers (#101 GDN, D2 MoE GEMM) double as serving-decode levers because continuous batching interleaves ~25-55% prefill work into the serving step. GDN edges MoE-GEMM as the top prefill pick (bigger gap, cleaner math mechanism, 2.6x proven headroom, lower in-backend risk, dual payoff). + +All numbers from the both-engine nsys profile (cuda_gpu_kern_sum buckets, bucketer dgx:/home/mudler/bench/bucket2.py, reports dgx:/home/mudler/bench/profgap/); caveats: no NVTX (kernel-name regex buckets); shared elementwise straddles resid/MoE-fanin/GDN-glue; vLLM decode is offline 128-wide, not staggered-server. 
Relevant repo paths (absolute): /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/{TENSORCORE_GDN_SCOPE.md,TENSORCORE_GDN_BUILD_PLAN.md,VLLM_PARITY_LEVER_MAP.md,PREFILL_GEMM_SCOPE.md,PREFILL_GEMM_RESULTS.md,DECODE_SERVING_SCOPE.md,PAGED_BITEXACT_NOTE.md,final_benchmark.csv}; patches dir .../patches/paged/ (existing 0031 chunked-GDN serial, 0033 dequant->cuBLAS rejected, 0034 native FP4-MMA, 0040/0041 S1/S3 decode-graph, 0042 fused residual+RMSNorm); methodology /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/.agents/vllm-parity-methodology.md. diff --git a/backend/cpp/llama-cpp-localai-paged/docs/final_benchmark.csv b/backend/cpp/llama-cpp-localai-paged/docs/final_benchmark.csv new file mode 100644 index 000000000000..10bde6a1cbc0 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/docs/final_benchmark.csv @@ -0,0 +1,25 @@ +model,engine,npl,decode_agg_tps,prefill_tps +q36-27b-nvfp4,llama-stock,8,68.3,937.7 +q36-27b-nvfp4,llama-stock,32,119.9,885.2 +q36-27b-nvfp4,llama-stock,64,142.8,885.1 +q36-27b-nvfp4,llama-stock,128,155.1,887.2 +q36-27b-nvfp4,llama-patched,8,85.3,915.1 +q36-27b-nvfp4,llama-patched,32,211.9,919.0 +q36-27b-nvfp4,llama-patched,64,305.2,923.5 +q36-27b-nvfp4,llama-patched,128,382.1,922.9 +q36-27b-nvfp4,vllm,8,70.4,2096.2 +q36-27b-nvfp4,vllm,32,211.8,2182.6 +q36-27b-nvfp4,vllm,64,309.1,2088.9 +q36-27b-nvfp4,vllm,128,418.8,1929.1 +q36-35b-a3b-nvfp4,llama-stock,8,186.7,1501.5 +q36-35b-a3b-nvfp4,llama-stock,32,267.4,1856.8 +q36-35b-a3b-nvfp4,llama-stock,64,320.5,1949.5 +q36-35b-a3b-nvfp4,llama-stock,128,347.2,1995.4 +q36-35b-a3b-nvfp4,llama-patched,8,230.3,1510.3 +q36-35b-a3b-nvfp4,llama-patched,32,466.4,1969.2 +q36-35b-a3b-nvfp4,llama-patched,64,622.4,2122.8 +q36-35b-a3b-nvfp4,llama-patched,128,784.3,2177.0 +q36-35b-a3b-nvfp4,vllm,8,256.5,5186.5 +q36-35b-a3b-nvfp4,vllm,32,500.8,6223.4 +q36-35b-a3b-nvfp4,vllm,64,686.1,5926.5 +q36-35b-a3b-nvfp4,vllm,128,882.2,5300.5 diff --git 
a/backend/cpp/llama-cpp-localai-paged/docs/paged-burst-bench.cpp b/backend/cpp/llama-cpp-localai-paged/docs/paged-burst-bench.cpp new file mode 100644 index 000000000000..6df252fdb364 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/docs/paged-burst-bench.cpp @@ -0,0 +1,217 @@ +// Paged-pool burst-degradation repro (patch 0024). DEV SCAFFOLDING ONLY. +// +// Reproduces, at the libllama level, the two host-side defects behind the +// "later lower-npl prefill collapses, decode fine, restart cures it" benchmark +// signature: +// +// * RECLAMATION GAP (Fix-1): a partial tail seq_rm(seq, p0>0, -1) - exactly +// what llama-server issues on every reused slot - frees the kv-cache CELLS +// but the paged manager keeps owning the trailing BLOCKS. The manager's +// free pool silently shrinks. Test A measures the reclaimed-block delta. +// +// * FRAGMENTATION / NO COMPACTION (Fix-2): a high-fan-out burst that allocates +// many sequences and frees them in a scrambled order leaves the free queue a +// scrambled permutation of physical block ids. A later low-npl prefill then +// pops physically scattered blocks, so its KV scatter-write + in-kernel +// paged-attention gather lose locality and prefill throughput collapses; +// decode (single-token append) barely notices. Test B times an npl8 prefill +// on a FRESH pool vs an npl8 prefill AFTER a scrambling burst+drain. +// +// PASS (post-fix): Test A reclaims ceil(PP/bs) - ceil(KEEP/bs) trailing blocks on the +// partial seq_rm (0 pre-fix); Test B's post-burst npl8 prefill_tps is within ~10% +// of the fresh npl8 and num_free returns to the pristine value after the drain. +// +// Run with LLAMA_KV_PAGED=1. Env: BURST_NSLOT(64) NPL(8) PP(512) KEEP(256) +// GEN(4) PAGED_NGL(99). All sequences use distinct content so nothing is shared. + +#include "llama.h" +#include "paged-prefix-api.h" + +#include <chrono> +#include <clocale> +#include <cstdint> +#include <cstdio> +#include <cstdlib> +#include <cstring> + +static int env_i(const char * k, int dflt) { const char * v = getenv(k); return v ?
atoi(v) : dflt; } + +using clk = std::chrono::steady_clock; +static double secs(clk::time_point a, clk::time_point b) { + return std::chrono::duration<double>(b - a).count(); +} + +struct Ctx { llama_context * ctx; llama_memory_t mem; llama_batch batch; int n_vocab; }; + +// Deterministic, content-distinct token for (seq, pos): keeps every sequence's +// blocks unique so no cross-request prefix sharing masks the accounting. +static llama_token tok_of(int seq, int pos, int n_vocab) { + return (llama_token) (((seq * 1000003 + pos * 131 + 7) % (n_vocab - 200)) + 100); +} + +// Prefill n tokens of seq at [pos0, pos0+n) in one ubatch (n <= n_batch). +// Returns wall seconds (sync'd). +static double prefill(Ctx & C, int seq, int pos0, int n) { + clk::time_point t0 = clk::now(); + C.batch.n_tokens = 0; + for (int j = 0; j < n; ++j) { + int i = C.batch.n_tokens; + C.batch.token[i] = tok_of(seq, pos0 + j, C.n_vocab); + C.batch.pos[i] = pos0 + j; + C.batch.n_seq_id[i] = 1; + C.batch.seq_id[i][0]= seq; + C.batch.logits[i] = (j + 1 == n) ? 1 : 0; + C.batch.n_tokens++; + } + if (llama_decode(C.ctx, C.batch)) { fprintf(stderr, "prefill decode failed seq=%d\n", seq); return -1; } + llama_synchronize(C.ctx); + return secs(t0, clk::now()); +} + +// One decode step (single token) for seq at pos.
+static void decode1(Ctx & C, int seq, int pos) { + C.batch.n_tokens = 1; + C.batch.token[0] = tok_of(seq, pos, C.n_vocab); + C.batch.pos[0] = pos; C.batch.n_seq_id[0] = 1; C.batch.seq_id[0][0] = seq; C.batch.logits[0] = 1; + if (llama_decode(C.ctx, C.batch)) fprintf(stderr, "decode1 failed seq=%d\n", seq); +} + +int main(int argc, char ** argv) { + std::setlocale(LC_NUMERIC, "C"); + const char * model_path = nullptr; + for (int i = 1; i < argc; ++i) if (!strcmp(argv[i], "-m") && i + 1 < argc) model_path = argv[++i]; + if (!model_path) { fprintf(stderr, "usage: %s -m model.gguf\n", argv[0]); return 2; } + + const int NSLOT = env_i("BURST_NSLOT", 64); + const int NPL = env_i("NPL", 8); + const int PP = env_i("PP", 512); + const int KEEP = env_i("KEEP", 256); + const int GEN = env_i("GEN", 4); + const int ngl = env_i("PAGED_NGL", 99); + const bool paged = getenv("LLAMA_KV_PAGED") != nullptr; + + ggml_backend_load_all(); + llama_model_params mp = llama_model_default_params(); + mp.n_gpu_layers = ngl; + llama_model * model = llama_model_load_from_file(model_path, mp); + if (!model) { fprintf(stderr, "model load failed\n"); return 1; } + const llama_vocab * vocab = llama_model_get_vocab(model); + const int n_vocab = llama_vocab_n_tokens(vocab); + + // Pool sized for the burst plus headroom so the burst fits but a later npl + // run draws from whatever the burst's churn left behind. 
+ const long cells = (long) (NSLOT + NPL + 4) * (PP + GEN + 16); + llama_context_params cp = llama_context_default_params(); + cp.n_ctx = (uint32_t) cells; + cp.n_batch = (uint32_t) (PP + 16); + cp.n_ubatch = (uint32_t) (PP + 16); + cp.n_seq_max = NSLOT + NPL + 2; + cp.kv_unified = true; // one unified stream-0 pool -> num_free(ctx) is the whole pool + cp.no_perf = true; + llama_context * ctx = llama_init_from_model(model, cp); + if (!ctx) { fprintf(stderr, "ctx init failed (cells=%ld)\n", cells); return 1; } + + Ctx C; C.ctx = ctx; C.mem = llama_get_memory(ctx); C.n_vocab = n_vocab; + C.batch = llama_batch_init(cp.n_batch, 0, 1); + + printf("== paged-burst-bench == paged=%d NSLOT=%d NPL=%d PP=%d KEEP=%d GEN=%d n_ctx=%ld\n", + paged, NSLOT, NPL, PP, KEEP, GEN, cells); + + llama_memory_clear(C.mem, true); + const long F_start = paged_prefix_api::num_free_global(); + + // ---- Test A: Fix-1 reclamation gap on a partial tail seq_rm -------------- + { + prefill(C, 0, 0, PP); + const long f_after_prefill = paged_prefix_api::num_free_global(); + llama_memory_seq_rm(C.mem, 0, KEEP, -1); // partial tail removal + const long f_after_rm = paged_prefix_api::num_free_global(); + llama_memory_seq_rm(C.mem, 0, -1, -1); // full free -> pristine + const long f_after_full = paged_prefix_api::num_free_global(); + const long bs = 16; + const long expect = (PP + bs - 1)/bs - (KEEP + bs - 1)/bs; // trailing blocks + printf("[TEST-A Fix-1] start=%ld afterPrefill=%ld afterPartialRm=%ld reclaimed=%ld " + "(expect %ld post-fix, 0 pre-fix) afterFullFree=%ld\n", + F_start, f_after_prefill, f_after_rm, f_after_rm - f_after_prefill, expect, f_after_full); + } + + // ---- Test B: fragmentation -> npl prefill collapse ----------------------- + // Fresh npl prefill baseline on a pristine pool. 
+ llama_memory_clear(C.mem, true); + double tps_fresh; + { + clk::time_point t0 = clk::now(); + long ntok = 0; + for (int s = 0; s < NPL; ++s) { double d = prefill(C, s, 0, PP); if (d < 0) return 1; ntok += PP; } + tps_fresh = ntok / secs(t0, clk::now()); + for (int s = 0; s < NPL; ++s) llama_memory_seq_rm(C.mem, s, -1, -1); + } + const long F_pristine = paged_prefix_api::num_free_global(); + + // High-fan-out burst: allocate NSLOT sequences, each prefilled + a few decode + // steps (mixed alloc), then drain them in a scrambled order (odd ids first, + // then even, each truncated before the full free) so the free queue becomes a + // scrambled permutation - the fragmentation the bug never compacts. + for (int s = 0; s < NSLOT; ++s) { + if (prefill(C, NPL + s, 0, PP) < 0) return 1; + for (int g = 0; g < GEN; ++g) decode1(C, NPL + s, PP + g); + } + const long F_during_burst = paged_prefix_api::num_free_global(); + // Drain: partial tail seq_rm (the reused-slot pattern) then full free, in a + // scrambled slot order to scramble the physical free order. + for (int parity = 1; parity >= 0; --parity) + for (int s = 0; s < NSLOT; ++s) if ((s & 1) == parity) { + llama_memory_seq_rm(C.mem, NPL + s, KEEP, -1); // partial (Fix-1 path) + llama_memory_seq_rm(C.mem, NPL + s, -1, -1); // full free + } + const long F_after_drain = paged_prefix_api::num_free_global(); + + // Post-burst npl prefill: pops from the (pre-fix scrambled / post-fix + // defragged) free queue. + double tps_post; + { + clk::time_point t0 = clk::now(); + long ntok = 0; + for (int s = 0; s < NPL; ++s) { double d = prefill(C, s, 0, PP); if (d < 0) return 1; ntok += PP; } + tps_post = ntok / secs(t0, clk::now()); + for (int s = 0; s < NPL; ++s) llama_memory_seq_rm(C.mem, s, -1, -1); + } + + const double ratio = tps_fresh > 0 ? tps_post / tps_fresh : 0; + printf("[TEST-B frag] num_free: start=%ld pristine=%ld duringBurst=%ld afterDrain=%ld " + "(afterDrain==pristine? 
%s)\n", + F_start, F_pristine, F_during_burst, F_after_drain, + F_after_drain == F_pristine ? "YES" : "NO"); + printf("[TEST-B frag] prefill_tps fresh=%.1f post-burst=%.1f ratio=%.3f " + "(PASS if >=0.90)\n", tps_fresh, tps_post, ratio); + + // ---- Test C: idle-slot retention leak -> reclaim (the Fix-3 scenario) ----- + // Burst NSLOT sequences and leave them IDLE (stock llama-server keeps an idle + // slot's KV; the blocks are stranded). F_idle shows the depleted pool a later + // low-npl run would see. Then full-seq_rm each (exactly what Fix-3's + // prompt_clear() issues at slot.release): F_reclaimed must return to pristine. + llama_memory_clear(C.mem, true); + // Touch the pool once so the manager exists, then read the full-pool size + // (num_free is 0 while no manager is registered). + if (prefill(C, 0, 0, 16) < 0) return 1; + llama_memory_seq_rm(C.mem, 0, -1, -1); + const long F_pre_c = paged_prefix_api::num_free_global(); + for (int s = 0; s < NSLOT; ++s) { if (prefill(C, NPL + s, 0, PP) < 0) return 1; } + const long F_idle = paged_prefix_api::num_free_global(); + for (int s = 0; s < NSLOT; ++s) llama_memory_seq_rm(C.mem, NPL + s, -1, -1); // Fix-3 release + const long F_reclaimed = paged_prefix_api::num_free_global(); + printf("[TEST-C idle] pristine=%ld idle_after_burst=%ld (leaked=%ld) reclaimed=%ld " + "(returns_to_fresh? %s)\n", + F_pre_c, F_idle, F_pre_c - F_idle, F_reclaimed, + F_reclaimed == F_pre_c ? "YES" : "NO"); + + printf("RESULT paged=%d frag_fix2_ratio=%.3f drain_numfree_returns=%s idle_reclaim_returns=%s\n", + paged, ratio, + F_after_drain == F_pristine ? "YES" : "NO", + F_reclaimed == F_pre_c ? 
"YES" : "NO"); + + llama_batch_free(C.batch); + llama_free(ctx); + llama_model_free(model); + return 0; +} diff --git a/backend/cpp/llama-cpp-localai-paged/docs/paged-reclaim-unit.cpp b/backend/cpp/llama-cpp-localai-paged/docs/paged-reclaim-unit.cpp new file mode 100644 index 000000000000..e81b1c663f64 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/docs/paged-reclaim-unit.cpp @@ -0,0 +1,59 @@ +// Host-side unit test for the paged-pool burst-reclaim fix (patch 0024). +// Compiles paged-kv-manager.cpp directly; no ggml / llama / GPU dependency. +// +// Fix-1 PagedKVManager::truncate(seq, n_keep) reclaims the trailing blocks +// beyond ceil(n_keep/bs) (ref-counted), so a partial tail seq_rm no +// longer strands blocks whose cells were cleared. +// Fix-2 defrag_free_pool() relinks the free queue into ascending block-id +// order once the pool is fully idle, undoing a burst's scrambled frees +// so a later prefill pops physically contiguous blocks again. + +#include "paged-kv-manager.h" +#include <cstdio> + +using paged::PagedKVManager; + +int main() { + int rc = 0; + + // ---- Fix-1: truncate reclaims the trailing block suffix ----------------- + { + PagedKVManager m(/*num_blocks=*/64, /*block_size=*/16, /*caching=*/true); + const size_t f0 = m.num_free_blocks(); // 63 (block 0 reserved as null) + m.allocate(0, 512); // ceil(512/16)=32 blocks + const size_t f1 = m.num_free_blocks(); // 31 + m.truncate(0, 256); // keep ceil(256/16)=16, free 16 + const size_t f2 = m.num_free_blocks(); // 47 + printf("[unit Fix-1] free=%zu alloc512=%zu truncate256=%zu reclaimed=%zu (expect 16)\n", + f0, f1, f2, f2 - f1); + if (f2 - f1 != 16) rc = 1; + m.truncate(0, 16); // keep 1 block, free 15 more + const size_t f3 = m.num_free_blocks(); // 62 + printf("[unit Fix-1] truncate16=%zu (expect %zu)\n", f3, f0 - 1); + if (f3 != f0 - 1) rc = 1; + m.free(0); + if (m.num_free_blocks() != f0) { printf("[unit Fix-1] free mismatch\n"); rc = 1; } + } + + // ---- Fix-2: defrag restores ascending 
popleft order --------------------- + { + PagedKVManager m(/*num_blocks=*/64, /*block_size=*/16, /*caching=*/false); + for (int s = 0; s < 8; ++s) m.allocate(s, 16); // pop blocks 1..8 + const int scrambled[8] = {3, 7, 1, 5, 0, 6, 2, 4}; // free out of order + for (int i = 0; i < 8; ++i) m.free(scrambled[i]); + m.defrag_free_pool(); // all idle -> compact + m.allocate(100, 16 * 3); // pop 3 blocks + const auto bt = m.block_table(100); + bool asc = true; + printf("[unit Fix-2] post-defrag block_table:"); + for (size_t i = 0; i < bt.size(); ++i) { + printf(" %d", bt[i]); + if (i && bt[i] < bt[i - 1]) asc = false; + } + printf(" ascending=%s (expect YES)\n", asc ? "YES" : "NO"); + if (!asc) rc = 1; + } + + printf("UNIT %s\n", rc == 0 ? "PASS" : "FAIL"); + return rc; +} diff --git a/backend/cpp/llama-cpp-localai-paged/docs/qwen36_decode_overview.png b/backend/cpp/llama-cpp-localai-paged/docs/qwen36_decode_overview.png new file mode 100644 index 000000000000..bec4bbd41b78 Binary files /dev/null and b/backend/cpp/llama-cpp-localai-paged/docs/qwen36_decode_overview.png differ diff --git a/backend/cpp/llama-cpp-localai-paged/docs/qwen36_dense_decode_vs_npl.png b/backend/cpp/llama-cpp-localai-paged/docs/qwen36_dense_decode_vs_npl.png new file mode 100644 index 000000000000..1dd5cf000ed1 Binary files /dev/null and b/backend/cpp/llama-cpp-localai-paged/docs/qwen36_dense_decode_vs_npl.png differ diff --git a/backend/cpp/llama-cpp-localai-paged/docs/qwen36_moe_decode_vs_npl.png b/backend/cpp/llama-cpp-localai-paged/docs/qwen36_moe_decode_vs_npl.png new file mode 100644 index 000000000000..680fd10db427 Binary files /dev/null and b/backend/cpp/llama-cpp-localai-paged/docs/qwen36_moe_decode_vs_npl.png differ diff --git a/backend/cpp/llama-cpp-localai-paged/package.sh b/backend/cpp/llama-cpp-localai-paged/package.sh new file mode 100755 index 000000000000..ac30467d0621 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/package.sh @@ -0,0 +1,66 @@ +#!/bin/bash + +# Script to 
copy the appropriate libraries based on architecture +# This script is used in the final stage of the Dockerfile + +set -e + +CURDIR=$(dirname "$(realpath $0)") +REPO_ROOT="${CURDIR}/../../.." + +# Create lib directory +mkdir -p $CURDIR/package/lib + +cp -avrf $CURDIR/llama-cpp-localai-paged-* $CURDIR/package/ +cp -rfv $CURDIR/run.sh $CURDIR/package/ + +# Bundle the ggml shared backends from the CPU_ALL_VARIANTS build into package/lib. ggml +# discovers the per-microarch libggml-cpu-*.so by scanning the executable directory, which +# (via the bundled lib/ld.so that run.sh launches through) resolves to lib/. See the +# matching comment in backend/cpp/llama-cpp/package.sh. No-op on the fallback/ROCm builds. +if [ -d "$CURDIR/ggml-shared-libs" ]; then + echo "Bundling ggml shared backends (CPU_ALL_VARIANTS)..." + cp -avf $CURDIR/ggml-shared-libs/*.so* $CURDIR/package/lib/ +fi + +# Detect architecture and copy appropriate libraries +if [ -f "/lib64/ld-linux-x86-64.so.2" ]; then + # x86_64 architecture + echo "Detected x86_64 architecture, copying x86_64 libraries..." + cp -arfLv /lib64/ld-linux-x86-64.so.2 $CURDIR/package/lib/ld.so + cp -arfLv /lib/x86_64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6 + cp -arfLv /lib/x86_64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1 + cp -arfLv /lib/x86_64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6 + cp -arfLv /lib/x86_64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6 + cp -arfLv /lib/x86_64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1 + cp -arfLv /lib/x86_64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2 + cp -arfLv /lib/x86_64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1 + cp -arfLv /lib/x86_64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0 +elif [ -f "/lib/ld-linux-aarch64.so.1" ]; then + # ARM64 architecture + echo "Detected ARM64 architecture, copying ARM64 libraries..." 
+ cp -arfLv /lib/ld-linux-aarch64.so.1 $CURDIR/package/lib/ld.so + cp -arfLv /lib/aarch64-linux-gnu/libc.so.6 $CURDIR/package/lib/libc.so.6 + cp -arfLv /lib/aarch64-linux-gnu/libgcc_s.so.1 $CURDIR/package/lib/libgcc_s.so.1 + cp -arfLv /lib/aarch64-linux-gnu/libstdc++.so.6 $CURDIR/package/lib/libstdc++.so.6 + cp -arfLv /lib/aarch64-linux-gnu/libm.so.6 $CURDIR/package/lib/libm.so.6 + cp -arfLv /lib/aarch64-linux-gnu/libgomp.so.1 $CURDIR/package/lib/libgomp.so.1 + cp -arfLv /lib/aarch64-linux-gnu/libdl.so.2 $CURDIR/package/lib/libdl.so.2 + cp -arfLv /lib/aarch64-linux-gnu/librt.so.1 $CURDIR/package/lib/librt.so.1 + cp -arfLv /lib/aarch64-linux-gnu/libpthread.so.0 $CURDIR/package/lib/libpthread.so.0 +else + echo "Error: Could not detect architecture" + exit 1 +fi + +# Package GPU libraries based on BUILD_TYPE +GPU_LIB_SCRIPT="${REPO_ROOT}/scripts/build/package-gpu-libs.sh" +if [ -f "$GPU_LIB_SCRIPT" ]; then + echo "Packaging GPU libraries for BUILD_TYPE=${BUILD_TYPE:-cpu}..." + source "$GPU_LIB_SCRIPT" "$CURDIR/package/lib" + package_gpu_libs +fi + +echo "Packaging completed successfully" +ls -liah $CURDIR/package/ +ls -liah $CURDIR/package/lib/ diff --git a/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh b/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh new file mode 100755 index 000000000000..e069765a42bd --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh @@ -0,0 +1,439 @@ +#!/usr/bin/env bash +set -euo pipefail + +usage() { + cat <<'EOF' +Usage: paged-current-serving-snapshot.sh [--summarize-gates ART] + +Run a current-stack paged llama.cpp vs vLLM MoE serving snapshot on DGX. + +This harness uses the clean llama.cpp mirror by default, not stale development +trees. It runs pre/post paged inference gates, then a same-session serving +comparison with the h2h client. 
+ +Environment overrides: + SRC llama.cpp source dir (default: ~/llama-phase6-source) + BUILD_DIR llama.cpp CMake build dir (default: $SRC/build-cuda) + BIN llama.cpp build bin dir (default: $SRC/build-cuda/bin) + MODEL paged GGUF path (default: ~/bench/q36-35b-a3b-nvfp4.gguf) + VLLM_MODEL vLLM model dir (default: ~/bench/q36-35b-a3b-nvfp4-vllm) + SERVED_MODEL_NAME OpenAI model name used by llama-server, vLLM, and h2h (default: q36) + H2H h2h client (default: ~/bench/h2h_cli3.py) + ART artifact dir (default: ~/bench/phase_current_serving_snapshot/) + NPL concurrency list (default: "8 32 128") + PTOK prompt filler words (default: 128) + GEN generated tokens (default: 64) + CTX llama-server context (default: 131072) + PARALLEL llama-server parallel slots (default: 128) + BATCH llama-server logical batch (default: 2048) + UBATCH llama-server physical batch (default: 512) + LLAMA_PORT llama-server port (default: 8098) + LLAMA_READY_ATTEMPTS llama-server readiness attempts, one per second (default: 240) + VLLM_PORT vLLM port (default: 8000) + VLLM_BIN vLLM executable (default: ~/vllm-bench/bin/vllm) + VLLM_READY_ATTEMPTS vLLM readiness attempts, one per second (default: 600) + VLLM_GPU_MEMORY_UTILIZATION vLLM --gpu-memory-utilization (default: 0.85) + VLLM_MAX_MODEL_LEN vLLM --max-model-len (default: 4096) + VLLM_MAX_NUM_SEQS vLLM --max-num-seqs (default: 256) + VLLM_TENSOR_PARALLEL_SIZE vLLM --tensor-parallel-size (default: 1) + VLLM_EXTRA_ARGS whitespace-split extra args appended to vLLM serve (default: empty) + SKIP_GATES=1 to skip pre/post paged inference gates + DRY_RUN=1 validate inputs/preflight, write hardware.txt, and print commands without running servers + +Options: + --summarize-gates ART write ART/gate_summary.tsv from existing gate_pre/gate_post artifacts +EOF +} + +SUMMARY_GATES_ART="" +case "${1:-}" in + -h|--help) + usage + exit 0 + ;; + --summarize-gates) + if [[ -z "${2:-}" ]]; then + usage >&2 + exit 2 + fi + SUMMARY_GATES_ART="$2" + ;; + "") + ;; + 
*) + usage >&2 + exit 2 + ;; +esac + +SRC=${SRC:-"$HOME/llama-phase6-source"} +BUILD_DIR=${BUILD_DIR:-"$SRC/build-cuda"} +BIN=${BIN:-"$BUILD_DIR/bin"} +MODEL=${MODEL:-"$HOME/bench/q36-35b-a3b-nvfp4.gguf"} +VLLM_MODEL=${VLLM_MODEL:-"$HOME/bench/q36-35b-a3b-nvfp4-vllm"} +SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-q36} +H2H=${H2H:-"$HOME/bench/h2h_cli3.py"} +ART=${ART:-"$HOME/bench/phase_current_serving_snapshot/$(date +%Y%m%d_%H%M%S)"} +NPL=${NPL:-"8 32 128"} +PTOK=${PTOK:-128} +GEN=${GEN:-64} +CTX=${CTX:-131072} +PARALLEL=${PARALLEL:-128} +BATCH=${BATCH:-2048} +UBATCH=${UBATCH:-512} +LLAMA_PORT=${LLAMA_PORT:-8098} +LLAMA_READY_ATTEMPTS=${LLAMA_READY_ATTEMPTS:-240} +VLLM_PORT=${VLLM_PORT:-8000} +VLLM_BIN=${VLLM_BIN:-"$HOME/vllm-bench/bin/vllm"} +VLLM_READY_ATTEMPTS=${VLLM_READY_ATTEMPTS:-600} +VLLM_GPU_MEMORY_UTILIZATION=${VLLM_GPU_MEMORY_UTILIZATION:-0.85} +VLLM_MAX_MODEL_LEN=${VLLM_MAX_MODEL_LEN:-4096} +VLLM_MAX_NUM_SEQS=${VLLM_MAX_NUM_SEQS:-256} +VLLM_TENSOR_PARALLEL_SIZE=${VLLM_TENSOR_PARALLEL_SIZE:-1} +VLLM_EXTRA_ARGS=${VLLM_EXTRA_ARGS:-} +SKIP_GATES=${SKIP_GATES:-0} +DRY_RUN=${DRY_RUN:-0} +MOE_MD5_EXPECTED=8cb0ce23777bf55f92f63d0292c756b0 +DENSE_MD5_EXPECTED=5951a5b4d624ce891e22ab5fca9bc439 + +LOCK_DIR="$HOME/gpu_bench_lock" +OWNER="$LOCK_DIR/owner" +SERVER_PID="" + +log() { + printf '[%s] %s\n' "$(date -Is)" "$*" | tee -a "$ART/run.log" +} + +require_path() { + if [[ ! 
-e "$1" ]]; then + echo "missing required path: $1" >&2 + exit 2 + fi +} + +preflight() { + mkdir -p "$ART" + local docker_count local_ai compute owner + docker_count=$(docker ps -q | wc -l) + local_ai=$(docker ps --format "{{.Names}}" | grep -c local-ai-worker || true) + compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed '/^$/d' | wc -l) + owner="FREE-no-lock-file" + if [[ -f "$OWNER" ]]; then + owner=$(cat "$OWNER") + fi + { + echo "docker=$docker_count" + echo "local_ai_worker=$local_ai" + echo "compute=$compute" + echo "$owner" + } | tee "$ART/preflight.txt" + [[ "$docker_count" == "0" ]] + [[ "$local_ai" == "0" ]] + [[ "$compute" == "0" ]] + case "$owner" in + FREE*|FREE-no-lock-file) ;; + *) echo "GPU lock is busy: $owner" >&2; exit 3 ;; + esac +} + +write_hardware_report() { + local out="$ART/hardware.txt" + local gpu_name hardware_class + + gpu_name=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -1 || true) + hardware_class="unknown" + case "$gpu_name" in + *B200*|*B100*|*GB200*) hardware_class="datacenter_blackwell" ;; + *H200*|*H100*) hardware_class="datacenter_other" ;; + *GB10*|*"DGX Spark"*|*RTX*|*"PRO 6000"*) hardware_class="gb10_or_workstation_blackwell" ;; + esac + + { + echo "nvidia_smi_L:" + nvidia-smi -L || true + echo + echo "nvidia_smi_query:" + if ! 
nvidia-smi --query-gpu=name,driver_version,memory.total,compute_cap --format=csv,noheader; then + nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader || true + fi + echo + echo "gpu_name=$gpu_name" + echo "hardware_class=$hardware_class" + case "$hardware_class" in + datacenter_blackwell) + echo "parity_note=datacenter Blackwell hardware: full parity methodology can choose new levers" + ;; + datacenter_other) + echo "parity_note=datacenter non-Blackwell hardware: do not generalize GB10 parity decisions" + ;; + gb10_or_workstation_blackwell) + echo "parity_note=GB10/workstation Blackwell hardware: GB10 shortcut closures apply unless new evidence says otherwise" + ;; + *) + echo "parity_note=unknown hardware: classify before making parity claims" + ;; + esac + } > "$out" + log "hardware report: $out" +} + +acquire_lock() { + mkdir -p "$LOCK_DIR" + echo "codex-current-serving-snapshot $(date +%s)" > "$OWNER" +} + +release_lock() { + stop_server_pid + pkill -9 -f "[l]lama-server.*--port $LLAMA_PORT" >/dev/null 2>&1 || true + pkill -9 -u "$(id -u)" -f "[v]llm serve" >/dev/null 2>&1 || true + mkdir -p "$LOCK_DIR" + echo "FREE released-by-codex-current-serving-snapshot $(date +%s)" > "$OWNER" +} + +stop_server_pid() { + if [[ -n "$SERVER_PID" ]]; then + kill "$SERVER_PID" >/dev/null 2>&1 || true + for _ in $(seq 1 30); do + if ! kill -0 "$SERVER_PID" >/dev/null 2>&1; then + break + fi + sleep 1 + done + if kill -0 "$SERVER_PID" >/dev/null 2>&1; then + kill -9 "$SERVER_PID" >/dev/null 2>&1 || true + fi + wait "$SERVER_PID" >/dev/null 2>&1 || true + SERVER_PID="" + fi +} + +wait_http() { + local url="$1" + local pattern="$2" + local log_file="$3" + local health="$4" + local attempts="$5" + for _ in $(seq 1 "$attempts"); do + if curl --max-time 2 -fsS "$url" > "$health" 2>"$health.err" && grep -q "$pattern" "$health"; then + return 0 + fi + if [[ -n "$SERVER_PID" ]] && ! 
kill -0 "$SERVER_PID" >/dev/null 2>&1; then + tail -120 "$log_file" >&2 || true + return 1 + fi + sleep 1 + done + tail -120 "$log_file" >&2 || true + return 1 +} + +run_gate() { + local name="$1" + if [[ "$SKIP_GATES" == "1" ]]; then + log "skipping $name inference gate" + return + fi + log "running $name inference gate" + ART="$ART/gate_$name" "$HOME/paged-inference-gates.sh" > "$ART/gate_$name.log" 2>&1 + cat "$ART/gate_$name.log" | tee -a "$ART/run.log" +} + +run_paged() { + local arm_dir="$ART/paged" + mkdir -p "$arm_dir" + log "starting paged current-stack server" + cd "$BIN" + env LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 \ + ./llama-server \ + -m "$MODEL" -ngl 99 -fa on -c "$CTX" -b "$BATCH" -ub "$UBATCH" \ + --parallel "$PARALLEL" --host 127.0.0.1 --port "$LLAMA_PORT" --no-webui \ + > "$arm_dir/server.log" 2>&1 & + SERVER_PID=$! + wait_http "http://127.0.0.1:$LLAMA_PORT/health" "ok" "$arm_dir/server.log" "$arm_dir/health.json" "$LLAMA_READY_ATTEMPTS" + python3 "$H2H" --url "http://127.0.0.1:$LLAMA_PORT/v1/completions" \ + --model "$SERVED_MODEL_NAME" -n 8 --ptok "$PTOK" --gen 16 --nonce "warm_paged_$(date +%s)" --no-cache >/dev/null + for n in $NPL; do + log "paged n=$n" + python3 "$H2H" --url "http://127.0.0.1:$LLAMA_PORT/v1/completions" \ + --model "$SERVED_MODEL_NAME" -n "$n" --ptok "$PTOK" --gen "$GEN" \ + --nonce "paged_${n}_$(date +%s)" --no-cache > "$arm_dir/n${n}.json" + cat "$arm_dir/n${n}.json" | tee -a "$ART/run.log" + done + stop_server_pid + sleep 3 +} + +run_vllm() { + local arm_dir="$ART/vllm" + local extra_args=() + mkdir -p "$arm_dir" + export PATH="$(dirname "$VLLM_BIN"):$PATH" + export VLLM_LOGGING_LEVEL=${VLLM_LOGGING_LEVEL:-INFO} + export HF_HUB_OFFLINE=${HF_HUB_OFFLINE:-1} + if [[ -n "$VLLM_EXTRA_ARGS" ]]; then + read -r -a extra_args <<< "$VLLM_EXTRA_ARGS" + fi + log "starting vLLM server" + nohup env \ + -u VLLM_MODEL -u VLLM_BIN -u VLLM_READY_ATTEMPTS \ + -u VLLM_GPU_MEMORY_UTILIZATION -u VLLM_MAX_MODEL_LEN -u 
VLLM_MAX_NUM_SEQS \ + -u VLLM_TENSOR_PARALLEL_SIZE -u VLLM_EXTRA_ARGS \ + "$VLLM_BIN" serve "$VLLM_MODEL" \ + --served-model-name "$SERVED_MODEL_NAME" --gpu-memory-utilization "$VLLM_GPU_MEMORY_UTILIZATION" --max-model-len "$VLLM_MAX_MODEL_LEN" \ + --max-num-seqs "$VLLM_MAX_NUM_SEQS" --host 127.0.0.1 --port "$VLLM_PORT" --tensor-parallel-size "$VLLM_TENSOR_PARALLEL_SIZE" \ + "${extra_args[@]}" \ + > "$arm_dir/server.log" 2>&1 & + SERVER_PID=$! + wait_http "http://127.0.0.1:$VLLM_PORT/v1/models" "$SERVED_MODEL_NAME" "$arm_dir/server.log" "$arm_dir/models.json" "$VLLM_READY_ATTEMPTS" + python3 "$H2H" --url "http://127.0.0.1:$VLLM_PORT/v1/completions" \ + --model "$SERVED_MODEL_NAME" -n 8 --ptok "$PTOK" --gen 16 --nonce "warm_vllm_$(date +%s)" --no-cache >/dev/null + for n in $NPL; do + log "vllm n=$n" + python3 "$H2H" --url "http://127.0.0.1:$VLLM_PORT/v1/completions" \ + --model "$SERVED_MODEL_NAME" -n "$n" --ptok "$PTOK" --gen "$GEN" \ + --nonce "vllm_${n}_$(date +%s)" --no-cache > "$arm_dir/n${n}.json" + cat "$arm_dir/n${n}.json" | tee -a "$ART/run.log" + done + stop_server_pid + pkill -9 -u "$(id -u)" -f "[v]llm serve" >/dev/null 2>&1 || true + sleep 5 +} + +write_summary() { + python3 - "$ART" <<'PY' | tee "$ART/summary.tsv" +import json +import sys +from pathlib import Path + +art = Path(sys.argv[1]) +rows = [] +for arm in ("paged", "vllm"): + for path in sorted((art / arm).glob("n*.json")): + data = json.loads(path.read_text()) + rows.append((arm, data["n"], data["agg_tps"], data["decode_agg_tps"], + data["decode_perseq_tps"], data["prefill_tps"], + data["ttft_mean_ms"], data["wall_s"])) + +print("arm\tn\tagg_tps\tdecode_agg_tps\tdecode_perseq_tps\tprefill_tps\tttft_mean_ms\twall_s") +for row in rows: + print("\t".join(str(x) for x in row)) + +by_key = {(row[0], row[1]): row for row in rows} +print("\nratio\tn\tpaged_decode_over_vllm\tpaged_perseq_over_vllm\tpaged_agg_over_vllm\tpaged_ttft_over_vllm") +for n in sorted({row[1] for row in rows}): + paged = 
by_key.get(("paged", n)) + vllm = by_key.get(("vllm", n)) + if not paged or not vllm: + continue + print(f"ratio\t{n}\t{paged[3]/vllm[3]:.4f}\t{paged[4]/vllm[4]:.4f}\t{paged[2]/vllm[2]:.4f}\t{paged[6]/vllm[6]:.4f}") +PY +} + +write_gate_summary() { + python3 - "$ART" "$MOE_MD5_EXPECTED" "$DENSE_MD5_EXPECTED" <<'PY' | tee "$ART/gate_summary.tsv" +import re +import sys +from pathlib import Path + +art = Path(sys.argv[1]) +expected = { + "moe": sys.argv[2], + "dense": sys.argv[3], +} +ansi = re.compile(r"\x1b\[[0-9;]*m") +bad = False + +print("phase\tcheck\tstatus\tactual\texpected\tdetails") + +for phase in ("pre", "post"): + gate_dir = art / f"gate_{phase}" + if not gate_dir.exists(): + print(f"{phase}\tall\tskipped\t\t\t{gate_dir} missing") + continue + + for name, want in expected.items(): + md5_path = gate_dir / f"{name}.md5" + if not md5_path.exists(): + print(f"{phase}\t{name}_md5\tmissing\t\t{want}\t{md5_path} missing") + bad = True + continue + got = md5_path.read_text().split()[0] + status = "ok" if got == want else "mismatch" + if status != "ok": + bad = True + print(f"{phase}\t{name}_md5\t{status}\t{got}\t{want}\t{md5_path}") + + op_paths = sorted(gate_dir.glob("op_*.txt")) + if not op_paths: + print(f"{phase}\top\tmissing\t\t\tno op_*.txt files") + bad = True + continue + + for path in op_paths: + op = path.stem.removeprefix("op_") + text = ansi.sub("", path.read_text(errors="replace")) + passed = re.search(r"(\d+)/(\d+) tests passed", text) + backend_ok = re.search(r"Backend CUDA0:\s+OK", text) + if passed: + actual = f"{passed.group(1)}/{passed.group(2)}" + status = "ok" if passed.group(1) == passed.group(2) and backend_ok else "fail" + else: + actual = "" + status = "missing" + if status != "ok": + bad = True + print(f"{phase}\top_{op}\t{status}\t{actual}\tall\t{path}") + +if bad: + sys.exit(6) +PY +} + +if [[ -n "$SUMMARY_GATES_ART" ]]; then + ART="$SUMMARY_GATES_ART" + require_path "$ART" + write_gate_summary + exit 0 +fi + +require_path "$SRC" 
+require_path "$BIN/llama-server" +require_path "$BIN/llama-completion" +require_path "$BIN/test-backend-ops" +require_path "$MODEL" +require_path "$VLLM_MODEL" +require_path "$H2H" +require_path "$VLLM_BIN" +require_path "$HOME/paged-inference-gates.sh" + +preflight +write_hardware_report +log "artifact=$ART" +log "source=$(git -C "$SRC" log --oneline -1)" + +if [[ "$DRY_RUN" == "1" ]]; then + log "dry run only; commands validated" + log "would build: cmake --build $BUILD_DIR --target llama-server llama-completion test-backend-ops -j8" + log "served model: SERVED_MODEL_NAME=$SERVED_MODEL_NAME" + log "readiness: LLAMA_READY_ATTEMPTS=$LLAMA_READY_ATTEMPTS VLLM_READY_ATTEMPTS=$VLLM_READY_ATTEMPTS" + log "would run paged NPL=[$NPL] PTOK=$PTOK GEN=$GEN" + log "would run vLLM NPL=[$NPL] PTOK=$PTOK GEN=$GEN" + log "vLLM config: VLLM_GPU_MEMORY_UTILIZATION=$VLLM_GPU_MEMORY_UTILIZATION VLLM_MAX_MODEL_LEN=$VLLM_MAX_MODEL_LEN VLLM_MAX_NUM_SEQS=$VLLM_MAX_NUM_SEQS VLLM_TENSOR_PARALLEL_SIZE=$VLLM_TENSOR_PARALLEL_SIZE VLLM_EXTRA_ARGS=[$VLLM_EXTRA_ARGS]" + exit 0 +fi + +log "building llama-server, llama-completion, and test-backend-ops" +cmake --build "$BUILD_DIR" --target llama-server llama-completion test-backend-ops -j 8 \ + > "$ART/build.log" 2>&1 + +run_gate pre +acquire_lock +trap release_lock EXIT +run_paged +run_vllm +release_lock +trap - EXIT +run_gate post +write_gate_summary +write_summary +log "artifacts: $ART" diff --git a/backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh b/backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh new file mode 100755 index 000000000000..bbe4149e3d63 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh @@ -0,0 +1,136 @@ +#!/usr/bin/env bash +set -euo pipefail + +if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then + cat <<'EOF' +Usage: paged-inference-gates.sh + +Run the LocalAI paged llama.cpp inference safety gates on a DGX checkout. 
+ +Environment: + BIN llama.cpp build bin dir (default: ~/llama-phase6-source/build-cuda/bin) + MOE MoE GGUF path (default: ~/bench/q36-35b-a3b-nvfp4.gguf) + DENSE Dense GGUF path (default: ~/bench/q36-27b-nvfp4.gguf) + ART artifact dir (default: ~/bench/paged_inference_gates/) + OPS comma-separated test-backend-ops filters (default: MUL_MAT,MUL_MAT_ID) + EXTRA_ENV extra env assignments for completion gates, e.g. "GDN_TC=5" + +Expected md5: + MoE paged: 8cb0ce23777bf55f92f63d0292c756b0 + Dense paged: 5951a5b4d624ce891e22ab5fca9bc439 +EOF + exit 0 +fi + +MOE_MD5_EXPECTED=8cb0ce23777bf55f92f63d0292c756b0 +DENSE_MD5_EXPECTED=5951a5b4d624ce891e22ab5fca9bc439 + +BIN=${BIN:-"$HOME/llama-phase6-source/build-cuda/bin"} +MOE=${MOE:-"$HOME/bench/q36-35b-a3b-nvfp4.gguf"} +DENSE=${DENSE:-"$HOME/bench/q36-27b-nvfp4.gguf"} +OPS=${OPS:-MUL_MAT,MUL_MAT_ID} +ART=${ART:-"$HOME/bench/paged_inference_gates/$(date +%Y%m%d_%H%M%S)"} +EXTRA_ENV=${EXTRA_ENV:-} + +require_file() { + if [[ ! -e "$1" ]]; then + echo "missing required path: $1" >&2 + exit 2 + fi +} + +check_idle() { + if command -v docker >/dev/null 2>&1; then + local docker_count + docker_count=$(docker ps -q | wc -l) + if [[ "$docker_count" != "0" ]]; then + echo "docker containers are running: $docker_count" >&2 + docker ps >&2 + exit 3 + fi + + local local_ai_worker + local_ai_worker=$(docker ps --format "{{.Names}}" | grep -c local-ai-worker || true) + if [[ "$local_ai_worker" != "0" ]]; then + echo "local-ai-worker container is running" >&2 + exit 3 + fi + fi + + if command -v nvidia-smi >/dev/null 2>&1; then + local compute_count + compute_count=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l) + if [[ "$compute_count" != "0" ]]; then + echo "GPU compute processes are already running: $compute_count" >&2 + nvidia-smi >&2 + exit 3 + fi + fi + + local owner_file="$HOME/gpu_bench_lock/owner" + if [[ -f "$owner_file" ]]; then + local owner + owner=$(cat "$owner_file") + if [[ -n "$owner" && 
"$owner" != FREE* ]]; then + echo "GPU lock is owned: $owner" >&2 + exit 3 + fi + fi +} + +run_completion_gate() { + local name=$1 + local model=$2 + local expected=$3 + local out="$ART/${name}.txt" + local err="$ART/${name}.err" + local md5_file="$ART/${name}.md5" + + env LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 $EXTRA_ENV \ + "$BIN/llama-completion" -m "$model" -ngl 99 -fa on -c 4096 \ + --temp 0 --seed 1 -n 48 -p "The capital of France is" \ + < /dev/null >"$out" 2>"$err" + + md5sum "$out" >"$md5_file" + local actual + actual=$(awk '{print $1}' "$md5_file") + if [[ "$actual" != "$expected" ]]; then + echo "$name md5 mismatch: got $actual expected $expected" >&2 + echo "artifacts: $ART" >&2 + exit 4 + fi + echo "$name md5 OK: $actual" +} + +run_op_gate() { + local op=$1 + local out="$ART/op_${op}.txt" + "$BIN/test-backend-ops" test -b CUDA0 -o "$op" -j 1 >"$out" 2>&1 + if ! grep -q "Backend CUDA0: .*OK" "$out"; then + echo "$op gate failed" >&2 + tail -80 "$out" >&2 + echo "artifacts: $ART" >&2 + exit 5 + fi + grep -E "[0-9]+/[0-9]+ tests passed|Backend CUDA0" "$out" | tail -2 +} + +mkdir -p "$ART" +require_file "$BIN/llama-completion" +require_file "$BIN/test-backend-ops" +require_file "$MOE" +require_file "$DENSE" +check_idle + +run_completion_gate moe "$MOE" "$MOE_MD5_EXPECTED" +run_completion_gate dense "$DENSE" "$DENSE_MD5_EXPECTED" + +IFS=',' read -r -a op_list <<<"$OPS" +for op in "${op_list[@]}"; do + op=${op//[[:space:]]/} + [[ -n "$op" ]] || continue + run_op_gate "$op" +done + +echo "paged inference gates OK" +echo "artifacts: $ART" diff --git a/backend/cpp/llama-cpp-localai-paged/paged-mtp-serving-bench.sh b/backend/cpp/llama-cpp-localai-paged/paged-mtp-serving-bench.sh new file mode 100755 index 000000000000..be4ef58b9261 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/paged-mtp-serving-bench.sh @@ -0,0 +1,200 @@ +#!/usr/bin/env bash +set -euo pipefail + +usage() { + cat <<'EOF' +Usage: paged-mtp-serving-bench.sh + +Runs a direct 
llama-server serving A/B on DGX: + baseline: no speculative decoding + mtp: --spec-type draft-mtp + +Environment overrides: + SRC llama.cpp source dir (default: ~/llama-phase6-source) + BIN binary dir (default: $SRC/build-cuda/bin) + MODEL MoE GGUF path (default: ~/bench/q36-35b-a3b-nvfp4.gguf) + ART artifact dir (default: ~/bench/phase15_mtp_serving/) + PORT server port (default: 8097) + NPL space-separated list of concurrency values (default: "8 32 128") + PTOK prompt filler words for h2h_cli3.py (default: 128) + GEN max generated tokens (default: 128) + CTX server context (default: 131072) + PARALLEL server parallel slots (default: 128) + BATCH server logical batch size (default: 2048) + UBATCH server physical batch size (default: 512) + SKIP_GATES=1 to skip pre/post paged inference gates +EOF +} + +if [[ "${1:-}" == "-h" || "${1:-}" == "--help" ]]; then + usage + exit 0 +fi + +SRC=${SRC:-"$HOME/llama-phase6-source"} +BIN=${BIN:-"$SRC/build-cuda/bin"} +MODEL=${MODEL:-"$HOME/bench/q36-35b-a3b-nvfp4.gguf"} +ART=${ART:-"$HOME/bench/phase15_mtp_serving/$(date +%Y%m%d_%H%M%S)"} +PORT=${PORT:-8097} +NPL=${NPL:-"8 32 128"} +PTOK=${PTOK:-128} +GEN=${GEN:-128} +CTX=${CTX:-131072} +PARALLEL=${PARALLEL:-128} +BATCH=${BATCH:-2048} +UBATCH=${UBATCH:-512} +SKIP_GATES=${SKIP_GATES:-0} + +LOCK_DIR="$HOME/gpu_bench_lock" +OWNER="$LOCK_DIR/owner" +SERVER_PID="" + +log() { + printf '[%s] %s\n' "$(date -Is)" "$*" | tee -a "$ART/run.log" +} + +preflight() { + mkdir -p "$ART" + local docker_count local_ai compute owner + docker_count=$(docker ps -q | wc -l) + local_ai=$(docker ps --format "{{.Names}}" | grep -c local-ai-worker || true) + compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed '/^$/d' | wc -l) + owner="FREE-no-lock-file" + if [[ -f "$OWNER" ]]; then + owner=$(cat "$OWNER") + fi + { + echo "docker=$docker_count" + echo "local_ai_worker=$local_ai" + echo "compute=$compute" + echo "$owner" + } | tee "$ART/preflight.txt" + [[ "$docker_count" == "0" ]] + [[ 
"$local_ai" == "0" ]] + [[ "$compute" == "0" ]] + case "$owner" in + FREE*|FREE-no-lock-file) ;; + *) echo "GPU lock is busy: $owner" >&2; exit 2 ;; + esac +} + +acquire_lock() { + mkdir -p "$LOCK_DIR" + echo "codex-phase15-mtp-serving-bench $(date +%s)" > "$OWNER" +} + +release_lock() { + if [[ -n "$SERVER_PID" ]]; then + kill "$SERVER_PID" >/dev/null 2>&1 || true + wait "$SERVER_PID" >/dev/null 2>&1 || true + SERVER_PID="" + fi + mkdir -p "$LOCK_DIR" + echo "FREE released-by-codex-phase15-mtp-serving-bench $(date +%s)" > "$OWNER" +} + +wait_server() { + local health="$1" + for _ in $(seq 1 180); do + if curl -fsS "http://127.0.0.1:$PORT/health" > "$health" 2>"$health.err"; then + return 0 + fi + if ! kill -0 "$SERVER_PID" 2>/dev/null; then + return 1 + fi + sleep 1 + done + return 1 +} + +stop_server() { + if [[ -n "$SERVER_PID" ]]; then + kill "$SERVER_PID" >/dev/null 2>&1 || true + wait "$SERVER_PID" >/dev/null 2>&1 || true + SERVER_PID="" + fi +} + +run_gate() { + local name="$1" + if [[ "$SKIP_GATES" == "1" ]]; then + log "skipping $name inference gate" + return + fi + log "running $name inference gate" + ART="$ART/gate_$name" "$HOME/paged-inference-gates.sh" > "$ART/gate_$name.log" 2>&1 + cat "$ART/gate_$name.log" | tee -a "$ART/run.log" +} + +run_arm() { + local arm="$1" + shift + local arm_dir="$ART/$arm" + mkdir -p "$arm_dir" + log "starting $arm server" + cd "$BIN" + env LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 \ + ./llama-server \ + -m "$MODEL" -ngl 99 -fa on -c "$CTX" -b "$BATCH" -ub "$UBATCH" \ + --parallel "$PARALLEL" --host 127.0.0.1 --port "$PORT" --no-webui "$@" \ + > "$arm_dir/server.log" 2>&1 & + SERVER_PID=$! + if ! 
wait_server "$arm_dir/health.json"; then + tail -120 "$arm_dir/server.log" >&2 || true + exit 3 + fi + + for n in $NPL; do + log "running $arm n=$n" + python3 "$HOME/bench/h2h_cli3.py" \ + --url "http://127.0.0.1:$PORT/v1/completions" \ + --model m -n "$n" --ptok "$PTOK" --gen "$GEN" \ + --nonce "${arm}_${n}_$(date +%s)" --no-cache \ + > "$arm_dir/n${n}.json" + cat "$arm_dir/n${n}.json" | tee -a "$ART/run.log" + done + + grep -E "draft acceptance|statistics[[:space:]]+draft-mtp|speculative decoding context|bounded partial|backend sampling|common_speculative_impl_draft_mtp" \ + "$arm_dir/server.log" > "$arm_dir/spec_lines.txt" || true + stop_server +} + +preflight + +log "building llama-server and test-backend-ops" +cmake --build "$SRC/build-cuda" --target llama-server test-backend-ops llama-completion -j 8 \ + > "$ART/build.log" 2>&1 + +if [[ ! -x "$HOME/paged-inference-gates.sh" ]]; then + echo "missing $HOME/paged-inference-gates.sh; copy paged-inference-gates.sh there first" >&2 + exit 4 +fi + +run_gate pre +acquire_lock +trap release_lock EXIT +run_arm baseline +run_arm mtp --spec-type draft-mtp --spec-draft-n-max 3 --no-spec-draft-backend-sampling +release_lock +trap - EXIT +run_gate post + +python3 - "$ART" <<'PY' | tee "$ART/summary.tsv" +import json +import sys +from pathlib import Path + +art = Path(sys.argv[1]) +rows = [] +for arm in ("baseline", "mtp"): + for path in sorted((art / arm).glob("n*.json")): + data = json.loads(path.read_text()) + rows.append((arm, data["n"], data["gen_total"], data["agg_tps"], + data["decode_agg_tps"], data["decode_perseq_tps"], + data["ttft_mean_ms"], data["wall_s"])) +print("arm\tn\tgen_total\tagg_tps\tdecode_agg_tps\tdecode_perseq_tps\tttft_mean_ms\twall_s") +for row in rows: + print("\t".join(str(x) for x in row)) +PY + +log "artifacts: $ART" diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0001-vendor-paged-kv-manager.patch 
b/backend/cpp/llama-cpp-localai-paged/patches/paged/0001-vendor-paged-kv-manager.patch new file mode 100644 index 000000000000..8cce3c973691 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0001-vendor-paged-kv-manager.patch @@ -0,0 +1,448 @@ +From bef64835d444a44ed8391bc395cdab38164229d5 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Fri, 19 Jun 2026 22:54:49 +0000 +Subject: [PATCH] vendor paged kv manager + +vLLM-parity host-side KV block manager (FreeBlockQueue, BlockPool, +PagedKVManager, chained-hash prefix cache). Pure C++17, no behavior change - +nothing uses it yet; wired in by later patches in the series. +--- + src/CMakeLists.txt | 1 + + src/paged-kv-manager.cpp | 296 +++++++++++++++++++++++++++++++++++++++ + src/paged-kv-manager.h | 108 ++++++++++++++ + 3 files changed, 405 insertions(+) + create mode 100644 src/paged-kv-manager.cpp + create mode 100644 src/paged-kv-manager.h + +diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt +index d15ccfd99..a030940b8 100644 +--- a/src/CMakeLists.txt ++++ b/src/CMakeLists.txt +@@ -24,6 +24,7 @@ add_library(llama + llama-io.cpp + llama-kv-cache.cpp + llama-kv-cache-iswa.cpp ++ paged-kv-manager.cpp + llama-kv-cache-dsa.cpp + llama-memory.cpp + llama-memory-hybrid.cpp +diff --git a/src/paged-kv-manager.cpp b/src/paged-kv-manager.cpp +new file mode 100644 +index 000000000..ca0dcd83a +--- /dev/null ++++ b/src/paged-kv-manager.cpp +@@ -0,0 +1,296 @@ ++#include "paged-kv-manager.h" ++#include ++#include ++ ++namespace paged { ++ ++// --------------------------------------------------------------------------- ++// FreeBlockQueue (port of kv_cache_utils.py FreeKVCacheBlockQueue) ++// --------------------------------------------------------------------------- ++ ++FreeBlockQueue::FreeBlockQueue(const std::vector& blocks) { ++ num_free_blocks = blocks.size(); ++ for (size_t i = 0; i < blocks.size(); ++i) { ++ if (i > 0) blocks[i]->prev_free = blocks[i - 1]; ++ if (i + 1 < blocks.size()) 
blocks[i]->next_free = blocks[i + 1]; ++ } ++ if (!blocks.empty()) { ++ fake_head.next_free = blocks.front(); ++ blocks.front()->prev_free = &fake_head; ++ fake_tail.prev_free = blocks.back(); ++ blocks.back()->next_free = &fake_tail; ++ } else { ++ fake_head.next_free = &fake_tail; ++ fake_tail.prev_free = &fake_head; ++ } ++} ++ ++KVCacheBlock* FreeBlockQueue::popleft() { ++ KVCacheBlock* first = fake_head.next_free; ++ if (first == &fake_tail || first == nullptr) { ++ assert(num_free_blocks == 0); ++ throw std::runtime_error("No free blocks available"); ++ } ++ fake_head.next_free = first->next_free; ++ first->next_free->prev_free = &fake_head; ++ first->prev_free = first->next_free = nullptr; ++ num_free_blocks--; ++ return first; ++} ++ ++std::vector<KVCacheBlock*> FreeBlockQueue::popleft_n(size_t n) { ++ std::vector<KVCacheBlock*> ret; ++ if (n == 0) return ret; ++ assert(num_free_blocks >= n); ++ num_free_blocks -= n; ++ KVCacheBlock* curr = fake_head.next_free; ++ ret.reserve(n); ++ for (size_t i = 0; i < n; ++i) { ++ assert(curr != nullptr); ++ ret.push_back(curr); ++ KVCacheBlock* last = curr; ++ curr = curr->next_free; ++ last->prev_free = last->next_free = nullptr; ++ } ++ if (curr != nullptr) { ++ fake_head.next_free = curr; ++ curr->prev_free = &fake_head; ++ } ++ return ret; ++} ++ ++void FreeBlockQueue::remove(KVCacheBlock* block) { ++ if (!block->prev_free || !block->next_free) ++ throw std::runtime_error("remove() called on an invalid block"); ++ block->prev_free->next_free = block->next_free; ++ block->next_free->prev_free = block->prev_free; ++ block->prev_free = block->next_free = nullptr; ++ num_free_blocks--; ++} ++ ++void FreeBlockQueue::append(KVCacheBlock* block) { ++ KVCacheBlock* last = fake_tail.prev_free; ++ last->next_free = block; ++ block->prev_free = last; ++ block->next_free = &fake_tail; ++ fake_tail.prev_free = block; ++ num_free_blocks++; ++} ++ ++void FreeBlockQueue::append_n(const std::vector<KVCacheBlock*>& blocks) { ++ if (blocks.empty()) return; ++ KVCacheBlock* last 
= fake_tail.prev_free; ++ for (KVCacheBlock* b : blocks) { ++ b->prev_free = last; ++ last->next_free = b; ++ last = b; ++ } ++ last->next_free = &fake_tail; ++ fake_tail.prev_free = last; ++ num_free_blocks += blocks.size(); ++} ++ ++void FreeBlockQueue::prepend_n(const std::vector<KVCacheBlock*>& blocks) { ++ if (blocks.empty()) return; ++ KVCacheBlock* first = fake_head.next_free; ++ KVCacheBlock* prev = &fake_head; ++ for (KVCacheBlock* b : blocks) { ++ b->prev_free = prev; ++ prev->next_free = b; ++ prev = b; ++ } ++ prev->next_free = first; ++ first->prev_free = prev; ++ num_free_blocks += blocks.size(); ++} ++ ++std::vector<KVCacheBlock*> FreeBlockQueue::get_all_free_blocks() const { ++ std::vector<KVCacheBlock*> ret; ++ const KVCacheBlock* curr = fake_head.next_free; ++ while (curr && curr->next_free != nullptr) { ++ ret.push_back(const_cast<KVCacheBlock*>(curr)); ++ curr = curr->next_free; ++ } ++ return ret; ++} ++ ++// --------------------------------------------------------------------------- ++// BlockPool (port of block_pool.py) ++// --------------------------------------------------------------------------- ++ ++static std::vector<KVCacheBlock*> make_ptrs(std::vector<KVCacheBlock>& v) { ++ std::vector<KVCacheBlock*> p; ++ p.reserve(v.size()); ++ for (auto& b : v) p.push_back(&b); ++ return p; ++} ++ ++static std::vector<KVCacheBlock> make_block_vec(int32_t num_blocks) { ++ std::vector<KVCacheBlock> v; ++ v.reserve(num_blocks); ++ for (int32_t i = 0; i < num_blocks; ++i) v.emplace_back(i); ++ return v; ++} ++ ++BlockPool::BlockPool(int32_t num_blocks, bool enable_caching) ++ : enable_caching_(enable_caching), ++ blocks_(make_block_vec(num_blocks)), ++ ptrs_(make_ptrs(blocks_)), ++ free_queue_(ptrs_) { ++ // vLLM reserves block_id 0 as the null block (never cached). 
++ null_block = free_queue_.popleft(); ++ null_block->is_null = true; ++} ++ ++bool BlockPool::maybe_evict_cached_block(KVCacheBlock* block) { ++ if (!block->has_hash) return false; ++ auto it = cached_block_hash_to_block_.find(block->block_hash); ++ if (it == cached_block_hash_to_block_.end() || it->second != block) return false; ++ cached_block_hash_to_block_.erase(it); ++ block->reset_hash(); ++ return true; ++} ++ ++std::vector<KVCacheBlock*> BlockPool::get_new_blocks(size_t n) { ++ if (n > get_num_free_blocks()) ++ throw std::runtime_error("Cannot get free blocks from pool"); ++ auto ret = free_queue_.popleft_n(n); ++ for (KVCacheBlock* b : ret) { ++ if (enable_caching_) maybe_evict_cached_block(b); ++ assert(b->ref_cnt == 0); ++ b->ref_cnt += 1; ++ } ++ return ret; ++} ++ ++KVCacheBlock* BlockPool::get_cached_block(uint64_t block_hash) { ++ auto it = cached_block_hash_to_block_.find(block_hash); ++ return it == cached_block_hash_to_block_.end() ? nullptr : it->second; ++} ++ ++void BlockPool::touch(const std::vector<KVCacheBlock*>& blocks) { ++ for (KVCacheBlock* b : blocks) { ++ // ref_cnt==0 means the block is a free-list eviction candidate; pull it out. ++ if (b->ref_cnt == 0 && !b->is_null) free_queue_.remove(b); ++ b->ref_cnt += 1; ++ } ++} ++ ++void BlockPool::free_blocks(const std::vector<KVCacheBlock*>& ordered_blocks) { ++ std::vector<KVCacheBlock*> without_hash, with_hash; ++ for (KVCacheBlock* b : ordered_blocks) { ++ if (b->is_null) continue; ++ b->ref_cnt -= 1; ++ if (b->ref_cnt == 0) (b->has_hash ? 
with_hash : without_hash).push_back(b); ++ } ++ free_queue_.prepend_n(without_hash); // un-hashed: evicted first (front) ++ free_queue_.append_n(with_hash); // hashed: kept warm (tail) ++} ++ ++void BlockPool::cache_full_blocks(const std::vector& req_blocks, ++ size_t num_cached_blocks, size_t num_full_blocks, ++ const std::vector& block_hashes) { ++ for (size_t i = num_cached_blocks; i < num_full_blocks; ++i) { ++ KVCacheBlock* blk = req_blocks[i]; ++ if (blk->has_hash) continue; ++ blk->has_hash = true; ++ blk->block_hash = block_hashes[i]; ++ cached_block_hash_to_block_[blk->block_hash] = blk; ++ } ++} ++ ++// --------------------------------------------------------------------------- ++// PagedKVManager (port of SingleTypeKVCacheManager / FullAttentionManager) ++// --------------------------------------------------------------------------- ++ ++static inline size_t cdiv(size_t a, size_t b) { return (a + b - 1) / b; } ++ ++PagedKVManager::PagedKVManager(int32_t num_blocks, int block_size, bool enable_caching) ++ : block_size_(block_size), pool_(num_blocks, enable_caching) {} ++ ++bool PagedKVManager::allocate(int seq_id, size_t total_tokens) { ++ auto& req = req_to_blocks_[seq_id]; ++ size_t need = cdiv(total_tokens, block_size_); ++ if (need <= req.size()) return true; ++ size_t add = need - req.size(); ++ if (add > pool_.get_num_free_blocks()) return false; // OOM ++ auto nb = pool_.get_new_blocks(add); ++ req.insert(req.end(), nb.begin(), nb.end()); ++ return true; ++} ++ ++std::vector PagedKVManager::block_table(int seq_id) const { ++ std::vector bt; ++ auto it = req_to_blocks_.find(seq_id); ++ if (it == req_to_blocks_.end()) return bt; ++ bt.reserve(it->second.size()); ++ for (KVCacheBlock* b : it->second) bt.push_back(b->block_id); ++ return bt; ++} ++ ++int64_t PagedKVManager::slot(int seq_id, int pos) const { ++ const auto& req = req_to_blocks_.at(seq_id); ++ int32_t phys = req[pos / block_size_]->block_id; ++ return (int64_t)phys * block_size_ + (pos % 
block_size_); ++} ++ ++std::vector PagedKVManager::slot_mapping(int seq_id, const std::vector& positions) const { ++ std::vector sm; ++ sm.reserve(positions.size()); ++ for (int p : positions) sm.push_back(slot(seq_id, p)); ++ return sm; ++} ++ ++void PagedKVManager::free(int seq_id) { ++ auto it = req_to_blocks_.find(seq_id); ++ if (it == req_to_blocks_.end()) return; ++ // Free in reverse so the tail of the block chain is evicted first (vLLM order). ++ std::vector ordered(it->second.rbegin(), it->second.rend()); ++ pool_.free_blocks(ordered); ++ req_to_blocks_.erase(it); ++} ++ ++// FNV-1a chained block hash. Deterministic and prefix-sensitive; folds the parent ++// hash into the seed so each block hash transitively encodes its whole prefix ++// (behavioral parity with vLLM hash_block_tokens chaining; vLLM uses sha256 bytes). ++uint64_t PagedKVManager::hash_block(uint64_t parent_hash, const std::vector& token_ids) { ++ uint64_t h = 1469598103934665603ull ^ parent_hash; ++ for (int t : token_ids) { ++ h ^= (uint64_t)(uint32_t)t; ++ h *= 1099511628211ull; ++ } ++ if (h == 0) h = 0x9e3779b97f4a7c15ull; // never 0 (0 reads as "no hash") ++ return h; ++} ++ ++std::vector PagedKVManager::compute_block_hashes(const std::vector& token_ids) const { ++ std::vector hashes; ++ uint64_t parent = 0; // NONE_HASH analogue ++ size_t n_full = token_ids.size() / block_size_; ++ for (size_t i = 0; i < n_full; ++i) { ++ std::vector blk(token_ids.begin() + i * block_size_, ++ token_ids.begin() + (i + 1) * block_size_); ++ parent = hash_block(parent, blk); ++ hashes.push_back(parent); ++ } ++ return hashes; ++} ++ ++size_t PagedKVManager::get_computed_blocks(const std::vector& block_hashes) { ++ std::vector hits; ++ for (uint64_t bh : block_hashes) { // stop at first miss (prefix property) ++ KVCacheBlock* cb = pool_.get_cached_block(bh); ++ if (!cb) break; ++ hits.push_back(cb); ++ } ++ pool_.touch(hits); // ++ref_cnt, pull from free list ++ return hits.size() * (size_t)block_size_; 
++} ++ ++void PagedKVManager::cache_blocks(int seq_id, const std::vector& block_hashes, size_t num_tokens) { ++ auto& req = req_to_blocks_[seq_id]; ++ size_t n_full = num_tokens / block_size_; ++ pool_.cache_full_blocks(req, /*num_cached=*/0, n_full, block_hashes); ++} ++ ++} // namespace paged +diff --git a/src/paged-kv-manager.h b/src/paged-kv-manager.h +new file mode 100644 +index 000000000..740280a7f +--- /dev/null ++++ b/src/paged-kv-manager.h +@@ -0,0 +1,109 @@ ++#pragma once ++// Paged KV cache block manager for llama.cpp (CPU-first prototype). ++// ++// Host-side block management is a faithful port of vLLM V1: ++// vllm/v1/core/kv_cache_utils.py (KVCacheBlock, FreeKVCacheBlockQueue, hash_block_tokens) ++// vllm/v1/core/block_pool.py (BlockPool: get_new_blocks/touch/free/evict/cache_full_blocks) ++// vllm/v1/core/single_type_kv_cache_manager.py (allocate_new_blocks, find_longest_cache_hit) ++// ++// Parity is on behavior/algorithm (block chaining, first-miss stop, ref-counting, ++// LRU eviction order), not on exact hash bytes. This unit has zero ggml/llama.cpp ++// dependency so it can be unit-tested in isolation. ++ ++#include ++#include ++#include ++#include ++#include ++ ++namespace paged { ++ ++// vLLM KVCacheBlock (kv_cache_utils.py). ++struct KVCacheBlock { ++ int32_t block_id = 0; ++ int ref_cnt = 0; ++ bool has_hash = false; // vLLM: _block_hash is set only when full+cached ++ uint64_t block_hash = 0; ++ bool is_null = false; ++ KVCacheBlock* prev_free = nullptr; ++ KVCacheBlock* next_free = nullptr; ++ ++ explicit KVCacheBlock(int32_t id = 0) : block_id(id) {} ++ void reset_hash() { has_hash = false; block_hash = 0; } ++}; ++ ++// Intrusive doubly-linked free list with fake head/tail (vLLM FreeKVCacheBlockQueue). ++// O(1) middle removal is required so touch() can pull a warm cached block out of the ++// free list when a later request hits its prefix. 
++class FreeBlockQueue { ++public: ++ size_t num_free_blocks = 0; ++ ++ explicit FreeBlockQueue(const std::vector& blocks); ++ KVCacheBlock* popleft(); ++ std::vector popleft_n(size_t n); ++ void remove(KVCacheBlock* block); ++ void append(KVCacheBlock* block); ++ void append_n(const std::vector& blocks); ++ void prepend_n(const std::vector& blocks); ++ std::vector get_all_free_blocks() const; ++ ++private: ++ KVCacheBlock fake_head{-1}; ++ KVCacheBlock fake_tail{-1}; ++}; ++ ++// vLLM BlockPool (block_pool.py). ++class BlockPool { ++public: ++ KVCacheBlock* null_block = nullptr; ++ ++ BlockPool(int32_t num_blocks, bool enable_caching); ++ std::vector get_new_blocks(size_t n); ++ KVCacheBlock* get_cached_block(uint64_t block_hash); ++ void touch(const std::vector& blocks); ++ void free_blocks(const std::vector& ordered_blocks); ++ void cache_full_blocks(const std::vector& req_blocks, ++ size_t num_cached_blocks, size_t num_full_blocks, ++ const std::vector& block_hashes); ++ size_t get_num_free_blocks() const { return free_queue_.num_free_blocks; } ++ ++private: ++ bool maybe_evict_cached_block(KVCacheBlock* block); ++ ++ bool enable_caching_; ++ std::vector blocks_; // owns all block descriptors ++ std::vector ptrs_; ++ FreeBlockQueue free_queue_; ++ // vLLM stores hash -> {block_id: block} to allow duplicate-content blocks; the ++ // prototype keeps the last writer (single KV-cache group is sufficient for the wins). ++ std::unordered_map cached_block_hash_to_block_; ++}; ++ ++// Allocation + prefix-caching surface, ported from SingleTypeKVCacheManager / ++// FullAttentionManager. Single KV-cache group; no extra_keys / eagle / spec-decode. ++class PagedKVManager { ++public: ++ PagedKVManager(int32_t num_blocks, int block_size, bool enable_caching); ++ ++ // Grow seq_id to cover total_tokens slots. Returns false on OOM (free queue empty). 
++ bool allocate(int seq_id, size_t total_tokens); ++ std::vector block_table(int seq_id) const; ++ int64_t slot(int seq_id, int pos) const; ++ std::vector slot_mapping(int seq_id, const std::vector& positions) const; ++ void free(int seq_id); ++ int block_size() const { return block_size_; } ++ ++ // Prefix caching (win 3). ++ static uint64_t hash_block(uint64_t parent_hash, const std::vector& token_ids); ++ std::vector compute_block_hashes(const std::vector& token_ids) const; ++ size_t get_computed_blocks(const std::vector& block_hashes); // returns num cached tokens ++ void cache_blocks(int seq_id, const std::vector& block_hashes, size_t num_tokens); ++ ++protected: ++ int block_size_; ++ BlockPool pool_; ++ std::map> req_to_blocks_; ++}; ++ ++} // namespace paged +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0002-paged-kv-block-placement-env-LLAMA_KV_PAGED.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0002-paged-kv-block-placement-env-LLAMA_KV_PAGED.patch new file mode 100644 index 000000000000..3ba88af4c513 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0002-paged-kv-block-placement-env-LLAMA_KV_PAGED.patch @@ -0,0 +1,75 @@ +From 5c9c709e6c6b07e0399b75fd4e46e752d418a9a8 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Fri, 19 Jun 2026 23:04:17 +0000 +Subject: [PATCH] paged kv block placement (env LLAMA_KV_PAGED) + +Place each sequence's tokens at permuted, non-contiguous fixed-size block +positions in find_slot, proving attention is invariant to physical KV placement +(token-identical greedy generation). Default off; single-sequence scope; falls +back to the normal allocator. The paged-placement substrate for the gather-read. 
+--- + src/llama-kv-cache.cpp | 41 +++++++++++++++++++++++++++++++++++++++++ + 1 file changed, 41 insertions(+) + +diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp +index 2802103bd..999e2ae61 100644 +--- a/src/llama-kv-cache.cpp ++++ b/src/llama-kv-cache.cpp +@@ -11,6 +11,8 @@ + #include + #include + #include ++#include ++#include + #include + + static bool ggml_is_power_of_2(int n) { +@@ -1020,6 +1022,45 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch, + return { }; + } + ++ // [paged, experimental] Place this sequence's tokens at permuted, ++ // non-contiguous fixed-size BLOCK positions instead of a contiguous run. ++ // This validates that attention is invariant to physical KV placement - ++ // the correctness premise of paged attention. Enabled via LLAMA_KV_PAGED. ++ // Single-sequence scope (uses get_used() as the logical base); falls back ++ // to the normal allocator if the permuted cells aren't available. ++ static const bool paged_mode = (std::getenv("LLAMA_KV_PAGED") != nullptr); ++ if (paged_mode) { ++ const uint32_t bs = 16; // block size (tokens/block) ++ const uint32_t nblk = cells.size() / bs; // blocks in this stream's pool ++ if (nblk >= 2) { ++ // stride coprime to nblk => block-index permutation is a bijection ++ uint32_t k = 1; ++ for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) { ++ if (std::gcd(cand, nblk) == 1u) { k = cand; break; } ++ } ++ const uint32_t base = cells.get_used(); ++ bool ok = true; ++ for (uint32_t i = 0; i < n_tokens; ++i) { ++ const uint32_t L = base + i; ++ const uint32_t b = L / bs; ++ const uint32_t off = L % bs; ++ if (b >= nblk) { ok = false; break; } ++ const uint32_t phys = ((b * k) % nblk) * bs + off; // permuted block ++ if (phys >= cells.size() || !cells.is_empty(phys)) { ok = false; break; } ++ res.idxs[s].push_back(phys); ++ } ++ if (ok && res.idxs[s].size() == n_tokens) { ++ if (std::getenv("LLAMA_KV_PAGED_DEBUG")) { ++ fprintf(stderr, "[paged] seq 
placed %u tok at cells:", n_tokens); ++ for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]); ++ fprintf(stderr, " (k=%u nblk=%u base=%u)\n", k, nblk, base); ++ } ++ continue; // paged placement succeeded for this sequence ++ } ++ res.idxs[s].clear(); // fall back to the normal allocator ++ } ++ } ++ + uint32_t n_tested = 0; + + // for continuous slots, we test that all tokens in the ubatch fit, starting from the current head +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0003-paged-gather-read-env-LLAMA_KV_PAGED.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0003-paged-gather-read-env-LLAMA_KV_PAGED.patch new file mode 100644 index 000000000000..347f34f15b2b --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0003-paged-gather-read-env-LLAMA_KV_PAGED.patch @@ -0,0 +1,370 @@ +From c1de00f4cc1eb0dd25993880bb4c8562be1937d4 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Mon, 22 Jun 2026 10:24:22 +0200 +Subject: [PATCH] paged gather-read (env LLAMA_KV_PAGED) - patch 0003 + +Gather K, V and the kq_mask down to each sequence stream's non-empty cells +before build_attn_mha. Position-sorted per stream so the flash-attn online +softmax reduction order matches stock byte-for-byte. Multi-stream: one index +column per stream over k->ne[3], padded to the max non-empty count with a +masked (empty) cell. Gated behind LLAMA_KV_PAGED; no-op when unset. 
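Editorial illustration, not part of the patch: a minimal sketch of how one stream's gather index column could be built, under the assumptions stated in this commit message (non-empty cells sorted by token position, tail padded with a masked empty cell). `gather_column` and the convention that a negative position marks an empty cell are hypothetical stand-ins for the real `llama_kv_cells` accessors.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for one stream: pos[i] is the token position stored in
// physical cell i, or a negative value when the cell is empty.
// The caller is assumed to pass n_gather >= number of non-empty cells.
static std::vector<int32_t> gather_column(const std::vector<int32_t> & pos, size_t n_gather) {
    std::vector<std::pair<int32_t, int32_t>> pc;  // (position, cell index)
    int32_t pad = -1;                             // first empty cell, used as padding
    for (size_t i = 0; i < pos.size(); ++i) {
        if (pos[i] >= 0) {
            pc.emplace_back(pos[i], static_cast<int32_t>(i));
        } else if (pad < 0) {
            pad = static_cast<int32_t>(i);
        }
    }
    // Sort by position so the reduction order matches contiguous placement.
    std::sort(pc.begin(), pc.end());

    std::vector<int32_t> col;
    col.reserve(n_gather);
    for (const auto & p : pc) {
        col.push_back(p.second);
    }
    // Pad with an empty (masked) cell when one exists, else repeat the last cell.
    const int32_t padv = (pad >= 0) ? pad : (pc.empty() ? 0 : pc.back().second);
    while (col.size() < n_gather) {
        col.push_back(padv);
    }
    return col;
}
```

Sorting by position rather than by cell index is the step that matters for bit-exactness: floating-point softmax accumulation is order-sensitive, so the gathered read has to visit cells in the same order a contiguous cache would.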
+--- + src/CMakeLists.txt | 1 + + src/llama-graph.cpp | 9 ++- + src/llama-kv-cache.cpp | 74 ++++++++++++++++++++++++ + src/llama-kv-cache.h | 11 ++++ + src/paged-attn.cpp | 128 +++++++++++++++++++++++++++++++++++++++++ + src/paged-attn.h | 40 +++++++++++++ + 6 files changed, 262 insertions(+), 1 deletion(-) + create mode 100644 src/paged-attn.cpp + create mode 100644 src/paged-attn.h + +diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt +index a030940..58083b3 100644 +--- a/src/CMakeLists.txt ++++ b/src/CMakeLists.txt +@@ -25,6 +25,7 @@ add_library(llama + llama-kv-cache.cpp + llama-kv-cache-iswa.cpp + paged-kv-manager.cpp ++ paged-attn.cpp + llama-kv-cache-dsa.cpp + llama-memory.cpp + llama-memory-hybrid.cpp +diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp +index 68c9e60..b59d2a5 100644 +--- a/src/llama-graph.cpp ++++ b/src/llama-graph.cpp +@@ -6,6 +6,8 @@ + #include "llama-cparams.h" + + #include "llama-kv-cache.h" ++ ++#include "paged-attn.h" + #include "llama-kv-cache-iswa.h" + #include "llama-kv-cache-dsa.h" + #include "llama-memory-hybrid.h" +@@ -2356,7 +2358,12 @@ ggml_tensor * llm_graph_context::build_attn( + ggml_tensor * k = mctx_cur->get_k(ctx0, il); + ggml_tensor * v = mctx_cur->get_v(ctx0, il); + +- ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask, sinks, v_mla, kq_scale, il); ++ // [paged 0003] gather K, V and the mask to the sequence's used cells only ++ // (no-op unless env LLAMA_KV_PAGED is set). 
++ ggml_tensor * kq_mask_g = kq_mask; ++ paged_attn::gather(ctx0, res, mctx_cur, &k, &v, &kq_mask_g); ++ ++ ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask_g, sinks, v_mla, kq_scale, il); + cb(cur, "kqv_out", il); + + if (inp->self_v_rot) { +diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp +index 999e2ae..30d02d7 100644 +--- a/src/llama-kv-cache.cpp ++++ b/src/llama-kv-cache.cpp +@@ -1,4 +1,6 @@ + #include "llama-kv-cache.h" ++#include ++#include + + #include "llama-impl.h" + #include "llama-io.h" +@@ -1329,6 +1331,70 @@ ggml_tensor * llama_kv_cache::get_v(ggml_context * ctx, int32_t il, uint32_t n_k + ggml_row_size(v->type, kv_size*n_embd_v_gqa)*sinfo.s0); + } + ++// [paged 0003] gather-read: enumerate the non-empty cells in [0, n_kv) for the ++// single stream addressed by sinfo. With paged placement (patch 0002) these are ++// the sequence's scattered block cells; gathering K/V/mask by this index list ++// compacts the attention read while preserving every unmasked (token,cell) pair. ++uint32_t llama_kv_cache::get_n_gather(uint32_t n_kv, const slot_info & sinfo) const { ++ // Multi-stream: the gathered K/V/mask tensors are rectangular [.., n_gather, ++ // n_stream], so n_gather is the MAX non-empty count across the batch streams. ++ // Streams with fewer cells are padded (see get_gather_idxs) with a masked ++ // (empty) cell index, which contributes exp(-inf)=0 and is thus a no-op. ++ // K is laid out over physical streams [s0, s1]; index v_cells the same way. 
++ const uint32_t ns = sinfo.s1 - sinfo.s0 + 1; ++ uint32_t mx = 0; ++ for (uint32_t j = 0; j < ns; ++j) { ++ const auto & cells = v_cells[sinfo.s0 + j]; ++ const uint32_t n = std::min(n_kv, cells.size()); ++ uint32_t cnt = 0; ++ for (uint32_t i = 0; i < n; ++i) { ++ if (!cells.is_empty(i)) { ++ ++cnt; ++ } ++ } ++ mx = std::max(mx, cnt); ++ } ++ return mx; ++} ++ ++void llama_kv_cache::get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_info & sinfo) const { ++ const uint32_t ns = sinfo.s1 - sinfo.s0 + 1; ++ const uint32_t n_gather = get_n_gather(n_kv, sinfo); ++ // dst is [n_gather, n_stream] (ne0 = n_gather): column s at dst[s*n_gather..]. ++ for (uint32_t j = 0; j < ns; ++j) { ++ const auto & cells = v_cells[sinfo.s0 + j]; ++ const uint32_t n = std::min(n_kv, cells.size()); ++ // Collect the non-empty cells, then order them by token POSITION (not by ++ // physical cell index). The attention reduction (flash-attn online ++ // softmax, and the non-flash soft_max) runs over cells in array order and ++ // is order-sensitive in floating point. Stock (contiguous) placement ++ // happens to store cells in position order, so emitting the gathered ++ // indices in position order reproduces stock's exact reduction order - ++ // making the paged read bit-identical, not merely math-equivalent. ++ std::vector> pc; ++ pc.reserve(n); ++ int32_t pad = -1; ++ for (uint32_t i = 0; i < n; ++i) { ++ if (!cells.is_empty(i)) { ++ pc.emplace_back(cells.pos_get(i), (int32_t) i); ++ } else if (pad < 0) { ++ pad = (int32_t) i; // first empty cell: its mask is -inf -> safe pad ++ } ++ } ++ std::sort(pc.begin(), pc.end()); ++ int32_t * col = dst + (size_t) j * n_gather; ++ for (size_t k = 0; k < pc.size(); ++k) { ++ col[k] = pc[k].second; ++ } ++ // Pad the tail to n_gather with a masked (empty) cell so the rectangular ++ // gather drops to zero contribution for streams shorter than the max. ++ const int32_t padv = (pad >= 0) ? pad : (pc.empty() ? 
0 : pc.back().second); ++ for (uint32_t k = (uint32_t) pc.size(); k < n_gather; ++k) { ++ col[k] = padv; ++ } ++ } ++} ++ + ggml_tensor * llama_kv_cache::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const { + GGML_UNUSED(sinfo); + +@@ -2620,6 +2686,14 @@ ggml_tensor * llama_kv_cache_context::get_v(ggml_context * ctx, int32_t il) cons + return kv->get_v(ctx, il, n_kv, sinfos[i_cur]); + } + ++uint32_t llama_kv_cache_context::get_n_gather() const { ++ return kv->get_n_gather(n_kv, sinfos[i_cur]); ++} ++ ++void llama_kv_cache_context::get_gather_idxs(int32_t * dst) const { ++ kv->get_gather_idxs(dst, n_kv, sinfos[i_cur]); ++} ++ + ggml_tensor * llama_kv_cache_context::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il) const { + return kv->cpy_k(ctx, k_cur, k_idxs, il, sinfos[i_cur]); + } +diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h +index 3d68f98..494c0fb 100644 +--- a/src/llama-kv-cache.h ++++ b/src/llama-kv-cache.h +@@ -171,6 +171,12 @@ public: + ggml_tensor * get_k(ggml_context * ctx, int32_t il, uint32_t n_kv, const slot_info & sinfo) const; + ggml_tensor * get_v(ggml_context * ctx, int32_t il, uint32_t n_kv, const slot_info & sinfo) const; + ++ // [paged 0003] count / list the non-empty cells in [0, n_kv) per stream of ++ // sinfo (position-sorted, padded across streams). Used by paged-attn ++ // gather-read. get_n_gather returns the max count across streams. 
++ uint32_t get_n_gather(uint32_t n_kv, const slot_info & sinfo) const; ++ void get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_info & sinfo) const; ++ + // store k_cur and v_cur in the cache based on the provided head location + ggml_tensor * cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const; + ggml_tensor * cpy_v(ggml_context * ctx, ggml_tensor * v_cur, ggml_tensor * v_idxs, int32_t il, const slot_info & sinfo) const; +@@ -368,6 +374,11 @@ public: + ggml_tensor * get_k(ggml_context * ctx, int32_t il) const; + ggml_tensor * get_v(ggml_context * ctx, int32_t il) const; + ++ // [paged 0003] gather-read helpers (delegate to the kv cache for the ++ // current ubatch's stream). ++ uint32_t get_n_gather() const; ++ void get_gather_idxs(int32_t * dst) const; ++ + // store k_cur and v_cur in the cache based on the provided head location + // note: the heads in k_cur and v_cur should be laid out contiguously in memory + // - k_cur [n_embd_head_k, n_head_k, n_tokens] +diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp +new file mode 100644 +index 0000000..ade75e8 +--- /dev/null ++++ b/src/paged-attn.cpp +@@ -0,0 +1,128 @@ ++#include "paged-attn.h" ++ ++#include "llama-graph.h" ++#include "llama-kv-cache.h" ++ ++#include "ggml.h" ++#include "ggml-backend.h" ++ ++#include ++#include ++ ++namespace paged_attn { ++ ++bool active() { ++ static const bool a = (std::getenv("LLAMA_KV_PAGED") != nullptr); ++ return a; ++} ++ ++static bool debug() { ++ static const bool d = (std::getenv("LLAMA_KV_PAGED_DEBUG") != nullptr); ++ return d; ++} ++ ++namespace { ++ ++// Graph input that, at set_input time, fills an I32 [n_gather, n_stream] tensor ++// with each stream's non-empty cell indices (position-sorted, padded with a ++// masked/empty cell) by delegating to the kv-cache context. 
Private to this ++// unit; default can_reuse()==false keeps the graph from being reused across ++// decodes (n_gather grows every step). ++class input_gather_idxs : public llm_graph_input_i { ++public: ++ input_gather_idxs(const llama_kv_cache_context * mctx, ggml_tensor * idxs) ++ : mctx(mctx), idxs(idxs) {} ++ ++ void set_input(const llama_ubatch * ubatch) override { ++ GGML_UNUSED(ubatch); ++ GGML_ASSERT(idxs && ggml_backend_buffer_is_host(idxs->buffer)); ++ mctx->get_gather_idxs((int32_t *) idxs->data); ++ } ++ ++ const llama_kv_cache_context * mctx; ++ ggml_tensor * idxs; ++}; ++ ++} // namespace ++ ++void gather(ggml_context * ctx0, ++ llm_graph_result * res, ++ const llama_kv_cache_context * mctx, ++ ggml_tensor ** k, ++ ggml_tensor ** v, ++ ggml_tensor ** kq_mask) { ++ if (!active()) { ++ return; ++ } ++ ++ ggml_tensor * K = *k; ++ ggml_tensor * V = *v; ++ ggml_tensor * M = *kq_mask; ++ ++ // Number of streams (sequences) in the unified batch. K is laid out ++ // [d, h, n_kv, n_stream] and the mask is [n_kv, n_tps, 1, n_stream]; the ++ // gather is per-stream (one index column per stream), so a single ++ // ggml_get_rows over the stream axis handles 1..N streams uniformly. ++ const int64_t n_stream = K->ne[3]; ++ GGML_ASSERT(M->ne[3] == n_stream); ++ ++ const int64_t n_gather = (int64_t) mctx->get_n_gather(); ++ if (n_gather <= 0) { ++ // Worst-case graph reserve (empty cache) or nothing placed yet: leave ++ // the full [0, n_kv) read untouched so buffer sizing stays worst-case. ++ return; ++ } ++ ++ if (debug()) { ++ static int64_t once = 0; ++ if (once++ < 2) { ++ fprintf(stderr, "[paged-attn] gather n_stream=%lld n_kv=%lld n_gather=%lld\n", ++ (long long) n_stream, (long long) K->ne[2], (long long) n_gather); ++ } ++ } ++ ++ // Per-stream index tensor [n_gather, n_stream], filled at set_input from ++ // each stream's non-empty cells. ggml_get_rows broadcasts along ne[1]== ++ // n_stream, so column s gathers from stream s of the source. 
++ ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_gather, n_stream); ++ ggml_set_input(idx); ++ res->add_input(llm_graph_input_ptr(new input_gather_idxs(mctx, idx))); ++ ++ // --- gather K: collapse (head_dim, n_head) so cells become the row axis --- ++ { ++ ggml_tensor * t = ggml_cont(ctx0, K); // [d, h, n_kv, ns] ++ t = ggml_reshape_3d(ctx0, t, K->ne[0]*K->ne[1], K->ne[2], n_stream); // [d*h, n_kv, ns] ++ t = ggml_get_rows(ctx0, t, idx); // [d*h, n_gather, ns] ++ *k = ggml_reshape_4d(ctx0, t, K->ne[0], K->ne[1], n_gather, n_stream); // [d, h, n_gather, ns] ++ } ++ ++ // --- gather V --- ++ // Normalize to a non-transposed [d, h, n_kv, ns] view first, so the gathered ++ // result is contiguous and build_attn_mha sees a consistent v_trans==false. ++ { ++ const bool v_trans = V->nb[1] > V->nb[2]; ++ ggml_tensor * vsrc = v_trans ++ ? ggml_permute(ctx0, V, 2, 1, 0, 3) // [n_kv, h, d, ns] -> [d, h, n_kv, ns] ++ : V; // already [d, h, n_kv, ns] ++ ggml_tensor * t = ggml_cont(ctx0, vsrc); // [d, h, n_kv, ns] ++ t = ggml_reshape_3d(ctx0, t, vsrc->ne[0]*vsrc->ne[1], vsrc->ne[2], n_stream); // [d*h, n_kv, ns] ++ t = ggml_get_rows(ctx0, t, idx); // [d*h, n_gather, ns] ++ *v = ggml_reshape_4d(ctx0, t, vsrc->ne[0], vsrc->ne[1], n_gather, n_stream); // [d, h, n_gather, ns] ++ } ++ ++ // --- gather mask (cells are ne0): transpose so cells become the row axis, ++ // gather per stream, transpose back --- ++ { ++ ggml_tensor * m = ggml_reshape_3d(ctx0, M, M->ne[0], M->ne[1], n_stream); // [n_kv, n_tps, ns] ++ m = ggml_cont(ctx0, ggml_transpose(ctx0, m)); // [n_tps, n_kv, ns] ++ m = ggml_get_rows(ctx0, m, idx); // [n_tps, n_gather, ns] (F32) ++ m = ggml_cont(ctx0, ggml_transpose(ctx0, m)); // [n_gather, n_tps, ns] ++ m = ggml_reshape_4d(ctx0, m, n_gather, M->ne[1], 1, n_stream); ++ if (M->type != m->type) { ++ m = ggml_cast(ctx0, m, M->type); // flash-attn requires an F16 mask ++ } ++ *kq_mask = m; ++ } ++} ++ ++} // namespace paged_attn +diff --git a/src/paged-attn.h 
b/src/paged-attn.h +new file mode 100644 +index 0000000..c5b7bd7 +--- /dev/null ++++ b/src/paged-attn.h +@@ -0,0 +1,41 @@ ++#pragma once ++// Paged attention gather-read (patch 0003, experimental). ++// ++// Companion to the paged block placement in llama_kv_cache::find_slot (patch ++// 0002). Patch 0002 places a sequence's tokens at permuted, non-contiguous ++// fixed-size block cells, but attention still reads the whole [0, n_kv) window ++// (empty cells masked to -inf). This unit compacts that read: it gathers K, V ++// and the kq_mask down to ONLY the sequence's used (non-empty) cells before ++// build_attn_mha. ++// ++// Correctness: attention is permutation-invariant over the KV set, and dropping ++// already-masked empty cells removes only exp(-inf)=0 terms - so greedy output ++// is identical to stock. Gated behind env LLAMA_KV_PAGED; a no-op when unset. ++// ++// All logic lives here to keep the core files additive: build_attn gets one ++// call, llama_kv_cache_context gets two thin accessors, CMake gets one line. ++ ++#include ++#include ++ ++struct ggml_context; ++struct ggml_tensor; ++class llm_graph_result; ++class llama_kv_cache_context; ++ ++namespace paged_attn { ++ ++// true iff env LLAMA_KV_PAGED is set (evaluated once). ++bool active(); ++ ++// Gather K, V and the kq_mask down to the current sequence's non-empty cells. ++// No-op (returns immediately) unless active(). On return *k, *v and *kq_mask ++// point at the compacted tensors; pass them straight to build_attn_mha. 
++void gather(ggml_context * ctx0, ++ llm_graph_result * res, ++ const llama_kv_cache_context * mctx, ++ ggml_tensor ** k, ++ ggml_tensor ** v, ++ ggml_tensor ** kq_mask); ++ ++} // namespace paged_attn +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0004-paged-on-demand-block-allocation-env-LLAMA_KV_PAGED.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0004-paged-on-demand-block-allocation-env-LLAMA_KV_PAGED.patch new file mode 100644 index 000000000000..35ab5f942db1 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0004-paged-on-demand-block-allocation-env-LLAMA_KV_PAGED.patch @@ -0,0 +1,298 @@ +From 7c294973de28d1ac991505638d726acfb371d541 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Mon, 22 Jun 2026 10:50:35 +0200 +Subject: [PATCH] paged on-demand block allocation (env LLAMA_KV_PAGED) - patch + 0004 + +Drive the paged placement in find_slot through the vendored PagedKVManager +(patch 0001) instead of a fixed full-pool permutation. Blocks are popped from a +free pool on demand as the sequence crosses block boundaries (peak << full +reservation) and returned on sequence end (seq_rm full removal / clear). One +manager per (kv-cache, stream); all state lives in the new src/paged-alloc unit, +so the core kv-cache struct is untouched - find_slot/clear/seq_rm gain only a +gated call. Default off; stock path byte-identical. 
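The on-demand behaviour described in the commit message above can be illustrated with a minimal sketch. It is not the vendored `PagedKVManager`: `OnDemandPool` and all of its members are hypothetical names, it tracks a single sequence, and it assumes block ids map to cell ranges as `block_id * block_size + offset`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a paged block manager (illustration only):
// one sequence, fixed-size blocks popped from a free list on demand.
class OnDemandPool {
public:
    OnDemandPool(uint32_t n_blocks, uint32_t block_size)
        : block_size_(block_size) {
        // Push ids in reverse so that block 0 is handed out first.
        for (uint32_t b = n_blocks; b > 0; --b) {
            free_.push_back(b - 1);
        }
    }

    // Grow the block table to cover n_tokens logical positions.
    // Pops a block only for each boundary actually crossed; returns false
    // (changing nothing) if the pool cannot cover the request.
    bool allocate(size_t n_tokens) {
        const size_t need = (n_tokens + block_size_ - 1) / block_size_;
        if (need <= table_.size()) {
            return true; // range already covered, nothing to pop
        }
        if (need - table_.size() > free_.size()) {
            return false; // pool exhausted, caller falls back
        }
        while (table_.size() < need) {
            table_.push_back(free_.back());
            free_.pop_back();
        }
        return true;
    }

    // Physical cell index for a logical position that is already covered.
    uint32_t slot(uint32_t pos) const {
        assert(pos / block_size_ < table_.size());
        return table_[pos / block_size_] * block_size_ + pos % block_size_;
    }

    // Sequence end: return every held block to the free list.
    void release() {
        for (auto it = table_.rbegin(); it != table_.rend(); ++it) {
            free_.push_back(*it);
        }
        table_.clear();
    }

    size_t held() const { return table_.size(); }
    size_t free_blocks() const { return free_.size(); }

private:
    uint32_t block_size_;
    std::vector<uint32_t> free_;  // free block ids (stack)
    std::vector<uint32_t> table_; // logical block index -> physical block id
};
```

With a pool of 4 blocks of 16 tokens, a 5-token sequence holds one block and a 17-token sequence holds two, so peak usage follows the sequence length rather than the reserved window.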
+--- + src/CMakeLists.txt | 1 + + src/llama-kv-cache.cpp | 69 +++++++++++++++++---------- + src/paged-alloc.cpp | 106 +++++++++++++++++++++++++++++++++++++++++ + src/paged-alloc.h | 39 +++++++++++++++ + 4 files changed, 190 insertions(+), 25 deletions(-) + create mode 100644 src/paged-alloc.cpp + create mode 100644 src/paged-alloc.h + +diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt +index 58083b3..4d9d7d1 100644 +--- a/src/CMakeLists.txt ++++ b/src/CMakeLists.txt +@@ -26,6 +26,7 @@ add_library(llama + llama-kv-cache-iswa.cpp + paged-kv-manager.cpp + paged-attn.cpp ++ paged-alloc.cpp + llama-kv-cache-dsa.cpp + llama-memory.cpp + llama-memory-hybrid.cpp +diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp +index 30d02d7..1125d9a 100644 +--- a/src/llama-kv-cache.cpp ++++ b/src/llama-kv-cache.cpp +@@ -1,4 +1,5 @@ + #include "llama-kv-cache.h" ++#include "paged-alloc.h" + #include + #include + +@@ -381,6 +382,11 @@ llama_kv_cache::llama_kv_cache( + } + + void llama_kv_cache::clear(bool data) { ++ // [paged 0004] return all on-demand blocks to the pool on cache clear. ++ if (paged_alloc::active()) { ++ paged_alloc::release_all(this); ++ } ++ + for (uint32_t s = 0; s < n_stream; ++s) { + v_cells[s].reset(); + v_heads[s] = 0; +@@ -409,6 +415,16 @@ bool llama_kv_cache::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) { + p1 = std::numeric_limits::max(); + } + ++ // [paged 0004] free a stream's on-demand blocks when its whole sequence is ++ // removed (sequence end), so they return to the pool for reuse. 
++ if (paged_alloc::active() && p0 == 0 && p1 == std::numeric_limits::max()) { ++ if (seq_id >= 0) { ++ paged_alloc::release(this, (int) seq_to_stream[seq_id]); ++ } else { ++ paged_alloc::release_all(this); ++ } ++ } ++ + if (seq_id >= 0) { + auto & cells = v_cells[seq_to_stream[seq_id]]; + auto & head = v_heads[seq_to_stream[seq_id]]; +@@ -1030,36 +1046,39 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch, + // the correctness premise of paged attention. Enabled via LLAMA_KV_PAGED. + // Single-sequence scope (uses get_used() as the logical base); falls back + // to the normal allocator if the permuted cells aren't available. +- static const bool paged_mode = (std::getenv("LLAMA_KV_PAGED") != nullptr); +- if (paged_mode) { ++ // [paged 0004] On-demand block allocation. Patch 0002 proved attention is ++ // invariant to physical KV placement; here that placement is driven by ++ // the vendored PagedKVManager (patch 0001): blocks are popped from a free ++ // pool only as the sequence crosses block boundaries (peak << full ++ // reservation) and returned on sequence end. Enabled via LLAMA_KV_PAGED; ++ // falls back to the normal allocator on pool exhaustion or any conflict. 
++ if (paged_alloc::active()) { + const uint32_t bs = 16; // block size (tokens/block) +- const uint32_t nblk = cells.size() / bs; // blocks in this stream's pool ++ const uint32_t nblk = cells.size() / bs; // this stream's block budget + if (nblk >= 2) { +- // stride coprime to nblk => block-index permutation is a bijection +- uint32_t k = 1; +- for (uint32_t cand = (nblk / 2) | 1u; cand < nblk; cand += 2) { +- if (std::gcd(cand, nblk) == 1u) { k = cand; break; } +- } + const uint32_t base = cells.get_used(); +- bool ok = true; +- for (uint32_t i = 0; i < n_tokens; ++i) { +- const uint32_t L = base + i; +- const uint32_t b = L / bs; +- const uint32_t off = L % bs; +- if (b >= nblk) { ok = false; break; } +- const uint32_t phys = ((b * k) % nblk) * bs + off; // permuted block +- if (phys >= cells.size() || !cells.is_empty(phys)) { ok = false; break; } +- res.idxs[s].push_back(phys); +- } +- if (ok && res.idxs[s].size() == n_tokens) { +- if (std::getenv("LLAMA_KV_PAGED_DEBUG")) { +- fprintf(stderr, "[paged] seq placed %u tok at cells:", n_tokens); +- for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]); +- fprintf(stderr, " (k=%u nblk=%u base=%u)\n", k, nblk, base); ++ const int strm = (int) seq_to_stream[seq_id]; ++ std::vector placed; ++ if (paged_alloc::place(this, strm, base, n_tokens, bs, nblk, placed)) { ++ bool ok = (placed.size() == n_tokens); ++ for (uint32_t i = 0; ok && i < n_tokens; ++i) { ++ if (placed[i] >= cells.size() || !cells.is_empty(placed[i])) { ++ ok = false; ++ } ++ } ++ if (ok) { ++ for (uint32_t phys : placed) { ++ res.idxs[s].push_back(phys); ++ } ++ if (std::getenv("LLAMA_KV_PAGED_DEBUG")) { ++ fprintf(stderr, "[paged] stream %d placed %u tok at cells:", strm, n_tokens); ++ for (uint32_t z = 0; z < res.idxs[s].size() && z < 24; ++z) fprintf(stderr, " %u", res.idxs[s][z]); ++ fprintf(stderr, " (nblk=%u base=%u)\n", nblk, base); ++ } ++ continue; // on-demand paged placement succeeded + } +- 
continue; // paged placement succeeded for this sequence ++ res.idxs[s].clear(); // fall back to the normal allocator + } +- res.idxs[s].clear(); // fall back to the normal allocator + } + } + +diff --git a/src/paged-alloc.cpp b/src/paged-alloc.cpp +new file mode 100644 +index 0000000..1d13f9c +--- /dev/null ++++ b/src/paged-alloc.cpp +@@ -0,0 +1,106 @@ ++#include "paged-alloc.h" ++#include "paged-kv-manager.h" ++ ++#include ++#include ++#include ++#include ++#include ++ ++namespace paged_alloc { ++ ++bool active() { ++ static const bool a = (std::getenv("LLAMA_KV_PAGED") != nullptr); ++ return a; ++} ++ ++static bool debug() { ++ static const bool d = (std::getenv("LLAMA_KV_PAGED_DEBUG") != nullptr); ++ return d; ++} ++ ++namespace { ++ ++using key_t = std::pair; ++ ++// One PagedKVManager per (kv-cache, stream): each stream owns a separate ++// physical pool of cells.size() cells, so a manager's block ids map directly to ++// cell ranges within that stream's pool. The internal request id is always 0. ++std::map> g_managers; ++ ++paged::PagedKVManager * get_mgr(const void * cache, int stream, ++ uint32_t pool_blocks, uint32_t block_size) { ++ const key_t k{cache, stream}; ++ auto it = g_managers.find(k); ++ if (it == g_managers.end()) { ++ // enable_caching=false: prefix caching is a later patch; 0004 exercises ++ // only on-demand allocate / free. ++ auto mgr = std::make_unique( ++ (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/false); ++ it = g_managers.emplace(k, std::move(mgr)).first; ++ } ++ return it->second.get(); ++} ++ ++} // namespace ++ ++bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens, ++ uint32_t block_size, uint32_t pool_blocks, ++ std::vector & out) { ++ if (n_tokens == 0) { ++ return true; ++ } ++ ++ paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size); ++ ++ const size_t before = mgr->block_table(0).size(); ++ ++ // Grow the request to cover the highest logical position. 
The manager pops ++ // free blocks only for the boundaries actually crossed - that is the on- ++ // demand behavior; an already-covered range adds nothing. ++ if (!mgr->allocate(0, (size_t) base + n_tokens)) { ++ return false; // pool exhausted -> caller falls back to the stock path ++ } ++ ++ out.reserve(out.size() + n_tokens); ++ for (uint32_t i = 0; i < n_tokens; ++i) { ++ const int64_t s = mgr->slot(0, (int) (base + i)); ++ out.push_back((uint32_t) s); ++ } ++ ++ if (debug()) { ++ const size_t after = mgr->block_table(0).size(); ++ if (after != before) { ++ fprintf(stderr, ++ "[paged-alloc] cache=%p stream=%d grew %zu->%zu blocks " ++ "(budget=%u; base=%u +%u tok)\n", ++ cache, stream, before, after, pool_blocks, base, n_tokens); ++ } ++ } ++ ++ return true; ++} ++ ++void release(const void * cache, int stream) { ++ auto it = g_managers.find({cache, stream}); ++ if (it == g_managers.end()) { ++ return; ++ } ++ it->second->free(0); ++ g_managers.erase(it); ++ if (debug()) { ++ fprintf(stderr, "[paged-alloc] released cache=%p stream=%d\n", cache, stream); ++ } ++} ++ ++void release_all(const void * cache) { ++ for (auto it = g_managers.begin(); it != g_managers.end(); ) { ++ if (it->first.first == cache) { ++ it = g_managers.erase(it); ++ } else { ++ ++it; ++ } ++ } ++} ++ ++} // namespace paged_alloc +diff --git a/src/paged-alloc.h b/src/paged-alloc.h +new file mode 100644 +index 0000000..bf66665 +--- /dev/null ++++ b/src/paged-alloc.h +@@ -0,0 +1,39 @@ ++#pragma once ++// On-demand paged KV block allocation (patch 0004, experimental). ++// ++// Backs the paged placement in llama_kv_cache::find_slot (patch 0002) with the ++// vendored host-side PagedKVManager (patch 0001). Instead of mapping a ++// sequence's logical positions onto a fixed full-pool permutation, blocks are ++// popped from a free pool ON DEMAND as the sequence crosses block boundaries, ++// and returned to the pool on sequence end. 
This is where the paged memory- ++// capacity benefit begins: a short sequence holds only a few blocks, not the ++// whole reserved window. ++// ++// Gated behind env LLAMA_KV_PAGED; a no-op when unset. All state lives in this ++// unit (a static registry keyed by kv-cache + stream), so the core kv-cache ++// struct stays untouched - find_slot only gains a gated call. ++ ++#include ++#include ++ ++namespace paged_alloc { ++ ++// true iff env LLAMA_KV_PAGED is set (evaluated once). ++bool active(); ++ ++// Place n_tokens logical positions [base, base+n_tokens) of one stream on ++// demand, appending their physical cell indices to `out`. pool_blocks = ++// cells.size()/block_size is this stream's block budget. Returns false (leaving ++// `out` unchanged) on pool exhaustion, so the caller falls back to the stock ++// allocator. The caller still validates each returned cell is empty. ++bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens, ++ uint32_t block_size, uint32_t pool_blocks, ++ std::vector & out); ++ ++// Return a stream's blocks to the pool (sequence end). ++void release(const void * cache, int stream); ++ ++// Return every stream's blocks for a kv-cache (clear() / teardown). 
++void release_all(const void * cache); ++ ++} // namespace paged_alloc +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0006-paged-cross-request-prefix-caching-env-LLAMA_KV_PAGED.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0006-paged-cross-request-prefix-caching-env-LLAMA_KV_PAGED.patch new file mode 100644 index 000000000000..a1d4f198a513 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0006-paged-cross-request-prefix-caching-env-LLAMA_KV_PAGED.patch @@ -0,0 +1,143 @@ +From 141029beec609e87f24f6f6bba3ec842d7037862 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Mon, 22 Jun 2026 12:13:44 +0200 +Subject: [PATCH] paged cross-request prefix caching (env LLAMA_KV_PAGED) - + patch 0006 + +Add host-side cross-request prefix sharing to the vendored PagedKVManager +(patches 0001-0004): on placement, hash a new sequence's prefix blocks, reuse the +matching cached physical blocks (ref_cnt++) for the shared prefix and allocate +fresh blocks only for the divergent suffix. A shared block is freed only at +ref 0; copy-on-write privatises a still-shared (ref>1) block before a divergent +write so co-owners stay byte-correct. All logic lives in the vendored +src/paged-kv-manager unit (place_with_prefix / cow_block / ref-counting); the +core kv-cache files are untouched. Default off; gated behind LLAMA_KV_PAGED. + +Wiring the physical-cell reuse into find_slot so the engine itself skips +recompute needs core seq-membership changes and is left to a later patch.
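The prefix-sharing idea in the commit message above (hash full blocks, reuse on a hit, bump a reference count, stop at the first miss) can be sketched as follows. This is an illustration under stated assumptions, not the vendored code: `chained_block_hashes` and `PrefixCache` are hypothetical names, the hash is an FNV-1a-style mix applied per token rather than per byte, and the cache maps a content hash straight to a block id.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hash each FULL block, carrying the running state from block to block so
// that the hash of block i depends on every token in blocks 0..i.
// (FNV-1a-style mixing over whole tokens; an assumption for illustration.)
std::vector<uint64_t> chained_block_hashes(const std::vector<int32_t>& tokens,
                                           size_t block_size) {
    std::vector<uint64_t> out;
    uint64_t h = 0xcbf29ce484222325ULL; // FNV-1a 64-bit offset basis
    const size_t n_full = tokens.size() / block_size; // partial tail ignored
    for (size_t b = 0; b < n_full; ++b) {
        for (size_t i = 0; i < block_size; ++i) {
            h ^= (uint64_t) (uint32_t) tokens[b * block_size + i];
            h *= 0x100000001b3ULL; // FNV 64-bit prime
        }
        out.push_back(h);
    }
    return out;
}

// Hypothetical content cache: hash -> physical block id, plus ref counts.
struct PrefixCache {
    std::unordered_map<uint64_t, int> hash_to_block;
    std::unordered_map<int, int>      ref_cnt;

    // Publish an owner's computed blocks; the owner holds one reference each.
    void publish(const std::vector<uint64_t>& hashes,
                 const std::vector<int>& blocks) {
        for (size_t i = 0; i < hashes.size() && i < blocks.size(); ++i) {
            if (hash_to_block.emplace(hashes[i], blocks[i]).second) {
                ref_cnt[blocks[i]] = 1;
            }
        }
    }

    // Longest cached prefix: walk the hashes in order, stop at the first
    // miss, and take one extra reference on every reused block.
    std::vector<int> share(const std::vector<uint64_t>& hashes) {
        std::vector<int> hits;
        for (uint64_t h : hashes) {
            auto it = hash_to_block.find(h);
            if (it == hash_to_block.end()) {
                break;
            }
            ++ref_cnt[it->second];
            hits.push_back(it->second);
        }
        return hits;
    }
};
```

Because the hash state is chained, two prompts that agree on their first block but differ in the second produce the same first hash and different second hashes, so the lookup reuses exactly one block and the second block must be allocated fresh.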
+ +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + src/paged-kv-manager.cpp | 65 ++++++++++++++++++++++++++++++++++++++++ + src/paged-kv-manager.h | 23 ++++++++++++++ + 2 files changed, 88 insertions(+) + +diff --git a/src/paged-kv-manager.cpp b/src/paged-kv-manager.cpp +index ca0dcd8..4c6ee4c 100644 +--- a/src/paged-kv-manager.cpp ++++ b/src/paged-kv-manager.cpp +@@ -293,4 +293,69 @@ void PagedKVManager::cache_blocks(int seq_id, const std::vector& block + pool_.cache_full_blocks(req, /*num_cached=*/0, n_full, block_hashes); + } + ++// --------------------------------------------------------------------------- ++// Cross-request prefix caching + copy-on-write (patch 0006) ++// --------------------------------------------------------------------------- ++ ++size_t PagedKVManager::place_with_prefix(int seq_id, const std::vector& token_ids) { ++ auto& req = req_to_blocks_[seq_id]; ++ ++ // Longest cached prefix: hash the full blocks and stop at the first miss. ++ // A block hash transitively encodes its whole prefix (FNV chaining), so the ++ // first miss bounds the reusable prefix (vLLM find_longest_cache_hit). ++ const std::vector hashes = compute_block_hashes(token_ids); ++ std::vector hits; ++ for (uint64_t bh : hashes) { ++ KVCacheBlock* cb = pool_.get_cached_block(bh); ++ if (!cb) break; ++ hits.push_back(cb); ++ } ++ ++ // Reuse: ++ref_cnt (pulling warm blocks back out of the free list) then ++ // splice the shared physical blocks into this sequence's block table. ++ pool_.touch(hits); ++ req.insert(req.end(), hits.begin(), hits.end()); ++ ++ // Allocate fresh blocks only for the divergent suffix. ++ const size_t need = cdiv(token_ids.size(), block_size_); ++ if (need > req.size()) { ++ const size_t add = need - req.size(); ++ if (add > pool_.get_num_free_blocks()) { ++ // OOM: roll the sequence back (un-touch the shared prefix so no ref ++ // leaks) and report no placement; the caller falls back to stock. 
++ std::vector ordered(req.rbegin(), req.rend()); ++ pool_.free_blocks(ordered); ++ req.clear(); ++ return 0; ++ } ++ auto nb = pool_.get_new_blocks(add); ++ req.insert(req.end(), nb.begin(), nb.end()); ++ } ++ return hits.size(); ++} ++ ++std::pair PagedKVManager::cow_block(int seq_id, size_t bi) { ++ auto& req = req_to_blocks_.at(seq_id); ++ KVCacheBlock* old = req.at(bi); ++ if (old->ref_cnt <= 1) { ++ return { old->block_id, old->block_id }; // already private - no copy ++ } ++ // Private copy for this sequence. get_new_blocks sets the fresh block's ++ // ref_cnt to 1; free_blocks decrements the shared block, which stays >0 so ++ // it is NOT returned to the pool and the other owners are left untouched. ++ KVCacheBlock* fresh = pool_.get_new_blocks(1).front(); ++ pool_.free_blocks({ old }); ++ req[bi] = fresh; ++ return { old->block_id, fresh->block_id }; ++} ++ ++int PagedKVManager::block_ref_cnt_at(int seq_id, size_t bi) const { ++ return req_to_blocks_.at(seq_id).at(bi)->ref_cnt; ++} ++ ++size_t PagedKVManager::num_blocks(int seq_id) const { ++ auto it = req_to_blocks_.find(seq_id); ++ return it == req_to_blocks_.end() ? 0 : it->second.size(); ++} ++ + } // namespace paged +diff --git a/src/paged-kv-manager.h b/src/paged-kv-manager.h +index 740280a..34decbc 100644 +--- a/src/paged-kv-manager.h ++++ b/src/paged-kv-manager.h +@@ -14,6 +14,7 @@ + #include + #include + #include ++#include + + namespace paged { + +@@ -99,6 +100,28 @@ public: + size_t get_computed_blocks(const std::vector& block_hashes); // returns num cached tokens + void cache_blocks(int seq_id, const std::vector& block_hashes, size_t num_tokens); + ++ // Cross-request prefix caching + copy-on-write (patch 0006). ++ // ++ // Splice the longest cached prefix of token_ids into seq_id (reuse the ++ // shared physical blocks, ref_cnt++ so a block frees only at ref 0) and ++ // allocate fresh blocks only for the divergent suffix. 
Returns the number of ++ // shared (reused) blocks; the caller skips recomputing those tokens. On pool ++ // exhaustion the sequence is rolled back (no ref leak) and 0 is returned. ++ size_t place_with_prefix(int seq_id, const std::vector& token_ids); ++ ++ // Copy-on-write the block at logical index bi of seq_id. If that block is ++ // shared (ref_cnt>1), allocate a fresh private block, drop this seq's ref on ++ // the shared one (other owners keep it, content untouched) and install the ++ // fresh block at bi. Returns {old_block_id, new_block_id}; new==old when the ++ // block was already private (ref_cnt<=1) and no copy is needed. The caller ++ // copies the physical cell contents old_block_id -> new_block_id. ++ std::pair cow_block(int seq_id, size_t bi); ++ ++ // Introspection for the prefix-share gate (debug/tests). ++ int block_ref_cnt_at(int seq_id, size_t bi) const; ++ size_t num_blocks(int seq_id) const; ++ size_t num_free_blocks() const { return pool_.get_num_free_blocks(); } ++ + protected: + int block_size_; + BlockPool pool_; +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0007-paged-engine-prefix-recompute-skip-env-LLAMA_KV_PAGED.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0007-paged-engine-prefix-recompute-skip-env-LLAMA_KV_PAGED.patch new file mode 100644 index 000000000000..7a5dabb21ee7 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0007-paged-engine-prefix-recompute-skip-env-LLAMA_KV_PAGED.patch @@ -0,0 +1,534 @@ +From da20c1c0571e84bc76202d915d4bb82892a3392b Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Mon, 22 Jun 2026 12:46:28 +0200 +Subject: [PATCH] paged engine prefix recompute-skip (env LLAMA_KV_PAGED) - + patch 0007 + +Wire the host-side cross-request prefix cache (patch 0006) into the engine so a +new sequence physically SHARES the cached prefix blocks and skips recomputing the +shared prefix - the actual compute win that 0006 (which only proved the host-side 
+machinery + realised reuse via the stock seq_cp) did not yet deliver from the +paged path itself. + +Mechanism (all gated behind LLAMA_KV_PAGED; default off, stock byte-identical): + + * paged-alloc reworked from a per-stream, request-0, destroyed-on-free manager + into ONE persistent caching PagedKVManager per (kv-cache, stream) whose + requests are keyed by the real llama_seq_id. free(seq) now releases exactly + one sequence, so ref-counted shared blocks survive while another sharer holds + them. New seams: share_prefix (place_with_prefix -> shared prefix tokens), + slot, commit (publish a sequence into the content cache), ref-counted release, + plus ref/num-free introspection. + + * Two gated llama_kv_cache methods (the core seq-membership handling 0007 needs): + paged_prefix_share() reuses the longest cached content prefix for a sequence + and marks the shared physical cells as belonging to it (cells.seq_add) so the + engine's attention mask includes the already-computed prefix KV; the caller + then decodes ONLY the divergent suffix. paged_prefix_commit() publishes a + sequence's full blocks for later reuse. + + * find_slot's paged branch anchors placement on each sequence's own logical base + (ubatch.pos) and keys the manager request by seq_id, so an independently-freed + sequence and a shared prefix coexist in one unified pool. seq_rm/clear free + per-sequence (ref-counted) instead of nuking the whole stream. + + * paged-prefix-api: a thin gated shim so a caller holding only the public + llama.h can reach the seam and the introspection without the internal headers. + +Core existing-file touch: src/llama-kv-cache.{cpp,h}, +71 -3. Everything else is +additive vendored units. 
Verified on Qwen3-0.6B-Q8_0 (CPU, unified cache): a +sequence B sharing A's prefix decodes greedy tokens byte-identical to B from +scratch with the prefill computing ONLY the suffix (32 prefix tokens skipped) at +a block boundary AND mid-block; the shared block carries ref_cnt 2 while both +hold it, drops to 1 when one sharer is removed (survivor intact, re-shareable, no +use-after-free) and returns to the pool only when all sharers are freed. The +0004 serving gate (unified and non-unified) stays byte-identical stock vs paged. + +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + src/CMakeLists.txt | 1 + + src/llama-kv-cache.cpp | 66 +++++++++++++++++++++++-- + src/llama-kv-cache.h | 8 +++ + src/paged-alloc.cpp | 104 ++++++++++++++++++++++++++++++--------- + src/paged-alloc.h | 69 +++++++++++++++++++------- + src/paged-prefix-api.cpp | 48 ++++++++++++++++++ + src/paged-prefix-api.h | 27 ++++++++++ + 7 files changed, 280 insertions(+), 43 deletions(-) + create mode 100644 src/paged-prefix-api.cpp + create mode 100644 src/paged-prefix-api.h + +diff --git a/src/CMakeLists.txt b/src/CMakeLists.txt +index 4d9d7d1..432f42d 100644 +--- a/src/CMakeLists.txt ++++ b/src/CMakeLists.txt +@@ -27,6 +27,7 @@ add_library(llama + paged-kv-manager.cpp + paged-attn.cpp + paged-alloc.cpp ++ paged-prefix-api.cpp + llama-kv-cache-dsa.cpp + llama-memory.cpp + llama-memory-hybrid.cpp +diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp +index 1125d9a..7510ff9 100644 +--- a/src/llama-kv-cache.cpp ++++ b/src/llama-kv-cache.cpp +@@ -419,7 +419,7 @@ bool llama_kv_cache::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) { + // removed (sequence end), so they return to the pool for reuse. 
+ if (paged_alloc::active() && p0 == 0 && p1 == std::numeric_limits::max()) { + if (seq_id >= 0) { +- paged_alloc::release(this, (int) seq_to_stream[seq_id]); ++ paged_alloc::release(this, (int) seq_to_stream[seq_id], (int) seq_id); + } else { + paged_alloc::release_all(this); + } +@@ -1056,10 +1056,15 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch, + const uint32_t bs = 16; // block size (tokens/block) + const uint32_t nblk = cells.size() / bs; // this stream's block budget + if (nblk >= 2) { +- const uint32_t base = cells.get_used(); ++ // [paged 0007] Anchor placement on this sequence's own logical ++ // base position (ubatch.pos), not the shared used-count, and key ++ // the manager request by the real seq_id. slot(seq,pos) is then ++ // stable per sequence, so an independently-freed (ref-counted) ++ // sequence and a shared prefix can coexist in one unified pool. ++ const uint32_t base = (uint32_t) ubatch.pos[s*n_tokens]; + const int strm = (int) seq_to_stream[seq_id]; + std::vector placed; +- if (paged_alloc::place(this, strm, base, n_tokens, bs, nblk, placed)) { ++ if (paged_alloc::place(this, strm, (int) seq_id, base, n_tokens, bs, nblk, placed)) { + bool ok = (placed.size() == n_tokens); + for (uint32_t i = 0; ok && i < n_tokens; ++i) { + if (placed[i] >= cells.size() || !cells.is_empty(placed[i])) { +@@ -1165,6 +1170,61 @@ llama_kv_cache::slot_info llama_kv_cache::find_slot(const llama_ubatch & ubatch, + return res; + } + ++// [paged 0007] Cross-request prefix recompute-skip. ++// ++// Reuse a cached content prefix for seq_id: share_prefix() splices the longest ++// matching cached physical blocks into seq_id (ref_cnt++) and reserves fresh ++// blocks for the divergent suffix. We then mark the shared physical cells as ++// belonging to seq_id - those cells already hold the owner's computed KV at the ++// matching logical positions, so the caller decodes ONLY the suffix and the ++// prefix is never recomputed. 
Returns the number of shared prefix tokens. ++// Gated behind LLAMA_KV_PAGED; a no-op (returns 0) otherwise. ++int32_t llama_kv_cache::paged_prefix_share(llama_seq_id seq_id, const std::vector & tokens) { ++ if (!paged_alloc::active() || tokens.empty()) { ++ return 0; ++ } ++ const uint32_t bs = 16; ++ const uint32_t strm = (uint32_t) seq_to_stream[seq_id]; ++ auto & cells = v_cells[strm]; ++ const uint32_t nblk = cells.size() / bs; ++ if (nblk < 2) { ++ return 0; ++ } ++ ++ std::vector toks(tokens.begin(), tokens.end()); ++ const size_t kshare = paged_alloc::share_prefix(this, (int) strm, (int) seq_id, toks, bs, nblk); ++ ++ for (size_t p = 0; p < kshare; ++p) { ++ const int64_t cell = paged_alloc::slot(this, (int) strm, (int) seq_id, (int) p); ++ if (cell < 0 || (uint32_t) cell >= cells.size() || ++ cells.is_empty((uint32_t) cell) || ++ cells.pos_get((uint32_t) cell) != (llama_pos) p) { ++ // Owner cell missing / repurposed: cannot safely share. Roll the ++ // sequence back so the caller recomputes the whole prompt. ++ paged_alloc::release(this, (int) strm, (int) seq_id); ++ return 0; ++ } ++ if (!cells.seq_has((uint32_t) cell, seq_id)) { ++ cells.seq_add((uint32_t) cell, seq_id); ++ } ++ } ++ return (int32_t) kshare; ++} ++ ++// [paged 0007] Publish a sequence's full blocks into the content cache so a ++// later paged_prefix_share() can reuse them. Call after the sequence KV is ++// computed (its prefill decode has run). 
++void llama_kv_cache::paged_prefix_commit(llama_seq_id seq_id, const std::vector & tokens) { ++ if (!paged_alloc::active() || tokens.empty()) { ++ return; ++ } ++ const uint32_t bs = 16; ++ const uint32_t strm = (uint32_t) seq_to_stream[seq_id]; ++ const uint32_t nblk = v_cells[strm].size() / bs; ++ std::vector toks(tokens.begin(), tokens.end()); ++ paged_alloc::commit(this, (int) strm, (int) seq_id, toks, bs, nblk); ++} ++ + void llama_kv_cache::apply_ubatch(const slot_info & sinfo, const llama_ubatch & ubatch) { + // TODO: refactor [TAG_KV_CACHE_SHARE_CELLS] + if (other) { +diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h +index 494c0fb..f374ac6 100644 +--- a/src/llama-kv-cache.h ++++ b/src/llama-kv-cache.h +@@ -199,6 +199,14 @@ public: + // emplace the ubatch context into slot: [sinfo.idxs[0...ubatch.n_tokens - 1]] + void apply_ubatch(const slot_info & sinfo, const llama_ubatch & ubatch); + ++ // [paged 0007] Cross-request prefix recompute-skip (experimental, gated by ++ // env LLAMA_KV_PAGED). paged_prefix_share() reuses a cached content prefix ++ // for seq_id and returns the number of shared prefix tokens (the caller ++ // decodes only the suffix); paged_prefix_commit() publishes a sequence into ++ // the content cache for later reuse. No-ops when LLAMA_KV_PAGED is unset. ++ int32_t paged_prefix_share (llama_seq_id seq_id, const std::vector & tokens); ++ void paged_prefix_commit(llama_seq_id seq_id, const std::vector & tokens); ++ + // + // input API + // +diff --git a/src/paged-alloc.cpp b/src/paged-alloc.cpp +index 1d13f9c..c1027fb 100644 +--- a/src/paged-alloc.cpp ++++ b/src/paged-alloc.cpp +@@ -23,9 +23,13 @@ namespace { + + using key_t = std::pair; + +-// One PagedKVManager per (kv-cache, stream): each stream owns a separate +-// physical pool of cells.size() cells, so a manager's block ids map directly to +-// cell ranges within that stream's pool. The internal request id is always 0. 
++// One persistent PagedKVManager per (kv-cache, stream): each stream owns a ++// separate physical pool of cells.size() cells, so a manager's block ids map ++// directly to cell ranges within that stream's pool. Requests inside a manager ++// are keyed by the real llama_seq_id (NOT a fixed 0), so free(seq) releases one ++// sequence and shared blocks survive at ref>0 - this is what makes ref-counted ++// cross-request prefix sharing (0007) possible. Caching is enabled so commit() ++// can publish blocks and share_prefix() can hit them. + std::map> g_managers; + + paged::PagedKVManager * get_mgr(const void * cache, int stream, +@@ -33,18 +37,21 @@ paged::PagedKVManager * get_mgr(const void * cache, int stream, + const key_t k{cache, stream}; + auto it = g_managers.find(k); + if (it == g_managers.end()) { +- // enable_caching=false: prefix caching is a later patch; 0004 exercises +- // only on-demand allocate / free. + auto mgr = std::make_unique( +- (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/false); ++ (int32_t) pool_blocks, (int) block_size, /*enable_caching=*/true); + it = g_managers.emplace(k, std::move(mgr)).first; + } + return it->second.get(); + } + ++paged::PagedKVManager * find_mgr(const void * cache, int stream) { ++ auto it = g_managers.find({cache, stream}); ++ return it == g_managers.end() ? 
nullptr : it->second.get(); ++} ++ + } // namespace + +-bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens, ++bool place(const void * cache, int stream, int seq, uint32_t base, uint32_t n_tokens, + uint32_t block_size, uint32_t pool_blocks, + std::vector & out) { + if (n_tokens == 0) { +@@ -53,43 +60,79 @@ bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens, + + paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size); + +- const size_t before = mgr->block_table(0).size(); ++ const size_t before = mgr->block_table(seq).size(); + +- // Grow the request to cover the highest logical position. The manager pops +- // free blocks only for the boundaries actually crossed - that is the on- +- // demand behavior; an already-covered range adds nothing. +- if (!mgr->allocate(0, (size_t) base + n_tokens)) { ++ // Grow this sequence's request to cover its highest logical position. The ++ // manager pops free blocks only for boundaries actually crossed; if ++ // share_prefix() already reserved these blocks, this is a no-op. 
++ if (!mgr->allocate(seq, (size_t) base + n_tokens)) { + return false; // pool exhausted -> caller falls back to the stock path + } + + out.reserve(out.size() + n_tokens); + for (uint32_t i = 0; i < n_tokens; ++i) { +- const int64_t s = mgr->slot(0, (int) (base + i)); ++ const int64_t s = mgr->slot(seq, (int) (base + i)); + out.push_back((uint32_t) s); + } + + if (debug()) { +- const size_t after = mgr->block_table(0).size(); ++ const size_t after = mgr->block_table(seq).size(); + if (after != before) { + fprintf(stderr, +- "[paged-alloc] cache=%p stream=%d grew %zu->%zu blocks " ++ "[paged-alloc] cache=%p stream=%d seq=%d grew %zu->%zu blocks " + "(budget=%u; base=%u +%u tok)\n", +- cache, stream, before, after, pool_blocks, base, n_tokens); ++ cache, stream, seq, before, after, pool_blocks, base, n_tokens); + } + } + + return true; + } + +-void release(const void * cache, int stream) { +- auto it = g_managers.find({cache, stream}); +- if (it == g_managers.end()) { ++size_t share_prefix(const void * cache, int stream, int seq, ++ const std::vector & tokens, ++ uint32_t block_size, uint32_t pool_blocks) { ++ paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size); ++ const size_t shared_blocks = mgr->place_with_prefix(seq, tokens); ++ const size_t shared_tokens = shared_blocks * (size_t) block_size; ++ if (debug() && shared_blocks > 0) { ++ fprintf(stderr, ++ "[paged-alloc] cache=%p stream=%d seq=%d shares %zu prefix blocks " ++ "(%zu tokens) - prefix NOT recomputed\n", ++ cache, stream, seq, shared_blocks, shared_tokens); ++ } ++ return shared_tokens; ++} ++ ++int64_t slot(const void * cache, int stream, int seq, int pos) { ++ paged::PagedKVManager * mgr = find_mgr(cache, stream); ++ if (!mgr) { ++ return -1; ++ } ++ if ((size_t) (pos / mgr->block_size()) >= mgr->num_blocks(seq)) { ++ return -1; ++ } ++ return mgr->slot(seq, pos); ++} ++ ++void commit(const void * cache, int stream, int seq, ++ const std::vector & tokens, uint32_t 
block_size, uint32_t pool_blocks) { ++ paged::PagedKVManager * mgr = get_mgr(cache, stream, pool_blocks, block_size); ++ mgr->cache_blocks(seq, mgr->compute_block_hashes(tokens), tokens.size()); ++ if (debug()) { ++ fprintf(stderr, "[paged-alloc] cache=%p stream=%d seq=%d committed %zu tokens\n", ++ cache, stream, seq, tokens.size()); ++ } ++} ++ ++void release(const void * cache, int stream, int seq) { ++ paged::PagedKVManager * mgr = find_mgr(cache, stream); ++ if (!mgr) { + return; + } +- it->second->free(0); +- g_managers.erase(it); ++ mgr->free(seq); // ref-counted: shared blocks survive while another seq holds them + if (debug()) { +- fprintf(stderr, "[paged-alloc] released cache=%p stream=%d\n", cache, stream); ++ fprintf(stderr, "[paged-alloc] released cache=%p stream=%d seq=%d (free=%zu)\n", ++ cache, stream, seq, mgr->num_free_blocks()); + } + } + +@@ -103,4 +146,21 @@ void release_all(const void * cache) { + } + } + ++int ref_cnt_at(const void * cache, int stream, int seq, int pos, uint32_t block_size) { ++ paged::PagedKVManager * mgr = find_mgr(cache, stream); ++ if (!mgr) { ++ return -1; ++ } ++ const size_t bi = (size_t) pos / block_size; ++ if (bi >= mgr->num_blocks(seq)) { ++ return -1; ++ } ++ return mgr->block_ref_cnt_at(seq, bi); ++} ++ ++size_t num_free(const void * cache, int stream) { ++ paged::PagedKVManager * mgr = find_mgr(cache, stream); ++ return mgr ? mgr->num_free_blocks() : 0; ++} ++ + } // namespace paged_alloc +diff --git a/src/paged-alloc.h b/src/paged-alloc.h +index bf66665..88dedef 100644 +--- a/src/paged-alloc.h ++++ b/src/paged-alloc.h +@@ -1,17 +1,28 @@ + #pragma once +-// On-demand paged KV block allocation (patch 0004, experimental). ++// On-demand paged KV block allocation + cross-request prefix reuse ++// (patches 0004 + 0007, experimental). + // +-// Backs the paged placement in llama_kv_cache::find_slot (patch 0002) with the +-// vendored host-side PagedKVManager (patch 0001). 
Instead of mapping a +-// sequence's logical positions onto a fixed full-pool permutation, blocks are +-// popped from a free pool ON DEMAND as the sequence crosses block boundaries, +-// and returned to the pool on sequence end. This is where the paged memory- +-// capacity benefit begins: a short sequence holds only a few blocks, not the +-// whole reserved window. ++// Backs the paged placement in llama_kv_cache::find_slot with the vendored ++// host-side PagedKVManager (patch 0001). Two responsibilities: + // +-// Gated behind env LLAMA_KV_PAGED; a no-op when unset. All state lives in this +-// unit (a static registry keyed by kv-cache + stream), so the core kv-cache +-// struct stays untouched - find_slot only gains a gated call. ++// * On-demand allocation (0004): a sequence's logical positions are mapped to ++// physical cells block-by-block, popped from a free pool only as the ++// sequence grows and returned on sequence end. ++// ++// * Cross-request prefix reuse (0007): before a new sequence's suffix is ++// decoded, share_prefix() reuses the cached physical blocks of a matching ++// content prefix (ref_cnt++), so the engine shares the already-computed KV ++// cells and the caller decodes ONLY the divergent suffix - the prefix is not ++// recomputed. commit() publishes a sequence's full blocks into the content ++// cache so later sequences can hit them. Freeing is ref-counted: a shared ++// block returns to the pool only when every sharer has been released. ++// ++// One persistent PagedKVManager per (kv-cache, stream); requests inside it are ++// keyed by the real llama_seq_id, so free(seq) releases exactly one sequence and ++// shared blocks survive at ref>0. All state lives in this unit (a static ++// registry), so the core kv-cache struct stays untouched - find_slot gains only ++// gated calls. Gated behind env LLAMA_KV_PAGED; a no-op when unset. 
+ ++#include + #include + #include +@@ -21,19 +31,42 @@ namespace paged_alloc { + // true iff env LLAMA_KV_PAGED is set (evaluated once). + bool active(); + +-// Place n_tokens logical positions [base, base+n_tokens) of one stream on +-// demand, appending their physical cell indices to `out`. pool_blocks = +-// cells.size()/block_size is this stream's block budget. Returns false (leaving ++// Place n_tokens logical positions [base, base+n_tokens) of (cache,stream,seq) ++// on demand, appending their physical cell indices to `out`. pool_blocks = ++// cells.size()/block_size is the stream's block budget. Returns false (leaving + // `out` unchanged) on pool exhaustion, so the caller falls back to the stock + // allocator. The caller still validates each returned cell is empty. +-bool place(const void * cache, int stream, uint32_t base, uint32_t n_tokens, ++bool place(const void * cache, int stream, int seq, uint32_t base, uint32_t n_tokens, + uint32_t block_size, uint32_t pool_blocks, + std::vector & out); + +-// Return a stream's blocks to the pool (sequence end). +-void release(const void * cache, int stream); ++// [0007] Reuse the longest cached content prefix of `tokens` for (cache,stream, ++// seq): splice the shared physical blocks into seq (ref_cnt++) and reserve fresh ++// blocks for the divergent suffix. Returns the number of shared PREFIX TOKENS ++// (block-aligned); the caller marks those cells for seq and decodes only the ++// suffix. 0 if nothing matched or on pool exhaustion (sequence rolled back). ++size_t share_prefix(const void * cache, int stream, int seq, ++ const std::vector & tokens, ++ uint32_t block_size, uint32_t pool_blocks); ++ ++// [0007] Physical cell backing logical position `pos` of (cache,stream,seq), or ++// -1 if seq is unknown. Used to map a shared prefix position to its cell. ++int64_t slot(const void * cache, int stream, int seq, int pos); + +-// Return every stream's blocks for a kv-cache (clear() / teardown). 
++// [0007] Publish seq's full (block-aligned) blocks into the content cache so a ++// later share_prefix() can reuse them. Call after the sequence's KV is computed. ++void commit(const void * cache, int stream, int seq, ++ const std::vector & tokens, uint32_t block_size, uint32_t pool_blocks); ++ ++// Return one sequence's blocks to the pool (ref-counted; sequence end). ++void release(const void * cache, int stream, int seq); ++ ++// Drop every manager for a kv-cache (clear() / teardown). + void release_all(const void * cache); + ++// Introspection for the prefix-share gate (debug/tests). ref_cnt_at returns the ++// ref count of the block backing logical position `pos`, or -1 if unknown. ++int ref_cnt_at(const void * cache, int stream, int seq, int pos, uint32_t block_size); ++size_t num_free(const void * cache, int stream); ++ + } // namespace paged_alloc +diff --git a/src/paged-prefix-api.cpp b/src/paged-prefix-api.cpp +new file mode 100644 +index 0000000..8573cd2 +--- /dev/null ++++ b/src/paged-prefix-api.cpp +@@ -0,0 +1,48 @@ ++#include "paged-prefix-api.h" ++#include "paged-alloc.h" ++#include "llama-kv-cache.h" ++ ++#include ++ ++namespace paged_prefix_api { ++ ++static llama_kv_cache * kv_of(llama_context * ctx) { ++ // The driver targets a plain unified KV-cache model; dynamic_cast yields null ++ // for wrapped caches (iSWA / hybrid), where cross-request cell sharing does ++ // not apply, so the shim degrades to a safe no-op. 
++ return dynamic_cast<llama_kv_cache *>(llama_get_memory(ctx)); ++} ++ ++int32_t share(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n) { ++ llama_kv_cache * kv = kv_of(ctx); ++ if (!kv || n <= 0) { ++ return 0; ++ } ++ return kv->paged_prefix_share(seq, std::vector<llama_token>(tokens, tokens + n)); ++} ++ ++void commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n) { ++ llama_kv_cache * kv = kv_of(ctx); ++ if (!kv || n <= 0) { ++ return; ++ } ++ kv->paged_prefix_commit(seq, std::vector<llama_token>(tokens, tokens + n)); ++} ++ ++int ref_at(llama_context * ctx, llama_seq_id seq, int pos) { ++ llama_kv_cache * kv = kv_of(ctx); ++ if (!kv) { ++ return -1; ++ } ++ return paged_alloc::ref_cnt_at((const void *) kv, /*stream=*/0, (int) seq, pos, /*block_size=*/16); ++} ++ ++long num_free(llama_context * ctx) { ++ llama_kv_cache * kv = kv_of(ctx); ++ if (!kv) { ++ return 0; ++ } ++ return (long) paged_alloc::num_free((const void *) kv, /*stream=*/0); ++} ++ ++} // namespace paged_prefix_api +diff --git a/src/paged-prefix-api.h b/src/paged-prefix-api.h +new file mode 100644 +index 0000000..78a3864 +--- /dev/null ++++ b/src/paged-prefix-api.h +@@ -0,0 +1,29 @@ ++#pragma once ++// Thin test/diagnostic shim over the paged cross-request prefix engine seam ++// (patch 0007). Lets a driver that only includes the public llama.h reach the ++// gated llama_kv_cache::paged_prefix_* methods and the paged-alloc introspection ++// without pulling in the internal kv-cache headers. All entry points are no-ops ++// (return 0) unless env LLAMA_KV_PAGED is set. Experimental; not a stable API. ++ ++#include ++#include ++#include "llama.h" ++ ++namespace paged_prefix_api { ++ ++// Reuse the longest cached content prefix of [tokens, tokens+n) for `seq` and ++// return the number of shared prefix tokens (the caller decodes only the ++// suffix). 0 if nothing was shared.
++int32_t share(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n); ++ ++// Publish `seq`'s full blocks into the content cache (call after its KV is computed). ++void commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n); ++ ++// Ref count of the paged block backing logical position `pos` of `seq` (unified ++// stream 0), or -1 if unknown. ++int ref_at(llama_context * ctx, llama_seq_id seq, int pos); ++ ++// Number of free blocks in the unified stream-0 pool, or 0 if no manager. ++long num_free(llama_context * ctx); ++ ++} // namespace paged_prefix_api +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch new file mode 100644 index 000000000000..a739919ff569 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0008-paged-server-cross-request-prefix-share-env-LLAMA_KV_PAGED.patch @@ -0,0 +1,130 @@ +From 240758ef7e144619c750aaf1d3339051ecc29098 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Mon, 22 Jun 2026 17:02:22 +0200 +Subject: [PATCH] paged server cross-request prefix share (env LLAMA_KV_PAGED) + - patch 0008 + +Wire the paged cross-request prefix recompute-skip (patch 0007's engine seam, +paged_prefix_api::share/commit) into the llama-server continuous-batching loop +(update_slots) so CONCURRENT requests that share a long prefix physically reuse +one committed copy of the prefix blocks and prefill only their divergent suffix. +Patch 0007 proved the engine seam correct via a standalone driver, but the server +never called it: two concurrent shared-prefix requests each recomputed the full +prefix. The server's native prompt cache only reuses a slot's OWN prior prompt +(longest-common-prefix vs slot.prompt.tokens) - it does not share across distinct +concurrent slots. 
0008 adds that cross-slot share. + +Mechanism (all gated behind LLAMA_KV_PAGED; default off, stock byte-identical): + + * In update_slots prompt-processing, after the native n_past is computed and + only for a FRESH slot (n_past < one block, i.e. the native cache did not + already cover the prefix), call paged_prefix_api::share() to splice the + longest committed cross-request prefix into this sequence (ref_cnt++ on the + shared physical blocks) and advance n_past past it, so the batch fill computes + ONLY the suffix. The slot's own divergent tail cells are removed first so the + shared cells own [n_past, kshare) without colliding (the native path removes + these later anyway). The n_past < block gate guarantees any block-aligned + share the engine returns is strictly larger than n_past and therefore always + adopted, so the engine's reservation always matches the suffix-only batch and + never leaves stale blocks (which otherwise fragment the paged pool). + + * When a slot finishes prefill (SLOT_STATE_DONE_PROMPT -> GENERATING, the prefix + KV just computed), call paged_prefix_api::commit() to publish its prefix so + concurrent/later sharers can reuse it. + +The share() / commit() entry points are forward-declared (defined in libllama, +src/paged-prefix-api.cpp) to avoid pulling internal kv-cache headers into the +server translation unit. + +Verified in the server (32B NVFP4, CUDA, --kv-unified): with a live sequence +holding the prefix, K=16/32 concurrent shared-prefix requests prefill only their +~27-token suffix instead of the ~1003-token prefix (36x fewer prefill tokens; +K=16 23.9s -> 1.5s, K=32 57.9s -> 2.3s), the engine logs "shares ... prefix +blocks - NOT recomputed" with ref_cnt>1, and greedy output stays within the +documented CUDA batch-shape non-determinism band (stock native prompt-caching +shows the same magnitude). Cross-request sharing requires the unified KV cache. 
+ +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + tools/server/server-context.cpp | 50 +++++++++++++++++++++++++++++++++ + 1 file changed, 50 insertions(+) + +diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp +index 39b7eb2..b5f9d37 100644 +--- a/tools/server/server-context.cpp ++++ b/tools/server/server-context.cpp +@@ -16,6 +16,16 @@ + #include "mtmd.h" + #include "mtmd-helper.h" + ++// [paged 0008] Cross-request prefix recompute-skip shim. share()/commit() are ++// defined in libllama (src/paged-prefix-api.cpp, patch 0007) and are no-ops ++// unless env LLAMA_KV_PAGED is set. Declared here so the paged cross-slot prefix ++// cache wires into update_slots() without pulling in internal kv-cache headers. ++// Fully gated; stock (paged off) is byte-identical. ++namespace paged_prefix_api { ++ int32_t share (llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n); ++ void commit(llama_context * ctx, llama_seq_id seq, const llama_token * tokens, int n); ++} ++ + #include + #include + #include +@@ -3335,6 +3345,37 @@ private: + } + } + ++ // [paged 0008] Cross-request prefix recompute-skip. The native prompt cache ++ // above only reuses THIS slot's own prior prompt; when the paged KV ++ // engine is active, also reuse a committed CROSS-slot prefix so ++ // concurrent requests sharing a long prefix skip recompute. Gated on ++ // LLAMA_KV_PAGED (paged_kv_share static); stock stays byte-identical. ++ static const bool paged_kv_share = getenv("LLAMA_KV_PAGED") != nullptr; ++ // Only attempt the cross-request share on a FRESH slot (the native ++ // cache above did not already cover the prefix). 
With n_past < a ++ // block, any block-aligned share the engine returns is strictly ++ // larger than n_past and is therefore always adopted below - so the ++ // engine's full-prompt reservation always matches the suffix-only ++ // submission and never leaves stale blocks (which fragmented the ++ // paged pool and crashed the server under high fan-out otherwise). ++ if (paged_kv_share && n_past < 16 && slot.task->params.cache_prompt && !input_tokens.has_mtmd) { ++ const llama_tokens ptoks = input_tokens.get_text_tokens(); ++ // Drop this slot's own cells beyond the natively-cached prefix before ++ // splicing the shared physical prefix in, so the shared cells can own ++ // [n_past, kshare) without colliding (the native path removes exactly ++ // these later; a no-op for a fresh slot). ++ common_context_seq_rm(ctx_tgt, slot.id, n_past, -1); ++ const int32_t kshare = paged_prefix_api::share(ctx_tgt, slot.id, ptoks.data(), (int) ptoks.size()); ++ if (kshare > n_past) { ++ slot.prompt.tokens.keep_first(n_past); ++ for (int i = n_past; i < kshare; ++i) { ++ slot.prompt.tokens.push_back(ptoks[i]); ++ } ++ n_past = kshare; ++ SLT_INF(slot, "paged: reusing %d cross-request shared prefix tokens - not recomputed\n", n_past); ++ } ++ } ++ + // [TAG_PROMPT_LOGITS] + if (n_past == slot.task->n_tokens() && n_past > 0) { + SLT_WRN(slot, "need to evaluate at least 1 token for each active slot (n_past = %d, task.n_tokens() = %d)\n", n_past, slot.task->n_tokens()); +@@ -3741,6 +3782,15 @@ private: + // prompt evaluated for next-token prediction + slot.state = SLOT_STATE_GENERATING; + ++ // [paged 0008] Publish this slot's computed prefix so concurrent/later ++ // slots can share it (no-op unless LLAMA_KV_PAGED). The prefill decode ++ // for [0, n_tokens) has just run, so the prefix KV is computed. 
++ static const bool paged_kv_commit = getenv("LLAMA_KV_PAGED") != nullptr; ++ if (paged_kv_commit && slot.task->params.cache_prompt && !slot.prompt.tokens.has_mtmd) { ++ const llama_tokens ctoks = slot.prompt.tokens.get_text_tokens(); ++ paged_prefix_api::commit(ctx_tgt, slot.id, ctoks.data(), (int) ctoks.size()); ++ } ++ + if (slot.can_speculate()) { + common_speculative_begin(spec.get(), slot.id, slot.prompt.tokens.get_text_tokens()); + } +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0009-paged-in-kernel-decode-read-env-LLAMA_KV_PAGED-patch.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0009-paged-in-kernel-decode-read-env-LLAMA_KV_PAGED-patch.patch new file mode 100644 index 000000000000..342e313f854a --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0009-paged-in-kernel-decode-read-env-LLAMA_KV_PAGED-patch.patch @@ -0,0 +1,609 @@ +From 59490d82e4d0d4ad05ffb5ca3cccc668f4a75281 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Mon, 22 Jun 2026 20:03:17 +0200 +Subject: [PATCH] paged in-kernel decode read (env LLAMA_KV_PAGED) - patch 0009 + +Replace the per-layer per-step gather (patch 0003: ggml_get_rows of K/V into a +contiguous buffer) with an in-kernel paged read on the decode step. build_attn +passes the UNMODIFIED physical K/V views plus a block table (src[5] of +ggml_flash_attn_ext: an I32 [n_view, n_stream] position-ordered physical-cell +index, padded to FATTN_KQ_STRIDE). The CUDA fattn vec kernel and the CPU +reference map logical KV index j -> physical cell block_table[seq*ne11+j] and +read K_base+cell*nb11 / V_base+cell*nb21 in place, so the get_rows of K and V +(the bulk of the gather) is gone. The mask stays a small compacted [n_view] +causal mask in the same position order; KV_max / parallel_blocks / stream_k +split-K are unchanged. 
The decode shape is forced onto the vec kernel (the only +one wired for the block table); a nullptr block table => the stock contiguous +read, byte-identical. + +Token-POSITION ordering keeps the flash-attn reduction order identical to stock, +so CPU-paged logits == CPU-stock bit-for-bit (verified: 4-stream FA greedy, 64 +tokens). On GPU paged(vec) == stock(vec) at batch 1; at batch>1 it stays within +the documented vec-vs-mma non-determinism band. Decode step at batch 32 / 1024 +ctx on GB10 (Qwen3-32B NVFP4): paged-gather 1279 ms -> in-kernel 696 ms (-46%), +recovering the gather regression to stock parity (647 ms). Gated behind +LLAMA_KV_PAGED; no-op (stock byte-identical) when unset. + +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + ggml/include/ggml.h | 6 ++ + ggml/src/ggml-cpu/ops.cpp | 10 ++- + ggml/src/ggml-cuda/fattn-common.cuh | 8 +- + ggml/src/ggml-cuda/fattn-mma-f16.cuh | 4 +- + ggml/src/ggml-cuda/fattn-tile.cuh | 4 +- + ggml/src/ggml-cuda/fattn-vec.cuh | 25 +++++-- + ggml/src/ggml-cuda/fattn-wmma-f16.cu | 4 +- + ggml/src/ggml-cuda/fattn.cu | 9 +++ + ggml/src/ggml.c | 14 ++++ + src/llama-graph.cpp | 23 ++++-- + src/llama-graph.h | 3 +- + src/llama-kv-cache.cpp | 31 ++++++++ + src/llama-kv-cache.h | 4 + + src/paged-attn.cpp | 107 +++++++++++++++++++++++++++ + src/paged-attn.h | 18 +++++ + 15 files changed, 248 insertions(+), 22 deletions(-) + +diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h +index d6807b6..823f5a9 100644 +--- a/ggml/include/ggml.h ++++ b/ggml/include/ggml.h +@@ -2427,6 +2427,12 @@ extern "C" { + struct ggml_tensor * a, + struct ggml_tensor * sinks); + ++ // [paged] optional block table in src[5]: I32 [n_kv_logical, n_stream]; maps each ++ // logical KV index to the physical cell within K/V. nullptr => stock contiguous read. 
++ GGML_API void ggml_flash_attn_ext_set_block_table( ++ struct ggml_tensor * a, ++ struct ggml_tensor * block_table); ++ + // TODO: needs to be adapted to ggml_flash_attn_ext + GGML_API struct ggml_tensor * ggml_flash_attn_back( + struct ggml_context * ctx, +diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp +index 74611dc..63c07a2 100644 +--- a/ggml/src/ggml-cpu/ops.cpp ++++ b/ggml/src/ggml-cpu/ops.cpp +@@ -8330,6 +8330,8 @@ static void ggml_compute_forward_flash_attn_ext_f16_one_chunk( + const ggml_tensor * v = dst->src[2]; + const ggml_tensor * mask = dst->src[3]; + const ggml_tensor * sinks = dst->src[4]; ++ const ggml_tensor * block_table = dst->src[5]; // [paged] logical->physical cell map (src[5]) ++ const int32_t * bt = block_table ? (const int32_t *) block_table->data : nullptr; + + GGML_TENSOR_LOCALS(int64_t, neq, q, ne) + GGML_TENSOR_LOCALS(size_t, nbq, q, nb) +@@ -8449,7 +8451,9 @@ static void ggml_compute_forward_flash_attn_ext_f16_one_chunk( + + float s; // KQ value + +- const char * k_data = (const char *) k->data + ( ic*nbk1 + ik2*nbk2 + ik3*nbk3); ++ // [paged] map the logical KV index ic to its physical cell via the block table. ++ const int64_t ic_phys = bt ? 
(int64_t) bt[ik3*nek1 + ic] : ic; ++ const char * k_data = (const char *) k->data + ( ic_phys*nbk1 + ik2*nbk2 + ik3*nbk3); + kq_vec_dot(DK, &s, 0, k_data, 0, Q_q, 0, 1); + + s = s*scale; // scale KQ value +@@ -8465,7 +8469,7 @@ static void ggml_compute_forward_flash_attn_ext_f16_one_chunk( + float ms = 1.0f; // upon new higher max val, scale VKQ and KQ sum with this value + float vs = 1.0f; // post-softmax KQ value, expf(s - M) + +- const char * v_data = ((const char *) v->data + (ic*nbv1 + iv2*nbv2 + iv3*nbv3)); ++ const char * v_data = ((const char *) v->data + (ic_phys*nbv1 + iv2*nbv2 + iv3*nbv3)); + + if (v->type == GGML_TYPE_F16) { + if (s > M) { +@@ -9021,7 +9025,7 @@ static void ggml_compute_forward_flash_attn_ext_f16( + const int64_t dr = (nr + nchunk - 1) / nchunk; + + static constexpr int64_t Q_TILE_SZ = ggml_fa_tile_config::Q; +- bool use_tiled = !use_ref && ++ bool use_tiled = !use_ref && dst->src[5] == nullptr && // [paged] one_chunk honors the block table + (q->type == GGML_TYPE_F32 && + kv_is_f32_or_f16 && + k->type == v->type && +diff --git a/ggml/src/ggml-cuda/fattn-common.cuh b/ggml/src/ggml-cuda/fattn-common.cuh +index 8dfa51a..3c6ddd5 100644 +--- a/ggml/src/ggml-cuda/fattn-common.cuh ++++ b/ggml/src/ggml-cuda/fattn-common.cuh +@@ -39,7 +39,8 @@ typedef void (* fattn_kernel_t)( + const int32_t nb11, const int32_t nb12, const int64_t nb13, + const int32_t nb21, const int32_t nb22, const int64_t nb23, + const int32_t ne31, const int32_t ne32, const int32_t ne33, +- const int32_t nb31, const int32_t nb32, const int64_t nb33); ++ const int32_t nb31, const int32_t nb32, const int64_t nb33, ++ const int * __restrict__ block_table); + + typedef float (*vec_dot_KQ_t)( + const char * __restrict__ K_c, const void * __restrict__ Q_v, const int * __restrict__ Q_q8 , const void * __restrict__ Q_ds); +@@ -981,6 +982,8 @@ void launch_fattn( + + const ggml_tensor * mask = dst->src[3]; + const ggml_tensor * sinks = dst->src[4]; ++ const ggml_tensor * block_table 
= dst->src[5]; // [paged] optional logical->physical map ++ const int * bt_ptr = block_table ? (const int *) block_table->data : nullptr; + + ggml_tensor * KQV = dst; + +@@ -1217,7 +1220,8 @@ void launch_fattn( + K->ne[0], K->ne[1], K->ne[2], K->ne[3], nb11, nb12, nb13, + nb21, nb22, nb23, + mask ? mask->ne[1] : 0, mask ? mask->ne[2] : 0, mask ? mask->ne[3] : 0, +- mask ? mask->nb[1] : 0, mask ? mask->nb[2] : 0, mask ? mask->nb[3] : 0 ++ mask ? mask->nb[1] : 0, mask ? mask->nb[2] : 0, mask ? mask->nb[3] : 0, ++ bt_ptr + ); + CUDA_CHECK(cudaGetLastError()); + +diff --git a/ggml/src/ggml-cuda/fattn-mma-f16.cuh b/ggml/src/ggml-cuda/fattn-mma-f16.cuh +index 83478a0..0a92cd6 100644 +--- a/ggml/src/ggml-cuda/fattn-mma-f16.cuh ++++ b/ggml/src/ggml-cuda/fattn-mma-f16.cuh +@@ -1723,7 +1723,9 @@ static __global__ void flash_attn_ext_f16( + const int32_t nb11, const int32_t nb12, const int64_t nb13, + const int32_t nb21, const int32_t nb22, const int64_t nb23, + const int32_t ne31, const int32_t ne32, const int32_t ne33, +- const int32_t nb31, const int32_t nb32, const int64_t nb33) { ++ const int32_t nb31, const int32_t nb32, const int64_t nb33, ++ const int * __restrict__ block_table) { ++ GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel + ggml_cuda_pdl_sync(); // TODO optimize placement + #if defined(FLASH_ATTN_AVAILABLE) && (defined(VOLTA_MMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE) || defined(AMD_MFMA_AVAILABLE)) + const char * GGML_CUDA_RESTRICT Q = Q_ptr; +diff --git a/ggml/src/ggml-cuda/fattn-tile.cuh b/ggml/src/ggml-cuda/fattn-tile.cuh +index 0a09981..0ff14e6 100644 +--- a/ggml/src/ggml-cuda/fattn-tile.cuh ++++ b/ggml/src/ggml-cuda/fattn-tile.cuh +@@ -808,7 +808,9 @@ static __global__ void flash_attn_tile( + const int32_t nb11, const int32_t nb12, const int64_t nb13, + const int32_t nb21, const int32_t nb22, const int64_t nb23, + const int32_t ne31, const int32_t ne32, const int32_t ne33, +- const 
int32_t nb31, const int32_t nb32, const int64_t nb33) { ++ const int32_t nb31, const int32_t nb32, const int64_t nb33, ++ const int * __restrict__ block_table) { ++ GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel + #ifdef FLASH_ATTN_AVAILABLE + const char * GGML_CUDA_RESTRICT Q = Q_ptr; + const char * GGML_CUDA_RESTRICT K = K_ptr; +diff --git a/ggml/src/ggml-cuda/fattn-vec.cuh b/ggml/src/ggml-cuda/fattn-vec.cuh +index 69dd936..a09e2fb 100644 +--- a/ggml/src/ggml-cuda/fattn-vec.cuh ++++ b/ggml/src/ggml-cuda/fattn-vec.cuh +@@ -39,7 +39,8 @@ static __global__ void flash_attn_ext_vec( + const int32_t nb11, const int32_t nb12, const int64_t nb13, + const int32_t nb21, const int32_t nb22, const int64_t nb23, + const int32_t ne31, const int32_t ne32, const int32_t ne33, +- const int32_t nb31, const int32_t nb32, const int64_t nb33) { ++ const int32_t nb31, const int32_t nb32, const int64_t nb33, ++ const int * __restrict__ block_table) { + ggml_cuda_pdl_lc(); + #ifdef FLASH_ATTN_AVAILABLE + const char * GGML_CUDA_RESTRICT Q = Q_ptr; +@@ -61,7 +62,7 @@ static __global__ void flash_attn_ext_vec( + nb11, nb12, nb13, + nb21, nb22, nb23, + ne31, ne32, ne33, +- nb31, nb32, nb33); ++ nb31, nb32, nb33, block_table); + NO_DEVICE_CODE; + return; + } +@@ -110,6 +111,14 @@ static __global__ void flash_attn_ext_vec( + K += nb13*sequence + nb12*(head / gqa_ratio); + V += nb23*sequence + nb22*(head / gqa_ratio); + ++ // [paged] in-kernel block-table read: logical KV index j -> physical cell ++ // block_table[sequence*ne11 + j]; read K0 + cell*nb11 / V0 + cell*nb21. The ++ // mask/KV_max stay logical (the table is in token-position order). nullptr => ++ // the stock contiguous read below. ++ const char * GGML_CUDA_RESTRICT K0 = K; ++ const char * GGML_CUDA_RESTRICT V0 = V; ++ const int * GGML_CUDA_RESTRICT bt = block_table ? 
block_table + (size_t) sequence*ne11 : nullptr; ++ + const half * maskh = (const half *) (mask + nb33*(sequence % ne33) + nb31*ic0); + + const float slope = get_alibi_slope(max_bias, head, n_head_log2, m0, m1); +@@ -267,10 +276,11 @@ static __global__ void flash_attn_ext_vec( + #pragma unroll + for (int i_KQ_0 = 0; i_KQ_0 < nthreads_KQ; ++i_KQ_0) { + const int i_KQ = threadIdx.y*WARP_SIZE + (nthreads_KQ == WARP_SIZE ? 0 : (threadIdx.x & ~(nthreads_KQ-1))) + i_KQ_0; ++ const char * GGML_CUDA_RESTRICT K_blk = bt ? (K0 + (int64_t) bt[k_VKQ_0 + i_KQ]*nb11) : (K + i_KQ*nb11); + + #pragma unroll + for (int j = 0; j < ncols; ++j) { +- float sum = vec_dot_KQ(K + i_KQ*nb11, Q_reg[j], Q_i32[j], Q_ds[j]); ++ float sum = vec_dot_KQ(K_blk, Q_reg[j], Q_i32[j], Q_ds[j]); + sum = warp_reduce_sum(sum); + + if (use_logit_softcap) { +@@ -324,6 +334,7 @@ static __global__ void flash_attn_ext_vec( + #pragma unroll + for (int k0 = 0; k0 < WARP_SIZE; k0 += V_cols_per_iter) { + const int k = threadIdx.y*WARP_SIZE + k0 + (nthreads_V == WARP_SIZE ? 0 : threadIdx.x / nthreads_V); ++ const char * GGML_CUDA_RESTRICT V_blk = bt ? (V0 + (int64_t) bt[k_VKQ_0 + k]*nb21) : (V + k*nb21); + + #ifdef V_DOT2_F32_F16_AVAILABLE + half2 KQ_k[ncols]; +@@ -336,14 +347,14 @@ static __global__ void flash_attn_ext_vec( + half2 tmp[V_rows_per_thread/2]; + if constexpr (type_V == GGML_TYPE_BF16) { + float2 tmp_f[V_rows_per_thread/2]; +- dequantize_V(V + k*nb21, tmp_f, ++ dequantize_V(V_blk, tmp_f, + 2*i_VKQ_0 + (nthreads_V == WARP_SIZE ? threadIdx.x : threadIdx.x % nthreads_V)*V_rows_per_thread); + #pragma unroll + for (int i_VKQ_1 = 0; i_VKQ_1 < V_rows_per_thread/2; ++i_VKQ_1) { + tmp[i_VKQ_1] = __float22half2_rn(tmp_f[i_VKQ_1]); + } + } else { +- dequantize_V(V + k*nb21, tmp, ++ dequantize_V(V_blk, tmp, + 2*i_VKQ_0 + (nthreads_V == WARP_SIZE ? 
threadIdx.x : threadIdx.x % nthreads_V)*V_rows_per_thread); + } + #pragma unroll +@@ -363,7 +374,7 @@ static __global__ void flash_attn_ext_vec( + #pragma unroll + for (int i_VKQ_0 = 0; i_VKQ_0 < D/2; i_VKQ_0 += nthreads_V*V_rows_per_thread/2) { + float2 tmp[V_rows_per_thread/2]; +- dequantize_V(V + k*nb21, tmp, ++ dequantize_V(V_blk, tmp, + 2*i_VKQ_0 + (nthreads_V == WARP_SIZE ? threadIdx.x : threadIdx.x % nthreads_V)*V_rows_per_thread); + #pragma unroll + for (int i_VKQ_1 = 0; i_VKQ_1 < V_rows_per_thread/2; ++i_VKQ_1) { +@@ -522,7 +533,7 @@ static __global__ void flash_attn_ext_vec( + nb11, nb12, nb13, + nb21, nb22, nb23, + ne31, ne32, ne33, +- nb31, nb32, nb33); ++ nb31, nb32, nb33, block_table); + NO_DEVICE_CODE; + #endif // FLASH_ATTN_AVAILABLE + } +diff --git a/ggml/src/ggml-cuda/fattn-wmma-f16.cu b/ggml/src/ggml-cuda/fattn-wmma-f16.cu +index 6850716..5357849 100644 +--- a/ggml/src/ggml-cuda/fattn-wmma-f16.cu ++++ b/ggml/src/ggml-cuda/fattn-wmma-f16.cu +@@ -44,7 +44,9 @@ static __global__ void flash_attn_ext_f16( + const int32_t nb11, const int32_t nb12, const int64_t nb13, + const int32_t nb21, const int32_t nb22, const int64_t nb23, + const int32_t ne31, const int32_t ne32, const int32_t ne33, +- const int32_t nb31, const int32_t nb32, const int64_t nb33) { ++ const int32_t nb31, const int32_t nb32, const int64_t nb33, ++ const int * __restrict__ block_table) { ++ GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel + #if defined(FLASH_ATTN_AVAILABLE) && (defined(GGML_HIP_ROCWMMA_FATTN) && defined(GGML_USE_WMMA_FATTN)) + const char * GGML_CUDA_RESTRICT Q = Q_ptr; + const char * GGML_CUDA_RESTRICT K = K_ptr; +diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu +index d6c501b..e3771ee 100644 +--- a/ggml/src/ggml-cuda/fattn.cu ++++ b/ggml/src/ggml-cuda/fattn.cu +@@ -574,6 +574,15 @@ size_t ggml_cuda_flash_attn_ext_get_alloc_size(int device, const ggml_tensor * d + + void 
ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst) { + ggml_cuda_set_device(ctx.device); ++ ++ // [paged] the block table (src[5]) is only honored by the vec kernel's ++ // in-kernel read; force it. build_attn only sets it for a vec-supported ++ // 1-token-per-stream decode shape. ++ if (dst->src[5] != nullptr) { ++ ggml_cuda_flash_attn_ext_vec(ctx, dst); ++ return; ++ } ++ + switch (ggml_cuda_get_best_fattn_kernel(ggml_cuda_get_device(), dst)) { + case BEST_FATTN_KERNEL_NONE: + GGML_ABORT("fatal error"); +diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c +index b43016c..adbe52b 100644 +--- a/ggml/src/ggml.c ++++ b/ggml/src/ggml.c +@@ -5442,6 +5442,20 @@ void ggml_flash_attn_ext_add_sinks( + a->src[4] = sinks; + } + ++void ggml_flash_attn_ext_set_block_table( ++ struct ggml_tensor * a, ++ struct ggml_tensor * block_table) { ++ if (!block_table) { ++ a->src[5] = NULL; ++ return; ++ } ++ ++ GGML_ASSERT(a->op == GGML_OP_FLASH_ATTN_EXT); ++ GGML_ASSERT(block_table->type == GGML_TYPE_I32); ++ ++ a->src[5] = block_table; ++} ++ + // ggml_flash_attn_back + + struct ggml_tensor * ggml_flash_attn_back( +diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp +index b59d2a5..abdb48d 100644 +--- a/src/llama-graph.cpp ++++ b/src/llama-graph.cpp +@@ -2074,7 +2074,8 @@ ggml_tensor * llm_graph_context::build_attn_mha( + ggml_tensor * sinks, + ggml_tensor * v_mla, + float kq_scale, +- int il) const { ++ int il, ++ ggml_tensor * block_table) const { + const bool v_trans = v->nb[1] > v->nb[2]; + + // split the batch into streams if needed +@@ -2109,6 +2110,9 @@ ggml_tensor * llm_graph_context::build_attn_mha( + hparams.attn_soft_cap ? 
hparams.f_attn_logit_softcapping : 0.0f); + cb(cur, LLAMA_TENSOR_NAME_FATTN, il); + ++ if (block_table) { ++ ggml_flash_attn_ext_set_block_table(cur, block_table); ++ } + ggml_flash_attn_ext_add_sinks(cur, sinks); + ggml_flash_attn_ext_set_prec (cur, GGML_PREC_F32); + +@@ -2358,12 +2362,19 @@ ggml_tensor * llm_graph_context::build_attn( + ggml_tensor * k = mctx_cur->get_k(ctx0, il); + ggml_tensor * v = mctx_cur->get_v(ctx0, il); + +- // [paged 0003] gather K, V and the mask to the sequence's used cells only +- // (no-op unless env LLAMA_KV_PAGED is set). +- ggml_tensor * kq_mask_g = kq_mask; +- paged_attn::gather(ctx0, res, mctx_cur, &k, &v, &kq_mask_g); ++ // [paged] decode read: when paging is active and this is a 1-token-per-stream ++ // decode step, present K/V as n_gather views + a block table so the fattn ++ // kernel reads the sequence's cells in-kernel (no get_rows of K/V). Else ++ // fall back to the gather-read (prefill, transposed V, or env off). All a ++ // no-op unless env LLAMA_KV_PAGED is set => stock byte-identical. 
++ ggml_tensor * kq_mask_g = kq_mask; ++ ggml_tensor * block_table = nullptr; ++ const bool is_decode = (q_cur->ne[2] == k->ne[3]); // 1 query token per stream ++ if (!(is_decode && paged_attn::in_kernel_decode(ctx0, res, mctx_cur, &k, &v, &kq_mask_g, &block_table))) { ++ paged_attn::gather(ctx0, res, mctx_cur, &k, &v, &kq_mask_g); ++ } + +- ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask_g, sinks, v_mla, kq_scale, il); ++ ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask_g, sinks, v_mla, kq_scale, il, block_table); + cb(cur, "kqv_out", il); + + if (inp->self_v_rot) { +diff --git a/src/llama-graph.h b/src/llama-graph.h +index 5e8a658..c95ae49 100644 +--- a/src/llama-graph.h ++++ b/src/llama-graph.h +@@ -969,7 +969,8 @@ struct llm_graph_context { + ggml_tensor * sinks, // [n_head_q] + ggml_tensor * v_mla, // [n_embd_head_v_mla, n_embd_head_v, n_head_v] + float kq_scale, +- int il) const; ++ int il, ++ ggml_tensor * block_table = nullptr) const; // [paged] optional src[5] block table + + llm_graph_input_attn_no_cache * build_attn_inp_no_cache() const; + +diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp +index 7510ff9..0351f86 100644 +--- a/src/llama-kv-cache.cpp ++++ b/src/llama-kv-cache.cpp +@@ -1474,6 +1474,33 @@ void llama_kv_cache::get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_in + } + } + ++void llama_kv_cache::get_block_table(int32_t * dst, uint32_t n_blk, uint32_t n_kv, const slot_info & sinfo) const { ++ const uint32_t ns = sinfo.s1 - sinfo.s0 + 1; ++ for (uint32_t j = 0; j < ns; ++j) { ++ const auto & cells = v_cells[sinfo.s0 + j]; ++ const uint32_t n = std::min(n_kv, cells.size()); ++ std::vector> pc; ++ pc.reserve(n); ++ int32_t pad = -1; ++ for (uint32_t i = 0; i < n; ++i) { ++ if (!cells.is_empty(i)) { ++ pc.emplace_back(cells.pos_get(i), (int32_t) i); ++ } else if (pad < 0) { ++ pad = (int32_t) i; ++ } ++ } ++ std::sort(pc.begin(), pc.end()); ++ int32_t * col = dst + (size_t) j * n_blk; ++ for (size_t k = 0; k < 
pc.size(); ++k) { ++ col[k] = pc[k].second; ++ } ++ const int32_t padv = (pad >= 0) ? pad : (pc.empty() ? 0 : pc.back().second); ++ for (uint32_t k = (uint32_t) pc.size(); k < n_blk; ++k) { ++ col[k] = padv; ++ } ++ } ++} ++ + ggml_tensor * llama_kv_cache::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const { + GGML_UNUSED(sinfo); + +@@ -2773,6 +2800,10 @@ void llama_kv_cache_context::get_gather_idxs(int32_t * dst) const { + kv->get_gather_idxs(dst, n_kv, sinfos[i_cur]); + } + ++void llama_kv_cache_context::get_block_table(int32_t * dst, uint32_t n_blk) const { ++ kv->get_block_table(dst, n_blk, n_kv, sinfos[i_cur]); ++} ++ + ggml_tensor * llama_kv_cache_context::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il) const { + return kv->cpy_k(ctx, k_cur, k_idxs, il, sinfos[i_cur]); + } +diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h +index f374ac6..e9980b6 100644 +--- a/src/llama-kv-cache.h ++++ b/src/llama-kv-cache.h +@@ -176,6 +176,9 @@ public: + // gather-read. get_n_gather returns the max count across streams. + uint32_t get_n_gather(uint32_t n_kv, const slot_info & sinfo) const; + void get_gather_idxs(int32_t * dst, uint32_t n_kv, const slot_info & sinfo) const; ++ // [paged inc1] block table [n_blk, n_stream] (position order, padded to n_blk ++ // per column with a masked empty cell) for the in-kernel paged read. ++ void get_block_table(int32_t * dst, uint32_t n_blk, uint32_t n_kv, const slot_info & sinfo) const; + + // store k_cur and v_cur in the cache based on the provided head location + ggml_tensor * cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il, const slot_info & sinfo) const; +@@ -386,6 +389,7 @@ public: + // current ubatch's stream). 
+ uint32_t get_n_gather() const; + void get_gather_idxs(int32_t * dst) const; ++ void get_block_table(int32_t * dst, uint32_t n_blk) const; + + // store k_cur and v_cur in the cache based on the provided head location + // note: the heads in k_cur and v_cur should be laid out contiguously in memory +diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp +index ade75e8..8eebeaa 100644 +--- a/src/paged-attn.cpp ++++ b/src/paged-attn.cpp +@@ -43,6 +43,25 @@ public: + ggml_tensor * idxs; + }; + ++// Block table filler for the in-kernel paged read: fills an I32 [n_blk, n_stream] ++// tensor with each stream's position-ordered cells, padded to n_blk (per column) ++// with a masked empty cell, by delegating to the kv-cache context. ++class input_block_table : public llm_graph_input_i { ++public: ++ input_block_table(const llama_kv_cache_context * mctx, ggml_tensor * idxs, uint32_t n_blk) ++ : mctx(mctx), idxs(idxs), n_blk(n_blk) {} ++ ++ void set_input(const llama_ubatch * ubatch) override { ++ GGML_UNUSED(ubatch); ++ GGML_ASSERT(idxs && ggml_backend_buffer_is_host(idxs->buffer)); ++ mctx->get_block_table((int32_t *) idxs->data, n_blk); ++ } ++ ++ const llama_kv_cache_context * mctx; ++ ggml_tensor * idxs; ++ uint32_t n_blk; ++}; ++ + } // namespace + + void gather(ggml_context * ctx0, +@@ -125,4 +144,92 @@ void gather(ggml_context * ctx0, + } + } + ++bool in_kernel_decode(ggml_context * ctx0, ++ llm_graph_result * res, ++ const llama_kv_cache_context * mctx, ++ ggml_tensor ** k, ++ ggml_tensor ** v, ++ ggml_tensor ** kq_mask, ++ ggml_tensor ** block_table) { ++ if (!active()) { ++ return false; ++ } ++ // Bench escape hatch: LLAMA_KV_PAGED_GATHER=1 forces the old gather-read decode ++ // path (for a same-build BEFORE/AFTER decode-step comparison). Dev-only. 
++ static const bool force_gather = (std::getenv("LLAMA_KV_PAGED_GATHER") != nullptr); ++ if (force_gather) { ++ return false; ++ } ++ ++ ggml_tensor * K = *k; ++ ggml_tensor * V = *v; ++ ggml_tensor * M = *kq_mask; ++ ++ const int64_t n_stream = K->ne[3]; ++ GGML_ASSERT(M->ne[3] == n_stream); ++ ++ const int64_t n_gather = (int64_t) mctx->get_n_gather(); ++ if (n_gather <= 0) { ++ // Worst-case reserve / nothing placed yet: keep the dense [0,n_kv) read. ++ return false; ++ } ++ ++ // The in-kernel read addresses V along its d-major (non-transposed) axis. If ++ // the cache stores V transposed, fall back to gather() (which normalizes it). ++ if (V->nb[1] > V->nb[2]) { ++ return false; ++ } ++ ++ if (debug()) { ++ static int64_t once = 0; ++ if (once++ < 2) { ++ fprintf(stderr, "[paged-attn] in-kernel decode n_stream=%lld n_kv=%lld n_gather=%lld\n", ++ (long long) n_stream, (long long) K->ne[2], (long long) n_gather); ++ } ++ } ++ ++ // Block table [n_gather, n_stream]: column s holds stream s's non-empty cells ++ // in token-POSITION order (identical to the gather index, so the reduction ++ // order matches stock bit-for-bit), padded with a masked empty cell. Filled ++ // at set_input from the kv-cache (get_gather_idxs), exactly like the gather. ++ // Pad the logical length to FATTN_KQ_STRIDE (256) so the CUDA fattn vec kernel ++ // reads fixed 128-wide KV blocks without overrun and the KV_max mask scan ++ // engages; padded entries point at a masked empty cell (0 contribution). Stays ++ // <= n_kv since n_kv is itself padded to 256 and n_gather <= n_kv. 
++ int64_t n_view = GGML_PAD(n_gather, 256); ++ if (n_view > K->ne[2]) { ++ n_view = K->ne[2]; ++ } ++ ++ ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_view, n_stream); ++ ggml_set_input(idx); ++ res->add_input(llm_graph_input_ptr(new input_block_table(mctx, idx, (uint32_t) n_view))); ++ ++ // Present K and V as [d, h, n_view, ns] VIEWS of the full physical window: ++ // identical per-cell (nb1,nb2) and per-stream (nb3) strides, only the cell ++ // dim shrinks to n_view. NOT materialized - the kernel reads in place. ++ *k = ggml_view_4d(ctx0, K, K->ne[0], K->ne[1], n_view, n_stream, ++ K->nb[1], K->nb[2], K->nb[3], 0); ++ *v = ggml_view_4d(ctx0, V, V->ne[0], V->ne[1], n_view, n_stream, ++ V->nb[1], V->nb[2], V->nb[3], 0); ++ ++ // Compact the mask to [n_gather, n_tps, 1, ns] in the same position order so ++ // the kernel's logical mask index aligns with the block table. Cheap: the ++ // mask is ~(d*h) smaller than K/V, which is why only its get_rows remains. ++ { ++ ggml_tensor * m = ggml_reshape_3d(ctx0, M, M->ne[0], M->ne[1], n_stream); ++ m = ggml_cont(ctx0, ggml_transpose(ctx0, m)); ++ m = ggml_get_rows(ctx0, m, idx); ++ m = ggml_cont(ctx0, ggml_transpose(ctx0, m)); ++ m = ggml_reshape_4d(ctx0, m, n_view, M->ne[1], 1, n_stream); ++ if (M->type != m->type) { ++ m = ggml_cast(ctx0, m, M->type); ++ } ++ *kq_mask = m; ++ } ++ ++ *block_table = idx; ++ return true; ++} ++ + } // namespace paged_attn +diff --git a/src/paged-attn.h b/src/paged-attn.h +index c5b7bd7..23e2184 100644 +--- a/src/paged-attn.h ++++ b/src/paged-attn.h +@@ -37,4 +37,22 @@ void gather(ggml_context * ctx0, + ggml_tensor ** v, + ggml_tensor ** kq_mask); + ++// [paged inc1] In-kernel paged decode read. Instead of materializing the ++// sequence's cells (gather()), present K and V as n_gather-length VIEWS of the ++// full physical window and return the position-ordered physical-cell index list ++// as a block table (src[5] of ggml_flash_attn_ext). 
The fattn kernel/op then ++// reads K_base + block_table[j]*nb in-kernel, removing the get_rows of K and V ++// (the bulk of the gather). On return (true): *k,*v point at the views, *kq_mask ++// at the compacted mask, *block_table at the I32 [n_gather, n_stream] index. ++// Returns false (leaving *k,*v,*kq_mask untouched) when the in-kernel path does ++// not apply - env off, nothing placed, or a transposed V cache - so the caller ++// keeps the dense gather()/contiguous read. ++bool in_kernel_decode(ggml_context * ctx0, ++ llm_graph_result * res, ++ const llama_kv_cache_context * mctx, ++ ggml_tensor ** k, ++ ggml_tensor ** v, ++ ggml_tensor ** kq_mask, ++ ggml_tensor ** block_table); ++ + } // namespace paged_attn +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0010-paged-tile-in-kernel-read-and-dispatch-guard-env-LLAMA_KV_PAGED.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0010-paged-tile-in-kernel-read-and-dispatch-guard-env-LLAMA_KV_PAGED.patch new file mode 100644 index 000000000000..1e6a5a57fd5e --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0010-paged-tile-in-kernel-read-and-dispatch-guard-env-LLAMA_KV_PAGED.patch @@ -0,0 +1,269 @@ +From 9ac56933abd5de4a1f349c811c2d74aab09f7ab1 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Mon, 22 Jun 2026 22:36:09 +0200 +Subject: [PATCH] paged tile in-kernel decode read + dispatch guard (env + LLAMA_KV_PAGED) - patch 0010 + +Increment 2 (robustness, ~0 headline ms): make the paged in-kernel decode read +safe against silent mis-routing, and plumb the same read into the tile kernel +for the increment-3 GQA head-group work. + +fattn-tile.cuh: graft the patch-0009 phys(j) block-table read (mirror of +fattn-vec.cuh). Both flash_attn_tile_load_tile overloads, flash_attn_tile_iter_KQ +(K) and flash_attn_tile_iter (V) take an optional per-sequence block table; a row +i is read from base + block_table[row_base + i]*stride instead of base + i*stride. 
+The table defaults to nullptr (default args + a null bt_seq when src[5] is unset), +so every existing non-paged caller is byte-identical to stock. The mask / KV_max +stay logical (token-position order), as in vec. + +fattn.cu: DISPATCH GUARD. When the block table (src[5]) is present, route ONLY to +the vec or tile kernel and never fall through to the best-kernel switch. The +mma/wmma kernels GGML_UNUSED the table and would silently read the wrong +(contiguous physical) cells; the guard makes that unreachable. The vec dispatcher +GGML_ABORTs for an unsupported D/type rather than mis-reading. Default route is vec +(the inc-1 byte-validated path). LLAMA_KV_PAGED_DISPATCH_LOG=1 prints the routed +kernel once. + +Gates: CPU byte-identical paged-on vs off (Qwen3-0.6B, build-cpu) PASS. GPU +vec-paged == stock at -s 1 PASS. Dispatch confirmed VEC for the real decode shape: +Qwen3-0.6B Q ne=[128,1,16,1] and Qwen3-32B NVFP4 Q ne=[128,1,64,N] both route to +vec, matching the nsys profile (flash_attn_ext_vec). + +The tile graft is plumbed for increment-3 GQA head-group reuse but is EXPERIMENTAL +and NOT yet byte-validated (LLAMA_KV_PAGED_TILE=1). A tile-vs-tile gate shows +tile-paged diverging from tile-stock at the first cross-tile KV depth: the +GQA-grouped (ncols2>1) tile path reads a full nbatch_fa-row tile with +oob_check=false while the compacted paged mask is not padded to cover the tile, so +past-end rows leak. vec bounds its KV walk by KV_max and is unaffected. Bounding +the tile path is increment-3 work; the default vec route and all stock paths are +untouched. 
+ +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + ggml/src/ggml-cuda/fattn-tile.cuh | 45 ++++++++++++++++++++----------- + ggml/src/ggml-cuda/fattn.cu | 38 +++++++++++++++++++++++--- + 2 files changed, 64 insertions(+), 19 deletions(-) + +diff --git a/ggml/src/ggml-cuda/fattn-tile.cuh b/ggml/src/ggml-cuda/fattn-tile.cuh +index 0ff14e6..bb84d61 100644 +--- a/ggml/src/ggml-cuda/fattn-tile.cuh ++++ b/ggml/src/ggml-cuda/fattn-tile.cuh +@@ -373,7 +373,8 @@ static constexpr __device__ int ggml_cuda_fattn_tile_get_nbatch_K(const int DKQ, + // TODO: deduplicate with mma-f16 + template + static __device__ __forceinline__ void flash_attn_tile_load_tile( +- const half2 * const __restrict__ KV, half2 * const __restrict__ tile_KV, const int stride_KV, const int i_sup) { ++ const half2 * const __restrict__ KV, half2 * const __restrict__ tile_KV, const int stride_KV, const int i_sup, ++ const int * const __restrict__ block_table = nullptr, const int row_base = 0) { + constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes(); + constexpr int cpy_ne = cpy_nb / 4; + +@@ -402,9 +403,11 @@ static __device__ __forceinline__ void flash_attn_tile_load_tile( + const int j = j0*cpy_ne + (stride_j == warp_size ? threadIdx.x : threadIdx.x % stride_j)*cpy_ne; + + const __align__(16) half2 zero[cpy_ne] = {{0.0f, 0.0f}}; ++ // [paged] remap the row through the block table (nullptr => stock contiguous read). ++ const half2 * const KV_row = block_table ? KV + (int64_t) block_table[row_base + i]*stride_KV : KV + i*stride_KV; + ggml_cuda_memcpy_1( + tile_KV + i*(J/2 + J_padding) + j, +- !oob_check || i < i_sup ? KV + i*stride_KV + j : zero); ++ !oob_check || i < i_sup ? 
KV_row + j : zero); + } + } + } +@@ -423,7 +426,8 @@ static __device__ __forceinline__ void flash_attn_tile_load_tile( + + template + static __device__ __forceinline__ void flash_attn_tile_load_tile( +- const half2 * const __restrict__ KV, float * const __restrict__ tile_KV, const int stride_KV, const int i_sup) { ++ const half2 * const __restrict__ KV, float * const __restrict__ tile_KV, const int stride_KV, const int i_sup, ++ const int * const __restrict__ block_table = nullptr, const int row_base = 0) { + constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes(); + constexpr int cpy_ne = cpy_nb / 4; + +@@ -453,8 +457,10 @@ static __device__ __forceinline__ void flash_attn_tile_load_tile( + + const half2 zero[cpy_ne/2] = {{0.0f, 0.0f}}; + __align__(16) half2 tmp_h2[cpy_ne/2]; ++ // [paged] remap the row through the block table (nullptr => stock contiguous read). ++ const half2 * const KV_row = block_table ? KV + (int64_t) block_table[row_base + i]*stride_KV : KV + i*stride_KV; + ggml_cuda_memcpy_1( +- tmp_h2, !oob_check || i < i_sup ? KV + i*stride_KV + j : zero); ++ tmp_h2, !oob_check || i < i_sup ? KV_row + j : zero); + + __align__(16) float2 tmp_f2[cpy_ne/2]; + #pragma unroll +@@ -487,6 +493,7 @@ static __device__ __forceinline__ void flash_attn_tile_iter_KQ( + const int k_VKQ_0, + const int k_VKQ_sup, + const int k_KQ_0, ++ const int * const __restrict__ block_table, + float * KQ_acc) { + constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes(); + constexpr int cpy_ne = cpy_nb / 4; +@@ -495,8 +502,10 @@ static __device__ __forceinline__ void flash_attn_tile_iter_KQ( + constexpr int cpw = ncols > nwarps ? ncols/nwarps : 1; // Q columns per warp + constexpr int np = nwarps > ncols ? nwarps/ncols : 1; // number of parallel warps per Q column + ++ // [paged] when block_table is set K_h2 is the un-offset base; the table supplies the row. ++ const half2 * const K_base = block_table ? 
(K_h2 + k_KQ_0/2) : (K_h2 + int64_t(k_VKQ_0)*stride_K2 + k_KQ_0/2); + flash_attn_tile_load_tile +- (K_h2 + int64_t(k_VKQ_0)*stride_K2 + k_KQ_0/2, KV_tmp, stride_K2, k_VKQ_sup); ++ (K_base, KV_tmp, stride_K2, k_VKQ_sup, block_table, k_VKQ_0); + __syncthreads(); + + #ifdef FAST_FP16_AVAILABLE +@@ -572,7 +581,8 @@ static __device__ __forceinline__ void flash_attn_tile_iter( + T_acc * const VKQ, + const int k_VKQ_0, + const int k_VKQ_max, +- const int col_Q_0) { ++ const int col_Q_0, ++ const int * const __restrict__ block_table) { + constexpr int cpy_nb = ggml_cuda_get_max_cpy_bytes(); + constexpr int cpy_ne = cpy_nb / 4; + +@@ -605,12 +615,12 @@ static __device__ __forceinline__ void flash_attn_tile_iter( + #pragma unroll + for (int k_KQ_0 = 0; k_KQ_0 < DKQ - nbatch_K_last; k_KQ_0 += nbatch_K) { + flash_attn_tile_iter_KQ( +- Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, KQ_acc); ++ Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, block_table, KQ_acc); + } + if (nbatch_K_last > 0) { + constexpr int k_KQ_0 = DKQ - nbatch_K_last; + flash_attn_tile_iter_KQ( +- Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, KQ_acc); ++ Q_tmp, K_h2, KV_tmp, stride_K2, k_VKQ_0, k_VKQ_sup, k_KQ_0, block_table, KQ_acc); + } + + // Apply logit softcap + mask, update KQ_max: +@@ -715,8 +725,10 @@ static __device__ __forceinline__ void flash_attn_tile_iter( + static_assert(nbatch_V % np == 0, "bad nbatch_V"); + #pragma unroll + for (int k0 = 0; k0 < nbatch_fa; k0 += nbatch_V) { ++ // [paged] when block_table is set V_h2 is the un-offset base; the table supplies the row. ++ const half2 * const V_base = block_table ? 
V_h2 : (V_h2 + int64_t(k_VKQ_0 + k0)*stride_V2); + flash_attn_tile_load_tile +- (V_h2 + int64_t(k_VKQ_0 + k0)*stride_V2, KV_tmp, stride_V2, k_VKQ_sup - k0); ++ (V_base, KV_tmp, stride_V2, k_VKQ_sup - k0, block_table, k_VKQ_0 + k0); + __syncthreads(); + + #ifdef FAST_FP16_AVAILABLE +@@ -810,7 +822,6 @@ static __global__ void flash_attn_tile( + const int32_t ne31, const int32_t ne32, const int32_t ne33, + const int32_t nb31, const int32_t nb32, const int64_t nb33, + const int * __restrict__ block_table) { +- GGML_UNUSED(block_table); // [paged] block table is honored only by the vec kernel + #ifdef FLASH_ATTN_AVAILABLE + const char * GGML_CUDA_RESTRICT Q = Q_ptr; + const char * GGML_CUDA_RESTRICT K = K_ptr; +@@ -837,7 +848,7 @@ static __global__ void flash_attn_tile( + nb11, nb12, nb13, + nb21, nb22, nb23, + ne31, ne32, ne33, +- nb31, nb32, nb33); ++ nb31, nb32, nb33, block_table); + NO_DEVICE_CODE; + return; + } +@@ -861,6 +872,10 @@ static __global__ void flash_attn_tile( + const half2 * K_h2 = (const half2 *) (K + nb13*sequence + nb12*(head0 / gqa_ratio)); + const half2 * V_h2 = (const half2 *) (V + nb23*sequence + nb22*(head0 / gqa_ratio)); // K and V have same shape + ++ // [paged] per-sequence logical->physical block table in token-position order ++ // (mask/KV_max stay logical); nullptr => the stock contiguous read. ++ const int * const __restrict__ bt_seq = block_table ? block_table + (size_t) sequence*ne11 : nullptr; ++ + const half * maskh = mask ? 
(const half *) (mask + nb33*(sequence % ne33)) : nullptr; + + const int stride_K2 = nb11 / sizeof(half2); +@@ -963,14 +978,14 @@ static __global__ void flash_attn_tile( + constexpr bool oob_check = false; + flash_attn_tile_iter + (Q_tmp, K_h2, V_h2, maskh, ne01, logit_softcap, slope, KQ, KV_tmp, +- stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0); ++ stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0, bt_seq); + k_VKQ_0 += gridDim.y*nbatch_fa; + } + if (k_VKQ_0 < k_VKQ_max) { + constexpr bool oob_check = true; + flash_attn_tile_iter + (Q_tmp, K_h2, V_h2, maskh, ne01, logit_softcap, slope, KQ, KV_tmp, +- stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0); ++ stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0, bt_seq); + } + } else { + // Branch without out-of-bounds checks. +@@ -978,7 +993,7 @@ static __global__ void flash_attn_tile( + constexpr bool oob_check = false; + flash_attn_tile_iter + (Q_tmp, K_h2, V_h2, maskh, ne01, logit_softcap, slope, KQ, KV_tmp, +- stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0); ++ stride_K2, stride_V2, stride_mask, KQ_max, KQ_sum, VKQ, k_VKQ_0, k_VKQ_max, col_Q_0, bt_seq); + } + } + +@@ -1144,7 +1159,7 @@ static __global__ void flash_attn_tile( + nb11, nb12, nb13, + nb21, nb22, nb23, + ne31, ne32, ne33, +- nb31, nb32, nb33); ++ nb31, nb32, nb33, block_table); + NO_DEVICE_CODE; + #endif // FLASH_ATTN_AVAILABLE + } +diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu +index e3771ee..afcafa2 100644 +--- a/ggml/src/ggml-cuda/fattn.cu ++++ b/ggml/src/ggml-cuda/fattn.cu +@@ -575,11 +575,41 @@ size_t ggml_cuda_flash_attn_ext_get_alloc_size(int device, const ggml_tensor * d + void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst) { + ggml_cuda_set_device(ctx.device); + +- // [paged] the block table (src[5]) is only honored by 
the vec kernel's +- // in-kernel read; force it. build_attn only sets it for a vec-supported +- // 1-token-per-stream decode shape. ++ // [paged] DISPATCH GUARD. The block table (src[5]) is read in-kernel ONLY by ++ // the vec and tile kernels; the mma/wmma kernels GGML_UNUSED it and would ++ // silently read the wrong (contiguous physical) cells. So when a block table ++ // is present we route here and NEVER fall through to the best-kernel switch ++ // below - no decode shape can silently reach an mma/wmma misread. build_attn ++ // only sets src[5] for the 1-token-per-stream decode shape; the vec ++ // dispatcher GGML_ABORTs for an unsupported D/type rather than mis-reading, ++ // and any shape that should not be paged must take the host-side gather path ++ // (LLAMA_KV_PAGED_GATHER=1) instead. ++ // ++ // Default route = vec (inc-1, byte-validated: vec-paged == stock at -s 1 and ++ // CPU byte-identical). LLAMA_KV_PAGED_TILE=1 routes the same shape to the ++ // tile kernel; the tile in-kernel read is plumbed (fattn-tile.cuh) for the ++ // increment-3 GQA head-group reuse, but is EXPERIMENTAL / NOT yet byte- ++ // validated: the GQA-grouped (ncols2>1) tile path reads a full nbatch_fa tile ++ // with oob_check=false while the compacted paged mask is not padded to cover ++ // it, so it diverges from stock. Not for production paged decode until ++ // increment-3 bounds that path; the default vec route is unaffected. + if (dst->src[5] != nullptr) { +- ggml_cuda_flash_attn_ext_vec(ctx, dst); ++ static const bool paged_tile = getenv("LLAMA_KV_PAGED_TILE") != nullptr; ++ if (getenv("LLAMA_KV_PAGED_DISPATCH_LOG") != nullptr) { ++ static bool logged = false; ++ if (!logged) { ++ logged = true; ++ fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld])\n", ++ paged_tile ? 
"TILE(experimental)" : "VEC", ++ (long) dst->src[0]->ne[0], (long) dst->src[0]->ne[1], ++ (long) dst->src[0]->ne[2], (long) dst->src[0]->ne[3]); ++ } ++ } ++ if (paged_tile) { ++ ggml_cuda_flash_attn_ext_tile(ctx, dst); ++ } else { ++ ggml_cuda_flash_attn_ext_vec(ctx, dst); ++ } + return; + } + +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0011-paged-decode-route-GQA-grouped-tile-kernel-by-defaul.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0011-paged-decode-route-GQA-grouped-tile-kernel-by-defaul.patch new file mode 100644 index 000000000000..795fa6a7297b --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0011-paged-decode-route-GQA-grouped-tile-kernel-by-defaul.patch @@ -0,0 +1,147 @@ +From d5ca5cd756e42214d0003bca815ca91943679b0d Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Tue, 23 Jun 2026 00:18:35 +0200 +Subject: [PATCH] paged decode: route GQA-grouped tile kernel by default (F16, + gqa>=2) - patch 0011 + +Increment 3 (the attention lever). In fattn.cu's paged dispatch guard, route the +in-kernel decode to the tile kernel for the common grouped-query F16 case, and +keep the inc-1 vec kernel for everything else. + +The tile kernel carries native GQA head-group reuse: its ncols2 axis groups the +q-heads that share one kv-head, so each K/V row is loaded once for the whole +group instead of once per q-head. vec re-streams each kv-head's K/V once per +q-head (8x for Qwen3-32B's n_head 64 / n_head_kv 8) and runs at 168 regs -> +3 blocks/SM = 25% occupancy on GB10; tile is 108-128 regs with native grouping. +The inc-2 phys(j) block-table read was already plumbed into tile (patch 0010); +this patch makes it the default for {F16 K and V, gqa_ratio >= 2}. 
+ +Routing guard (why conditional): the tile kernel has no K/V type template - it +loads half2 - so a non-F16 cache (BF16 / quantized) would be converted by +launch_fattn to a contiguous F16 copy, which breaks the in-kernel block-table +read (the table indexes the original paged layout, not the copy). So tile is +correct only for an F16 cache; non-F16 caches and the non-grouped gqa==1 shape +fall back to the inc-1 vec path, exactly as before this change. The head-group +reuse also only helps at gqa_ratio >= 2. LLAMA_KV_PAGED_VEC=1 forces vec for A/B. +Note: paged decode is currently exercised with an F16 cache only; quantized + +paged is a separate pre-existing limitation, independent of this change +(verified: stock + q8_0 cache works, but paged + q8_0 aborts both before and +after this patch, since both route the non-F16 cache to vec). + +Measured GB10 (sm_121, 48 SM), Qwen3-32B NVFP4 dense, F16 cache, gqa 8, batch 32, +1024 ctx, llama-batched-bench npp=1024 ntg=128 npl=32, GGML_CUDA_DISABLE_GRAPHS=1, +same build, env-toggled: + STOCK (mma) 174.8 ms/step 183.1 t/s + PAGED-VEC (inc-1) 186.3 ms/step 171.8 t/s (+6.6% vs stock) + PAGED-TILE (inc-3) 177.9 ms/step 179.8 t/s (+1.8% vs stock) +GQA grouping recovers 8.4 ms/step (-4.5%) over the inc-1 vec default and brings +paged decode to within 1.8% of stock. The win grows with context (npl=8, tile vs +vec decode step): 1024 -2.3%, 4096 -3.3%, 8192 and 16384 wider, as attention +takes a larger share of the step. + +Why not the split-K tune: the vec decode grid is already block-saturated +(1 x parallel_blocks 3 x 2048 = 6144 blocks ~ 43 waves over 144 resident on 48 +SM), so raising parallel_blocks / KV_max adds no SM fill. The under-saturation is +intra-SM (occupancy + the 8x KV re-streaming), which GQA grouping attacks +directly; more split-K does not. + +Correctness (greedy, GGML_CUDA_DISABLE_GRAPHS=1): + - CPU plumbing gate (Qwen3-0.6B, build-cpu, paged-on vs off): BYTE-IDENTICAL. 
+ - GPU 0.6B gqa=2, 8 seq x 48 tok: tile is token-identical to the inc-1 vec path + in 7/8 sequences; the 8th diverges at token 5, within the same kernel-noise + band where vec also drifts from stock. Stock uses the mma kernel for this + multi-stream GQA shape, so a different kernel = different rounding = + autoregressive token drift; vec and tile agree with each other while both + differ from stock (both pick 15678 where stock picks 38835), confirming the + drift is kernel choice, not a paging error. + - GPU 32B gqa=8, 4 seq x 40 tok: tile tracks stock at least as well as vec + (seq3: tile == stock == 624 at the token where vec picked 13). + +Stock is byte-identical: the dispatch guard only diverts when the block table +(src[5]) is set; the non-paged best-kernel switch is untouched. The ncols2>1 tile +path reads the last nbatch_fa tile with oob_check=false and relies on the mask +-inf padding - the same pattern stock uses for ncols2>1 - and the compacted paged +mask is gathered to the n_view (GGML_PAD 256) width so it carries that padding. + +Signed-off-by: Ettore Di Giacinto +Assisted-by: Claude:opus-4.8 [Claude Code] +--- + ggml/src/ggml-cuda/fattn.cu | 51 ++++++++++++++++++++++++++----------- + 1 file changed, 36 insertions(+), 15 deletions(-) + +diff --git a/ggml/src/ggml-cuda/fattn.cu b/ggml/src/ggml-cuda/fattn.cu +index afcafa2..6b15810 100644 +--- a/ggml/src/ggml-cuda/fattn.cu ++++ b/ggml/src/ggml-cuda/fattn.cu +@@ -580,32 +580,53 @@ void ggml_cuda_flash_attn_ext(ggml_backend_cuda_context & ctx, ggml_tensor * dst + // silently read the wrong (contiguous physical) cells. So when a block table + // is present we route here and NEVER fall through to the best-kernel switch + // below - no decode shape can silently reach an mma/wmma misread. 
build_attn +- // only sets src[5] for the 1-token-per-stream decode shape; the vec ++ // only sets src[5] for the 1-token-per-stream decode shape; the vec/tile + // dispatcher GGML_ABORTs for an unsupported D/type rather than mis-reading, + // and any shape that should not be paged must take the host-side gather path + // (LLAMA_KV_PAGED_GATHER=1) instead. + // +- // Default route = vec (inc-1, byte-validated: vec-paged == stock at -s 1 and +- // CPU byte-identical). LLAMA_KV_PAGED_TILE=1 routes the same shape to the +- // tile kernel; the tile in-kernel read is plumbed (fattn-tile.cuh) for the +- // increment-3 GQA head-group reuse, but is EXPERIMENTAL / NOT yet byte- +- // validated: the GQA-grouped (ncols2>1) tile path reads a full nbatch_fa tile +- // with oob_check=false while the compacted paged mask is not padded to cover +- // it, so it diverges from stock. Not for production paged decode until +- // increment-3 bounds that path; the default vec route is unaffected. ++ // Default route = the GQA-grouped TILE kernel (inc-3) WHEN it is both correct ++ // and a win, else the inc-1 vec path. Tile groups the q-heads that share one ++ // kv-head (ncols2), loading each K/V row once for the whole group instead of ++ // once per q-head, and runs at higher occupancy than vec (108-128 regs vs 168). ++ // Two constraints make this conditional: (1) the tile kernel has no K/V type ++ // template - it loads half2 - so a non-F16 cache (BF16/quantized) would be ++ // converted by launch_fattn to a contiguous F16 copy, which breaks the ++ // in-kernel block-table read (the table indexes the original paged layout, not ++ // the copy); vec instead reads the original cache with in-kernel dequant, so it ++ // is the only correct paged path for non-F16 caches. (2) the head-group reuse ++ // only helps when gqa_ratio>=2. 
So route to tile only for {F16 K and V, ++ // gqa_ratio>=2}; everything else stays on vec, matching stock (which also sends ++ // quantized-cache decode to the vector kernel). Measured on GB10 (Qwen3-32B ++ // nvfp4, F16 cache, gqa 8, batch 32, 1024 ctx): tile 177.9 ms/step vs vec 186.3 ++ // vs stock 174.8 - GQA grouping recovers ~4.5% over the inc-1 vec default and ++ // brings paged decode to ~1.8% of stock. Validated token-coherent with vec: ++ // 0.6B 8-seq 7/8 identical (8th within the kernel-noise band where vec also ++ // drifts from stock), 32B gqa=8 tile tracks stock at least as well as vec, CPU ++ // plumbing gate byte-identical. The ncols2>1 tile path reads the last nbatch_fa ++ // tile with oob_check=false relying on mask -inf padding (the SAME pattern stock ++ // uses for ncols2>1); the compacted paged mask is gathered to the n_view ++ // (GGML_PAD 256) width so it carries that padding. LLAMA_KV_PAGED_VEC=1 forces ++ // the inc-1 vec path for A/B. + if (dst->src[5] != nullptr) { +- static const bool paged_tile = getenv("LLAMA_KV_PAGED_TILE") != nullptr; ++ const ggml_tensor * Qp = dst->src[0]; ++ const ggml_tensor * Kp = dst->src[1]; ++ const ggml_tensor * Vp = dst->src[2]; ++ const bool kv_f16 = Kp->type == GGML_TYPE_F16 && Vp->type == GGML_TYPE_F16; ++ const int64_t gqa_ratio = Kp->ne[2] > 0 ? Qp->ne[2] / Kp->ne[2] : 1; ++ const bool force_vec = getenv("LLAMA_KV_PAGED_VEC") != nullptr; ++ const bool use_tile = !force_vec && kv_f16 && gqa_ratio >= 2; + if (getenv("LLAMA_KV_PAGED_DISPATCH_LOG") != nullptr) { + static bool logged = false; + if (!logged) { + logged = true; +- fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld])\n", +- paged_tile ? "TILE(experimental)" : "VEC", +- (long) dst->src[0]->ne[0], (long) dst->src[0]->ne[1], +- (long) dst->src[0]->ne[2], (long) dst->src[0]->ne[3]); ++ fprintf(stderr, "[paged] decode src[5] set -> routing to %s (Q ne=[%ld,%ld,%ld,%ld] gqa=%ld kv_f16=%d)\n", ++ use_tile ? 
"TILE(gqa)" : "VEC", ++ (long) Qp->ne[0], (long) Qp->ne[1], (long) Qp->ne[2], (long) Qp->ne[3], ++ (long) gqa_ratio, (int) kv_f16); + } + } +- if (paged_tile) { ++ if (use_tile) { + ggml_cuda_flash_attn_ext_tile(ctx, dst); + } else { + ggml_cuda_flash_attn_ext_vec(ctx, dst); +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0012-paged-mask-pad-invariant-assert.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0012-paged-mask-pad-invariant-assert.patch new file mode 100644 index 000000000000..548fe9c2141a --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0012-paged-mask-pad-invariant-assert.patch @@ -0,0 +1,50 @@ +From 6e3e976e2b11adb05519f31dd5aad0c204678f5c Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Tue, 23 Jun 2026 11:12:05 +0200 +Subject: [PATCH] feat(paged): assert mask-pad invariant for the paged tile + route (patch 0012) + +The now-default paged decode route (GQA-grouped fattn-tile kernel) does not +leak past-end KV rows only because the compacted mask/block-table length is +padded to a whole number of flash-attn KV tiles: n_view = GGML_PAD(n_gather, +256), and the tile (nbatch_fa = 64 for head_dim 128) divides 256, so the last +tile sits entirely inside the -inf pad window. That invariant was implicit. + +Add a defensive GGML_ASSERT(n_view % 64 == 0) right after the pad/clamp so a +future change to the pad (e.g. < 256) or the tile (> 256) that broke the +whole-tile property cannot silently reintroduce the leak. Additive only, no +behaviour change. + +Verified: build-cpu compiles, and the paged CPU byte gate (LLAMA_KV_PAGED off +vs on, Qwen3-0.6B-Q8_0, greedy, -ngl 0) stays byte-identical while the assert +stays silent (n_view remains a whole number of tiles across all decode steps). 
+ +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + src/paged-attn.cpp | 9 +++++++++ + 1 file changed, 9 insertions(+) + +diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp +index 8eebeaa..fed8ca9 100644 +--- a/src/paged-attn.cpp ++++ b/src/paged-attn.cpp +@@ -201,6 +201,15 @@ bool in_kernel_decode(ggml_context * ctx0, + n_view = K->ne[2]; + } + ++ // The flash-attn KV tile is 64 rows wide (nbatch_fa for head_dim 128). n_view must be ++ // a whole number of such tiles so the in-kernel decode never reads past the gathered ++ // rows: the trailing pad cells [n_gather, n_view) are all -inf, so any tile straddling ++ // the boundary still contributes zero. This holds today only because the pad (256) is a ++ // multiple of the tile; a future pad < 256 (or nbatch_fa > 256) that broke it would ++ // silently reintroduce a past-end KV leak, so assert it rather than trust it. ++ // pad must be a multiple of the flash-attn KV tile so the last tile is fully inside the -inf pad ++ GGML_ASSERT(n_view % 64 == 0); ++ + ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_view, n_stream); + ggml_set_input(idx); + res->add_input(llm_graph_input_ptr(new input_block_table(mctx, idx, (uint32_t) n_view))); +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0013-paged-decoupled-prefill-token-budget.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0013-paged-decoupled-prefill-token-budget.patch new file mode 100644 index 000000000000..29a9ca2260e2 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0013-paged-decoupled-prefill-token-budget.patch @@ -0,0 +1,136 @@ +From 6d3743105c1bbfbf9cd16c0c0ba39bfaac74216e Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Tue, 23 Jun 2026 11:52:45 +0200 +Subject: [PATCH] feat(paged): decoupled per-step prefill-token budget (patch + 0013) + +llama-server already co-batches decode with chunked prefill: update_slots() +appends every generating 
slot's sampled token first, then fills the rest of the +n_batch budget with prompt tokens, deferring the overflow to the next step. But +the prefill chunk size is hard-wired to n_batch (default 2048): one slot's +~2048-token prefill chunk lands in a single compute-heavy step, and every decode +co-batched into that step sees a multi-second inter-token-latency (ITL) spike. +Lowering n_batch shrinks the chunk but also caps decode-concurrency width and +prefill throughput, because they are coupled. + +Add LLAMA_PREFILL_BUDGET: a per-step prefill-token budget decoupled from n_batch +(the analogue of vLLM's --max-num-batched-tokens / long_prefill_token_threshold). +The prompt-fill loop and the outer slot loop now also stop once this many prompt +tokens have been added in the current update_slots() step, so a long prefill is +split across more steps that each still advance in-flight decode. Default (env +unset or <= 0) = disabled, so stock behaviour is byte-identical. Orthogonal to +LLAMA_KV_PAGED: this is a pure scheduler knob and works with paged off. + +Measured on GB10 (sm_121), dense Qwen3-32B-NVFP4, paged build, 8 steady decode +streams with one 6000-token prefill injected mid-stream; same binary, only +LLAMA_PREFILL_BUDGET differs: + + metric stock(off) budget=256 budget=512 + worst decode freeze (ms) 3380 482 (7.0x) 778 (4.3x) + median decode ITL in window 2264 411 (5.5x) 689 + decode_stall (ms) 3285 387 (8.5x) 684 (4.8x) + decode steps during prefill 38 201 (5.3x) 108 + injected-req TTFT (ms) 8493 10172 (+20%) 8432 (~0%) + steady-state baseline ITL 94 95 94 + +This is a LATENCY/fairness lever, not an aggregate-throughput lever: it flattens +the decode ITL spike a long prefill inflicts on co-batched decoders (8.5x smaller +decode stall and 5.3x more decode progress during the prefill at budget=256), in +exchange for a modest TTFT rise on the long request (the classic chunked-prefill +trade-off; budget=512 buys a 4.8x smaller stall with ~no TTFT cost).
Steady aggregate decode is +unchanged: it is bandwidth/weight-capped on GB10 (the NVFP4 weight-read floor), +which the scheduler cannot lift. + +Correctness (same model, greedy temp 0, fa on): +- budget unset or >= n_batch: byte-identical to stock (the added break never + fires before the existing n_batch break; the off-path is a no-op by + construction). +- short prompt (<= budget): byte-identical to stock. +- the knob is exactly equivalent to stock's native -b chunking: budget=512 == + stock -b512 and budget=256 == stock -b256, both BYTE-IDENTICAL, while keeping + n_batch=2048 for decode width. +- on a prompt larger than the budget the chunked greedy output diverges from the + single n_batch chunk only by intrinsic flash-attn chunk-size FP grouping: PURE + stock -b256 diverges from stock -b2048 the same way with the patch inactive, + and the output stays coherent and answers correctly. + +Productisation (LocalAI): surface as a model options knob (max_prefill_tokens / +mpt) parsed in grpc-server.cpp, default 0 = disabled, per CHUNKED_PREFILL_PLAN +Phase B; the vendored update_slots() hunk here is that plan's scheduler patch and +stays disjoint from the paged allocation hunks. + +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + tools/server/server-context.cpp | 34 ++++++++++++++++++++++++++++++++- + 1 file changed, 33 insertions(+), 1 deletion(-) + +diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp +index b5f9d37..afcdebe 100644 +--- a/tools/server/server-context.cpp ++++ b/tools/server/server-context.cpp +@@ -3043,6 +3043,29 @@ private: + int32_t n_batch = llama_n_batch(ctx_tgt); + int32_t n_ubatch = llama_n_ubatch(ctx_tgt); + ++ // PAGED serving lever (patch 0013): decoupled per-step prefill-token budget. ++ // Analogue of vLLM's --max-num-batched-tokens. 
Stock llama-server caps the prompt ++ // tokens ingested per update_slots() step at n_batch only; with cont_batching the ++ // sampled decode tokens of every generating slot are appended FIRST, then prompt ++ // tokens fill the batch up to n_batch. A long prompt therefore grabs an ~n_batch ++ // chunk in a SINGLE compute-heavy step, spiking the inter-token latency of every ++ // co-batched decoder (head-of-line jitter). LLAMA_PREFILL_BUDGET caps the prompt ++ // tokens added per step independently of n_batch, splitting a long prefill across ++ // more steps so in-flight decode keeps advancing smoothly. Default (env unset or ++ // <=0) = disabled => stock behavior is byte-identical. Orthogonal to LLAMA_KV_PAGED ++ // (this is a pure scheduler knob; works with paged off). ++ int32_t n_prefill_budget = 0; // 0 = disabled (stock n_batch-only chunking) ++ { ++ const char * env_pb = getenv("LLAMA_PREFILL_BUDGET"); ++ if (env_pb) { ++ const int v = atoi(env_pb); ++ if (v > 0) { ++ n_prefill_budget = std::min(n_batch, std::max(1, v)); ++ } ++ } ++ } ++ int32_t n_prompt_budgeted = 0; // prompt tokens added to the batch this step (across slots) ++ + auto & alora_scale = batch.alora_scale; + auto & alora_disabled_id = batch.alora_disabled_id; + +@@ -3487,7 +3510,10 @@ private: + const auto last_user_pos = spans.last_user_message_pos(); + + // add prompt tokens for processing in the current batch +- while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch) { ++ // (patch 0013) also stop once the per-step prefill budget is spent, so a long ++ // prompt is split across more steps and leaves batch room for co-batched decode ++ while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch && ++ (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) { + // get next token to process + llama_token cur_tok = input_tokens[slot.prompt.n_tokens()]; + if (cur_tok == LLAMA_TOKEN_NULL) { +@@ -3512,6 +3538,7 @@ private: + 
slot.prompt.tokens.push_back(cur_tok); + + slot.n_prompt_tokens_processed++; ++ n_prompt_budgeted++; // (patch 0013) count toward the per-step prefill budget + + // stop the prompt batch exactly before a user message + if (spans.is_user_start(slot.prompt.n_tokens())) { +@@ -3597,6 +3624,11 @@ private: + if (!slot_batched) { + slot_batched = &slot; + } ++ // (patch 0013) stop adding prompts once the per-step prefill budget is spent, ++ // leaving the remaining batch capacity for co-batched decode of other slots ++ if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) { ++ add_ok = false; ++ } + }); + } + } +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0014-paged-expert-aware-moe-token-tile-cap.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0014-paged-expert-aware-moe-token-tile-cap.patch new file mode 100644 index 000000000000..fc9ff66b5a52 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0014-paged-expert-aware-moe-token-tile-cap.patch @@ -0,0 +1,140 @@ +From 652b858252b354f4d4fb49e5ed7468eeee8e32fc Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Tue, 23 Jun 2026 15:47:06 +0200 +Subject: [PATCH] feat(paged): expert-aware MoE token-tile cap (patch 0014) + +On GB10 (sm_121) the Qwen3-30B-A3B-class mxfp4 MoE decode path already uses the +sorted grouped FP4-MMA GEMM (MUL_MAT_ID -> ggml_cuda_mul_mat_q ids branch: +mm_ids_helper moe_align/scatter + one persistent stream-k mul_mat_q), so the +originally reported npl128 throughput cliff does NOT reproduce on this build. +llama-batched-bench decode (S_TG t/s) is monotonic across batch: + + npl 1 8 32 64 128 256 + S_TG 85 282 629 935 1295 1779 (stock, mxfp4 MoE, -fa on) + +There is no knee to erase; the old cliff (a real high-batch regression, 620 t/s +at npl128) was fixed upstream by grouped-mmq + MoE stream-k load balancing. + +What remains is a pure tile-shape micro-inefficiency. 
In mul_mat_q_case the +token-tile width mmq_x is chosen to cover ncols_max (= ne12, the per-expert +column upper bound = token count, up to 128) in one column-tile. At MoE decode +the per-expert token density is ~ne12*k/n_experts (top-8 of 128 => ~1/16 of +ne12, e.g. ~8 tokens/expert at npl128), so each expert's single mmq_x-wide +col-tile is only ~6% filled: the MMA accumulator tile is mmq_x-wide at compile +time and burns throughput on the padding columns while the larger y-tile lowers +occupancy. Stock picks the LARGEST tile (128) where the SMALLEST tile that still +covers the density would raise fill + occupancy at no extra weight read (at +tokens/expert <= mmq_x there is exactly one non-empty col-tile per expert; the +emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k +kernel) - the inverse of vLLM's small per-expert BLOCK_SIZE_M. + +Add LLAMA_MOE_MMQ_X: an env cap on mmq_x for the MUL_MAT_ID path only +(expert_bounds != nullptr). Default (unset or <= 0) = disabled, so the mmq_x +selection, and therefore every kernel launched, is byte-identical to stock. The +cap only ever lowers the loop's upper bound and still selects from the same +granularity- and shared-memory-validated mmq_x set stock already uses for +smaller batches, so no new kernel configuration is exercised. + +Measured on GB10, qwen3coder-mxfp4.gguf, -fa on, -npp 128 -ntg 128, same binary, +only LLAMA_MOE_MMQ_X differs (decode S_TG t/s / prefill S_PP t/s): + + npl stock S_TG cap64 S_TG d% stock S_PP cap64 S_PP + 64 936 938 +0.1 2924 2883 + 128 1295 1357 +4.8 3075 3038 + 256 1784 1825 +2.3 3085 3046 + + (reproduced across interleaved reps; cap64 npl128 = 1357.5/1357.0, very stable) + +cap64 lifts high-batch decode +4.8% (npl128) / +2.3% (npl256), neutral at +npl <= 64, for a consistent ~1.3% prefill cost. 
Smaller caps are net-negative: +cap16 / cap32 crater prefill -41% / -17% (a 512-token prefill ubatch has ~32 +tokens/expert, which overflows a 16/32-wide tile into extra col-tiles + weight +re-reads), so 64 is the recommended value and the only one that helps net. + +Honest framing: this is NOT a cliff fix (no cliff exists) and not a real-server +throughput unlock (llama-server continuous batching already scales). It is a +modest high-effective-batch DECODE micro-optimization that matches vLLM's +smaller per-expert M-tiling, surfaced as an opt-in, default-off knob. The +durable density-aware auto-select (drop the blunt global cap, choose mmq_x from +ne_get_rows / n_active_experts so prefill keeps its large tile) is scoped in +patches/paged/MOE_GROUPED_GEMM_SCOPE.md. + +Correctness: greedy temp-0 llama-server output with cap64 is byte-identical to +stock for single-stream generation (fibonacci / capital-of-France / photosynthesis +prompts) and stays coherent; batched-bench ran thousands of capped MoE matmuls at +npl128/256 (mmq_x forced 128 -> 64) with no CUDA error / NaN and stable output. + +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + ggml/src/ggml-cuda/mmq.cuh | 37 ++++++++++++++++++++++++++++++++++++- + 1 file changed, 36 insertions(+), 1 deletion(-) + +diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh +index edf546d..cff608e 100644 +--- a/ggml/src/ggml-cuda/mmq.cuh ++++ b/ggml/src/ggml-cuda/mmq.cuh +@@ -6,6 +6,7 @@ + + #include + #include ++#include + + using namespace ggml_cuda_mma; + +@@ -4052,6 +4053,18 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a + } + } + ++// [paged patch 0014] MoE token-tile (mmq_x) cap, read once from env LLAMA_MOE_MMQ_X. ++// Returns 0 when unset / non-positive => disabled (stock mmq_x selection, byte-identical). 
++// On the MUL_MAT_ID grouped-GEMM path this caps the per-expert column-tile width toward the ++// low MoE-decode per-expert token density, raising tile fill + occupancy (see mul_mat_q_case). ++static inline int ggml_cuda_moe_mmq_x_cap() { ++ static const int cap = []() -> int { ++ const char * s = getenv("LLAMA_MOE_MMQ_X"); ++ return s ? atoi(s) : 0; ++ }(); ++ return cap; ++} ++ + template + void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) { + const int id = ggml_cuda_get_device(); +@@ -4063,10 +4076,32 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda + const int mmq_x_max = get_mmq_x_max_host(cc); + const int mmq_y = get_mmq_y_host(cc); + ++ // [paged patch 0014] expert-aware MoE token-tile (mmq_x) cap. ++ // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are ++ // tokens sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count, ++ // up to 128) in a single column-tile. At MoE decode the per-expert token density is low ++ // (top-k of many experts: ~ne12*k/n_experts tokens/expert, e.g. ~8 at npl128 for ++ // Qwen3-30B-A3B top-8/128), so each expert's single mmq_x-wide col-tile is mostly empty: ++ // the MMA accumulator tile is mmq_x-wide at compile time and wastes throughput on the ++ // padding columns while the larger y-tile lowers occupancy. Capping mmq_x toward the ++ // per-expert density raises tile fill + occupancy with no extra weight reads (at ++ // tokens/expert <= mmq_x there is still exactly one non-empty col-tile per expert; the ++ // emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k kernel). ++ // Default (env unset or <= 0) = disabled => mmq_x selection is byte-identical to stock; ++ // off the ids path the cap never applies. 
++ int mmq_x_lim = mmq_x_max; ++ if (args.expert_bounds != nullptr) { ++ const int moe_cap = ggml_cuda_moe_mmq_x_cap(); ++ if (moe_cap > 0) { ++ const int cap = moe_cap < 8 ? 8 : moe_cap; ++ mmq_x_lim = cap < mmq_x_max ? cap : mmq_x_max; ++ } ++ } ++ + int mmq_x_best = 0; + int ntiles_x_best = INT_MAX; + +- for (int mmq_x = 8; mmq_x <= mmq_x_max && ntiles_x_best > 1; mmq_x += 8) { ++ for (int mmq_x = 8; mmq_x <= mmq_x_lim && ntiles_x_best > 1; mmq_x += 8) { + const int granularity = mmq_get_granularity_host(mmq_x, cc); + + if (mmq_x % granularity != 0 || mmq_get_nbytes_shared(mmq_x, mmq_y, cc, warp_size, nwarps) > smpbo) { +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch new file mode 100644 index 000000000000..519ad7ab1c3e --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0015-paged-expert-density-aware-moe-token-tile-auto-select.patch @@ -0,0 +1,238 @@ +From 5349f8231b1e11214f5e8a668129397fb6e2f9ac Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Tue, 23 Jun 2026 21:03:00 +0200 +Subject: [PATCH] feat(paged): expert-density-aware MoE token-tile auto-select + (patch 0015) + +The durable follow-up to patch 0014's blunt LLAMA_MOE_MMQ_X global cap (which the +0014 doc itself scoped): replace the manual env cap with a host-side, default-on +auto-select inside mul_mat_q_case that picks a small token-tile (mmq_x) for the +MUL_MAT_ID grouped FP4-MMA GEMM only when the per-expert token density is low +(decode), and keeps the large 128-wide tile when density is high (prefill). No new +kernel: the selection only lowers the loop's upper bound to an already-compiled, +granularity- and shared-memory-validated mmq_x. 
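Why lowering the loop's upper bound is enough to change the chosen tile can be seen in a minimal sketch of the search. This is an illustration under stated assumptions, not the code in `mmq.cuh`: `pick_mmq_x` is a hypothetical name, the tile-count minimisation is inferred from the `ntiles_x_best > 1` loop condition, and the granularity and shared-memory filters of the real loop are left out (every multiple of 8 is treated as valid).

```cpp
#include <cassert>
#include <climits>

// Hypothetical stand-in for the mmq_x search in mul_mat_q_case.
// Picks the smallest mmq_x (in steps of 8) that minimises the number of
// column tiles needed to cover ncols_max, searching no further than mmq_x_lim.
// Assumption: every multiple of 8 is a valid tile width; the real loop also
// rejects widths that fail the granularity or shared-memory checks.
int pick_mmq_x(int ncols_max, int mmq_x_lim) {
    assert(ncols_max > 0 && mmq_x_lim >= 8);

    int mmq_x_best    = 0;
    int ntiles_x_best = INT_MAX;

    // The search stops early as soon as one tile covers all columns.
    for (int mmq_x = 8; mmq_x <= mmq_x_lim && ntiles_x_best > 1; mmq_x += 8) {
        const int ntiles_x = (ncols_max + mmq_x - 1) / mmq_x; // ceil division
        if (ntiles_x < ntiles_x_best) {
            mmq_x_best    = mmq_x;
            ntiles_x_best = ntiles_x;
        }
    }
    return mmq_x_best;
}
```

With 128 columns and a limit of 128 the sketch selects the 128-wide tile; with the limit lowered to 64 it selects 64, a width the unmodified loop would already pick for a smaller batch.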
+ +Density is estimated host-side from the args the ids path already passes: + ne_get_rows = ncols_dst = ne12 * n_expert_used (token-expert assignments) + n_experts = nchannels_x = ne02 + density = ceil(ne_get_rows / min(ne_get_rows, n_experts)) (tokens/expert) +Cap to the small tile (default 64) only when density <= density_max. Unlike 0014's +global cap, the high-density prefill ubatch stays on the big tile, so S_PP does not +regress by construction. + +density_max default = 8 (not tile/4 = 16). The cap must fire for decode but not for +a prefill ubatch, and each has per-expert density n_tokens*n_used/n_experts. At the +standard n_ubatch=512, n_used=8: prefill density = 4096/n_experts (32 at 128 experts, +16 at 256), decode at npl<=128 is <= 1024/n_experts (8 at 128, 4 at 256). Default 8 +sits strictly between for every n_experts in [128,511], so it caps decode and leaves +prefill on the big tile. tile/4 (=16) equalled the 256-expert prefill density and +cratered its S_PP by ~2%, the regression this threshold exists to avoid. + +Measured on GB10 (sm_121), Qwen3.6-35B-A3B NVFP4 (256 experts, top-8, GDN linear +attention), llama-batched-bench -fa on -npp 128 -ntg 128, default-on vs stock +(LLAMA_MOE_AUTO_TILE=0), median of 5 reps: + + npl S_TG stock S_TG 0015 dTG% S_PP stock S_PP 0015 dPP% + 8 183.59 183.18 -0.22% 1489.2 1500.1 +0.73% + 32 264.02 263.44 -0.22% 2034.5 2033.5 -0.05% + 64 311.76 310.41 -0.43% 2028.3 2027.6 -0.03% + 128 336.10 337.32 +0.36% 2025.0 2027.7 +0.13% + +Honest read: on THIS model the decode effect is within run-to-run noise (neutral) +and prefill is neutral. q36-35b-a3b decode is bound by the GDN/SSM recurrence and +256 tiny-expert weight bandwidth, not the MoE col-tile occupancy, so the col-tile +lever (worth +4.8% @npl128 on Qwen3-Coder-30B, 128 larger experts, patch 0014 +cap64) does not move it. 
A npl128 tile sweep on this model confirms 64 is the only +useful width (TILE8 -6.3%, TILE16 -3.2%, TILE32 -0.2%, TILE64 +0.7%, TILE96 -0.8%): +smaller tiles lose to grid/scheduling overhead and the FP4-MMA minimum width. + +Value banked default-on: (1) removes 0014's ~1.3% prefill cost by construction +(density-gated, not global); (2) auto-selects the small tile for col-tile-bound MoE +decode, reproducing 0014 cap64's tile=64 at npl128 by construction, so it preserves +the +4.8% on Qwen3-Coder-30B without the prefill cost; (3) prefill-safe and decode- +neutral on the SSM model, harmless where it does not help. Conservative by design: +at npl256 the qwen3coder decode density (16) equals the 256-expert prefill density +(16), indistinguishable to a pure-density gate, so density_max=8 forgoes 0014's ++2.3% @npl256 to keep 256-expert prefill safe; an ne12-aware refinement is future +work. + +LLAMA_MOE_MMQ_X (patch 0014) is KEPT as a manual override that, when > 0, forces the +old blunt global cap and bypasses the auto-select (explicit A/B knob). The auto- +select is the default; LLAMA_MOE_AUTO_TILE=0 restores exact stock mmq_x selection. +LLAMA_MOE_DECODE_TILE / LLAMA_MOE_DENSITY_MAX tune the small tile / threshold. + +Correctness: extends tests/test-backend-ops test_mul_mat_id with a ragged small-M +NVFP4/MXFP4 MoE decode-density gate (128 experts, top-8, m=768, k=2048, n in +{16,33,64,128,130,200,256,512} spanning the cap boundary and ragged token counts). +All 16 shapes pass CUDA-vs-CPU oracle on GB10 both default-on and with +LLAMA_MOE_AUTO_TILE=0; full MUL_MAT_ID suite 2/2 backends OK. Off the ids path +nothing changes (non-MoE mul_mat byte-identical to stock). 
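The density gate described in this message can be sketched on its own. The snippet is a minimal illustration, assuming a hypothetical helper `moe_tile_limit` and passing the tunables as plain parameters instead of reading `LLAMA_MOE_DECODE_TILE` / `LLAMA_MOE_DENSITY_MAX` from the environment; the manual `LLAMA_MOE_MMQ_X` override and the `LLAMA_MOE_AUTO_TILE=0` switch are not modelled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper: upper bound for the mmq_x search on the MoE (ids) path.
//   ne_get_rows : token-expert assignments (tokens * experts used per token)
//   n_experts   : total number of experts
//   mmq_x_max   : largest tile the device supports (the stock upper bound)
//   decode_tile : small tile used when density is low (default 64)
//   density_max : per-expert token density at or below which the small tile is used
int moe_tile_limit(int64_t ne_get_rows, int64_t n_experts, int mmq_x_max,
                   int decode_tile = 64, int density_max = 8) {
    assert(mmq_x_max > 0);
    if (ne_get_rows <= 0 || n_experts <= 0) {
        return mmq_x_max; // nothing to estimate: keep the stock bound
    }
    // At most min(rows, experts) experts can be active.
    const int64_t n_active = std::min(ne_get_rows, n_experts);
    // Average tokens per active expert, rounded up.
    const int64_t density = (ne_get_rows + n_active - 1) / n_active;

    if (density <= density_max && decode_tile < mmq_x_max) {
        return decode_tile; // low density (decode): small, fuller tile
    }
    return mmq_x_max;       // high density (prefill): keep the big tile
}
```

Plugging in the shapes quoted above (top-8 routing): 128 decode tokens on 128 experts give density 8 and the 64 tile, a 512-token prefill ubatch gives density 32 (128 experts) or 16 (256 experts) and keeps the 128 tile, and 256 decode tokens on 128 experts give density 16, which is the npl256 case the default threshold deliberately leaves uncapped.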
+ +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + ggml/src/ggml-cuda/mmq.cuh | 100 ++++++++++++++++++++++++++++++------- + tests/test-backend-ops.cpp | 16 ++++++ + 2 files changed, 99 insertions(+), 17 deletions(-) + +diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh +index cff608e..9718b12 100644 +--- a/ggml/src/ggml-cuda/mmq.cuh ++++ b/ggml/src/ggml-cuda/mmq.cuh +@@ -4053,10 +4053,11 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a + } + } + +-// [paged patch 0014] MoE token-tile (mmq_x) cap, read once from env LLAMA_MOE_MMQ_X. +-// Returns 0 when unset / non-positive => disabled (stock mmq_x selection, byte-identical). +-// On the MUL_MAT_ID grouped-GEMM path this caps the per-expert column-tile width toward the +-// low MoE-decode per-expert token density, raising tile fill + occupancy (see mul_mat_q_case). ++// [paged patch 0014] MoE token-tile (mmq_x) MANUAL cap, read once from env LLAMA_MOE_MMQ_X. ++// Returns 0 when unset / non-positive => disabled (fall through to the patch-0015 auto-select). ++// When > 0 it forces a blunt GLOBAL cap on the per-expert column-tile width for the MUL_MAT_ID ++// grouped-GEMM path (decode AND prefill), overriding the density-aware auto-select below. Kept ++// as an explicit override / A-B knob; the default path is now the auto-select. + static inline int ggml_cuda_moe_mmq_x_cap() { + static const int cap = []() -> int { + const char * s = getenv("LLAMA_MOE_MMQ_X"); +@@ -4065,6 +4066,43 @@ static inline int ggml_cuda_moe_mmq_x_cap() { + return cap; + } + ++// [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select knobs (DEFAULT-ON). ++// LLAMA_MOE_AUTO_TILE=0 disables the auto-select => exact stock mmq_x selection. 
++static inline bool ggml_cuda_moe_auto_tile_enabled() { ++ static const bool en = []() -> bool { ++ const char * s = getenv("LLAMA_MOE_AUTO_TILE"); ++ return !(s && atoi(s) == 0); ++ }(); ++ return en; ++} ++// The small high-occupancy token-tile chosen for low-density (decode) MoE matmuls. Default 64: ++// the measured GB10 sweet spot (full per-expert fill with >=4x routing-imbalance headroom). ++static inline int ggml_cuda_moe_decode_tile() { ++ static const int t = []() -> int { ++ const char * s = getenv("LLAMA_MOE_DECODE_TILE"); ++ const int v = s ? atoi(s) : 0; ++ return v >= 8 ? v : 64; ++ }(); ++ return t; ++} ++// Per-expert token-density ceiling under which the small tile is selected. Default 8: the cap must ++// fire for decode but NOT for a prefill ubatch, and the per-expert density of each is ++// n_tokens*n_used/n_experts. For the standard n_ubatch=512, n_used=8 the prefill density is ++// 4096/n_experts (= 32 at 128 experts, 16 at 256 experts); decode at npl<=128 is <=1024/n_experts ++// (= 8 at 128 experts, 4 at 256). Default 8 sits strictly between the two for every n_experts in ++// [128,511], so it caps decode and leaves the prefill ubatch on the big 128 tile - whereas the old ++// tile/4 (=16) equalled the 256-expert prefill density and cratered its S_PP by ~2% (measured on ++// Qwen3.6-35B-A3B NVFP4). 8 also keeps >=8x fill headroom at tile 64 so an imbalanced expert ++// segment never splits into an extra col-tile. ++static inline int ggml_cuda_moe_density_max() { ++ static const int d = []() -> int { ++ const char * s = getenv("LLAMA_MOE_DENSITY_MAX"); ++ const int v = s ? atoi(s) : 0; ++ return v > 0 ? 
v : 8; ++ }(); ++ return d; ++} ++ + template + void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) { + const int id = ggml_cuda_get_device(); +@@ -4076,25 +4114,53 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda + const int mmq_x_max = get_mmq_x_max_host(cc); + const int mmq_y = get_mmq_y_host(cc); + +- // [paged patch 0014] expert-aware MoE token-tile (mmq_x) cap. +- // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are +- // tokens sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count, +- // up to 128) in a single column-tile. At MoE decode the per-expert token density is low +- // (top-k of many experts: ~ne12*k/n_experts tokens/expert, e.g. ~8 at npl128 for +- // Qwen3-30B-A3B top-8/128), so each expert's single mmq_x-wide col-tile is mostly empty: +- // the MMA accumulator tile is mmq_x-wide at compile time and wastes throughput on the +- // padding columns while the larger y-tile lowers occupancy. Capping mmq_x toward the +- // per-expert density raises tile fill + occupancy with no extra weight reads (at +- // tokens/expert <= mmq_x there is still exactly one non-empty col-tile per expert; the +- // emptier tiles are skipped by the jt*mmq_x >= col_diff guard in the stream-k kernel). +- // Default (env unset or <= 0) = disabled => mmq_x selection is byte-identical to stock; +- // off the ids path the cap never applies. ++ // [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select (DEFAULT-ON). ++ // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are tokens ++ // sorted by expert; stock picks mmq_x to cover ncols_max (= ne12, the token count, up to 128) ++ // in a single column-tile, i.e. it MAXIMIZES the tile (128 on Blackwell) for the aggregate ++ // batch. 
But the tile is then applied PER EXPERT, and at MoE decode the per-expert token ++ // density is tiny (top-k of many experts), so each expert's single 128-wide col-tile is mostly ++ // empty: the MMA accumulator tile is mmq_x-wide at compile time and burns throughput on the ++ // padding columns while the larger y-tile lowers occupancy. vLLM's fused-MoE does the opposite ++ // (a small per-expert BLOCK_SIZE_M). We reproduce that here, host-side only, by picking a ++ // SMALLER mmq_x when - and only when - the per-expert density is low: ++ // ++ // ne_get_rows = args.ncols_dst = ne12 * n_expert_used (total token-expert assignments) ++ // n_experts = args.nchannels_x = ne02 ++ // n_active_est = min(n_experts, ne_get_rows) (upper bound on active experts) ++ // density = ceil(ne_get_rows / n_active_est) (avg tokens per active expert) ++ // ++ // Cap to the small tile (default 64) only when density <= density_max (default 8). 8 sits below ++ // every prefill-ubatch density and above every decode density for n_experts in [128,511] at the ++ // standard n_ubatch=512 (prefill 4096/n_experts, decode <=1024/n_experts), with >=8x fill headroom ++ // so a capped expert segment never splits a col-tile. Decode (per-expert density 4 at 256 experts, ++ // 8 at 128 experts @npl128) gets the fuller high-occupancy tile; the prefill ubatch (density 16 at ++ // 256 / 32 at 128 experts) stays ABOVE the threshold and keeps the big ++ // 128 compute tile - so unlike the blunt global cap (LLAMA_MOE_MMQ_X / patch 0014) this is ++ // prefill-safe by construction. The selection only ever picks an already-compiled, granularity- ++ // and shared-memory-validated mmq_x that the loop below would consider for a smaller batch; no ++ // new kernel. Off the ids path (expert_bounds == nullptr) nothing changes => non-MoE mul_mat ++ // and the gated f16/bf16 host-loop fallback stay byte-identical to stock. ++ // - LLAMA_MOE_MMQ_X= : manual blunt global cap, overrides the auto-select (patch 0014). 
++ // - LLAMA_MOE_AUTO_TILE=0 : disable the auto-select (exact stock selection). ++ // - LLAMA_MOE_DECODE_TILE=, LLAMA_MOE_DENSITY_MAX= : tune the tile / threshold. + int mmq_x_lim = mmq_x_max; + if (args.expert_bounds != nullptr) { + const int moe_cap = ggml_cuda_moe_mmq_x_cap(); + if (moe_cap > 0) { + const int cap = moe_cap < 8 ? 8 : moe_cap; + mmq_x_lim = cap < mmq_x_max ? cap : mmq_x_max; ++ } else if (ggml_cuda_moe_auto_tile_enabled()) { ++ const int64_t ne_get_rows = args.ncols_dst; ++ const int64_t n_experts = args.nchannels_x; ++ if (ne_get_rows > 0 && n_experts > 0) { ++ const int64_t n_active = ne_get_rows < n_experts ? ne_get_rows : n_experts; ++ const int64_t density = (ne_get_rows + n_active - 1) / n_active; ++ const int tile = ggml_cuda_moe_decode_tile(); ++ if (density <= (int64_t) ggml_cuda_moe_density_max() && tile < mmq_x_max) { ++ mmq_x_lim = tile; ++ } ++ } + } + } + +diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp +index c83e91f..62a0989 100644 +--- a/tests/test-backend-ops.cpp ++++ b/tests/test-backend-ops.cpp +@@ -8603,6 +8603,22 @@ static std::vector> make_test_cases_eval() { + test_cases.emplace_back(new test_mul_mat_id(GGML_TYPE_MXFP4, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880)); + test_cases.emplace_back(new test_mul_mat_id(GGML_TYPE_Q4_0, GGML_TYPE_F32, 32, 2, false, 2880, 32, 2880)); + ++ // [paged P0] MXFP4/NVFP4 qwen3-30b-a3b MoE decode-density regression gate for the expert- ++ // density-aware mmq_x auto-select (patch 0015). Real expert-FFN slice (128 experts, top-8, ++ // m=768, k=2048) so this exercises the exact grouped FP4-MMA mmq kernel the model runs. 
++ // Per-expert token density = n*n_used/n_mats = n/16; cover the decode band (density 1/4/8/16 ++ // at n 16/64/128/256), ragged token counts (n 33/130/200: experts with 0/1/2 tokens, n not a ++ // multiple of the tile) where the tiny-M col-tiles change geometry and any masking can leak, ++ // and a prefill-density shape (n 512 => density 32) the auto-select must leave on the large ++ // 128 tile. n>=128 is exactly where stock picks mmq_x=128 and the auto-select picks 64, so the ++ // op-test (CPU oracle vs CUDA, deterministic) is the bit-exact regression gate for P1: it must ++ // pass with the auto-select on (default) and with LLAMA_MOE_AUTO_TILE=0 (stock selection). ++ for (ggml_type type_a : {GGML_TYPE_MXFP4, GGML_TYPE_NVFP4}) { ++ for (int n : {16, 33, 64, 128, 130, 200, 256, 512}) { ++ test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 128, 8, false, 768, n, 2048)); ++ } ++ } ++ + for (ggml_type type_a : all_types) { + test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 4, 2, false, 64, 16, 3*ggml_blck_size(type_a))); + } +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch new file mode 100644 index 000000000000..ca7e4040fb36 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0016-paged-dynamic-prefill-budget-continuous-batch.patch @@ -0,0 +1,191 @@ +From 02fa0473a9324b7e12f9b203d221cc4ac80cfd33 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Wed, 24 Jun 2026 10:11:48 +0200 +Subject: [PATCH] feat(paged): dynamic decode-first prefill-token budget (patch + 0016, continuous-batch P1) + +Supersede patch 0013's STATIC per-step prefill cap with a DYNAMIC, +decode-first token budget: the P1 of the token-granular continuous-batch +scheduler. 
POLICY change only inside update_slots(): no new slot states, no +batch-formation rewrite, zero libllama changes. llama-server already emits one +unified mixed prefill+decode batch per step (Phase 1 appends every ready decode +token unconditionally; Phase 2 fills prefill into the same batch). 0016 only +changes the COUNT of prefill tokens admitted per step. + +The budget block already sits AFTER Phase 1's decode fill, so batch.n_tokens +== D (the live decode load) is known there. Instead of 0013's constant +LLAMA_PREFILL_BUDGET (which ignores D, needs per-workload tuning, and lets one +long prompt monopolise the step), compute a dynamic budget: + + T = clamp(LLAMA_MAX_BATCH_TOKENS (default n_batch), n_ubatch, n_batch) + prefill_budget_step = max(n_ubatch, T - D) (leftover after decode, + auto-shrinks as decode load rises so the step never inflates past T) + prefill_cap_per_slot = min(T, ceil(0.04*n_ctx)) floored at n_ubatch, + pinned to n_batch when T == n_batch (LLAMA_PREFILL_CAP overrides) + +Phase 2's inner prompt-fill loop and outer admission break are bounded by +prefill_budget_step (across slots) and a new per-slot slot_prompt_added +counter; the n_batch hard ceiling stays as the compute bound. Decode is +structurally claimed first and never capped (Phase 1), so the decode-first +guarantee is free. + +DEFAULT-OFF BYTE-IDENTICAL: with all knobs unset, behaviour is byte-identical +to stock. The degenerate T == n_batch case is byte-identical to stock/0013 (the +determinism oracle). The legacy LLAMA_PREFILL_BUDGET path is preserved exactly +(honoured only when LLAMA_MAX_BATCH_TOKENS is unset), so 0013 is cleanly +subsumed. Orthogonal to LLAMA_KV_PAGED: pure scheduler policy, identical +decisions paged on or off. 
+ +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + tools/server/server-context.cpp | 107 +++++++++++++++++++++++++------- + 1 file changed, 85 insertions(+), 22 deletions(-) + +diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp +index afcdebe..b8b8f00 100644 +--- a/tools/server/server-context.cpp ++++ b/tools/server/server-context.cpp +@@ -3043,24 +3043,78 @@ private: + int32_t n_batch = llama_n_batch(ctx_tgt); + int32_t n_ubatch = llama_n_ubatch(ctx_tgt); + +- // PAGED serving lever (patch 0013): decoupled per-step prefill-token budget. +- // Analogue of vLLM's --max-num-batched-tokens. Stock llama-server caps the prompt +- // tokens ingested per update_slots() step at n_batch only; with cont_batching the +- // sampled decode tokens of every generating slot are appended FIRST, then prompt +- // tokens fill the batch up to n_batch. A long prompt therefore grabs an ~n_batch +- // chunk in a SINGLE compute-heavy step, spiking the inter-token latency of every +- // co-batched decoder (head-of-line jitter). LLAMA_PREFILL_BUDGET caps the prompt +- // tokens added per step independently of n_batch, splitting a long prefill across +- // more steps so in-flight decode keeps advancing smoothly. Default (env unset or +- // <=0) = disabled => stock behavior is byte-identical. Orthogonal to LLAMA_KV_PAGED +- // (this is a pure scheduler knob; works with paged off). +- int32_t n_prefill_budget = 0; // 0 = disabled (stock n_batch-only chunking) ++ // PAGED serving lever (patch 0016, supersedes 0013): dynamic decode-first ++ // per-step prefill-token budget (continuous-batch scheduler P1). llama-server ++ // already builds ONE mixed batch per update_slots() step: Phase 1 (just above) ++ // appended every generating slot's sampled token UNCONDITIONALLY, so at this point ++ // batch.n_tokens == D is the live decode load; Phase 2 (below) fills the remaining ++ // batch capacity with prompt tokens. 
Patch 0013 capped Phase 2 with a STATIC ++ // constant (LLAMA_PREFILL_BUDGET) that ignores D, needs per-workload tuning, and ++ // lets one long prompt monopolise the step. ++ // ++ // This computes a DYNAMIC budget instead, the vLLM v1 token-budget analogue: ++ // a single total per-step token budget T, decode claims its D tokens first ++ // (already in the batch), and prefill gets the leftover T - D distributed across ++ // waiting prompts with a per-slot chunk cap. As decode load D rises the prefill ++ // leftover auto-shrinks, so the step never inflates past T at any concurrency: ++ // the budget self-tunes across the npl range and across dense vs MoE without a ++ // hand-picked constant (the 161/333 tok/s GB10 decode ceiling is held tuning-free ++ // instead of via 0013's hand-tuned 256). Decode is structurally claimed first and ++ // never capped (Phase 1), so the decode-first guarantee is free here. ++ // ++ // LLAMA_MAX_BATCH_TOKENS (T) total per-step token budget (decode + prefill), ++ // default n_batch, clamped to [n_ubatch, n_batch] so ++ // the compute loop stays a single llama_decode and ++ // prefill keeps an n_ubatch floor of progress. ++ // LLAMA_PREFILL_CAP per-slot max prompt tokens per step (the ++ // long_prefill_token_threshold analogue), default ++ // min(T, ceil(0.04*n_ctx)) floored at n_ubatch, so ++ // one long prompt cannot eat the whole leftover. ++ // LLAMA_PREFILL_BUDGET legacy static cap (patch 0013); honoured ONLY when ++ // LLAMA_MAX_BATCH_TOKENS is unset, for back-compat. ++ // ++ // DEFAULT-OFF BYTE-IDENTICAL: with all three knobs unset, and in the degenerate ++ // T == n_batch case, behaviour is byte-identical to stock. At T == n_batch the ++ // dynamic leftover max(n_ubatch, n_batch - D) and the n_batch per-slot cap both ++ // reach the existing `batch.n_tokens < n_batch` ceiling at the SAME point, so no ++ // new bound fires (the determinism oracle). 
Orthogonal to LLAMA_KV_PAGED: pure ++ // scheduler policy, identical decisions with paged on or off. ++ const int32_t n_decode_in_batch = batch.size(); // D: Phase 1 appended D decode tokens above ++ int32_t prefill_budget_step = 0; // 0 = disabled (stock n_batch-only chunking) ++ int32_t prefill_cap_per_slot = 0; // 0 = disabled (no per-slot prompt-chunk cap) + { +- const char * env_pb = getenv("LLAMA_PREFILL_BUDGET"); +- if (env_pb) { ++ int32_t mbt = 0; ++ if (const char * env_mbt = getenv("LLAMA_MAX_BATCH_TOKENS")) { ++ mbt = atoi(env_mbt); ++ } ++ if (mbt > 0) { ++ // dynamic decode-first budget (P1): T clamped to [n_ubatch, n_batch] ++ int32_t T = std::min(n_batch, mbt); ++ T = std::max(T, n_ubatch); ++ // leftover after decode, floored at n_ubatch so prefill never fully starves ++ prefill_budget_step = std::max(n_ubatch, T - n_decode_in_batch); ++ // per-slot prompt-chunk cap (long_prefill_token_threshold analogue) ++ int32_t cap = 0; ++ if (const char * env_cap = getenv("LLAMA_PREFILL_CAP")) { ++ cap = atoi(env_cap); ++ } ++ if (cap <= 0) { ++ const int32_t pct4 = (n_ctx + 24) / 25; // ceil(0.04 * n_ctx) ++ cap = std::min(T, std::max(n_ubatch, pct4)); ++ } ++ cap = std::min(n_batch, std::max(n_ubatch, cap)); ++ // at T == n_batch the leftover and cap both reach the n_batch ceiling ++ // together; pin the cap to n_batch so this case stays byte-identical ++ if (T >= n_batch) { ++ cap = n_batch; ++ } ++ prefill_cap_per_slot = cap; ++ } else if (const char * env_pb = getenv("LLAMA_PREFILL_BUDGET")) { ++ // legacy static budget (patch 0013), kept for back-compat when the ++ // dynamic knob is unset: a constant per-step prefill cap, no per-slot cap + const int v = atoi(env_pb); + if (v > 0) { +- n_prefill_budget = std::min(n_batch, std::max(1, v)); ++ prefill_budget_step = std::min(n_batch, std::max(1, v)); + } + } + } +@@ -3509,11 +3563,18 @@ private: + const auto & spans = slot.task->params.message_spans; + const auto last_user_pos = 
spans.last_user_message_pos(); + ++ // (patch 0016) per-slot prompt tokens added this step, for the per-slot ++ // chunk cap (resets each slot); n_batch stays the hard compute ceiling ++ int32_t slot_prompt_added = 0; ++ + // add prompt tokens for processing in the current batch +- // (patch 0013) also stop once the per-step prefill budget is spent, so a long +- // prompt is split across more steps and leaves batch room for co-batched decode ++ // (patch 0016) also stop once (a) the dynamic per-step prefill budget ++ // (the T - D leftover) is spent across all slots, or (b) this slot's ++ // per-slot chunk cap is hit, so a long prompt is split across more steps ++ // and leaves batch room for co-batched decode of the other slots + while (slot.prompt.n_tokens() < slot.task->n_tokens() && batch.size() < n_batch && +- (n_prefill_budget == 0 || n_prompt_budgeted < n_prefill_budget)) { ++ (prefill_budget_step == 0 || n_prompt_budgeted < prefill_budget_step) && ++ (prefill_cap_per_slot == 0 || slot_prompt_added < prefill_cap_per_slot)) { + // get next token to process + llama_token cur_tok = input_tokens[slot.prompt.n_tokens()]; + if (cur_tok == LLAMA_TOKEN_NULL) { +@@ -3538,7 +3599,8 @@ private: + slot.prompt.tokens.push_back(cur_tok); + + slot.n_prompt_tokens_processed++; +- n_prompt_budgeted++; // (patch 0013) count toward the per-step prefill budget ++ n_prompt_budgeted++; // (patch 0016) toward the dynamic per-step prefill budget ++ slot_prompt_added++; // (patch 0016) toward this slot's per-step chunk cap + + // stop the prompt batch exactly before a user message + if (spans.is_user_start(slot.prompt.n_tokens())) { +@@ -3624,9 +3686,10 @@ private: + if (!slot_batched) { + slot_batched = &slot; + } +- // (patch 0013) stop adding prompts once the per-step prefill budget is spent, +- // leaving the remaining batch capacity for co-batched decode of other slots +- if (n_prefill_budget > 0 && n_prompt_budgeted >= n_prefill_budget) { ++ // (patch 0016) stop admitting 
prompts once the dynamic per-step prefill ++ // budget (the T - D leftover) is spent, leaving the remaining batch ++ // capacity for co-batched decode of the other slots ++ if (prefill_budget_step > 0 && n_prompt_budgeted >= prefill_budget_step) { + add_ok = false; + } + }); +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0017-fp4-gemm-decode-tile-tune.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0017-fp4-gemm-decode-tile-tune.patch new file mode 100644 index 000000000000..19960ed81958 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0017-fp4-gemm-decode-tile-tune.patch @@ -0,0 +1,245 @@ +From 089f78d2a2c04465a566d499dbe0a67c008435a8 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Wed, 24 Jun 2026 19:56:05 +0200 +Subject: [PATCH] feat(paged): FP4 decode GEMM track-B P0 gate + default-off + occupancy instrumentation (patch 0017) + +Track B targets the dense NVFP4 weight GEMM (~59% of the GB10 decode step). This lands the P0 +bit-exact parity gate and the P1 occupancy levers (default-off / byte-identical) and records the +honest P1 result: the cheap host/occupancy tuning does NOT lift decode_agg on GB10 (sm_121) - the +kill-gate tripped - so nothing is enabled by default. + +P0 gate (tests/test-backend-ops.cpp): NVFP4/MXFP4 dense decode-shape MUL_MAT cases at the weight- +row tiling boundary (m in {2048,1600,2050} = exact + ragged vs mmq_y 64/128, n in {32,128} = decode +M, k=2048), so the bit-exact CPU-vs-CUDA oracle covers the mmq_y / min-blocks paths. Green at +default and with every lever on: MUL_MAT 1115/1115, MUL_MAT_ID 805/805, NVFP4 0 fail. + +P1 levers (ggml/src/ggml-cuda/mmq.cuh), all default-off => default build byte-identical to stock: + - GGML_CUDA_FP4_MMQ_Y (default 128): type-aware get_mmq_y_host/device plumbing for an NVFP4 + weight-row tile override. 
mmq_y is rigidly nwarps*tile_C::I (=8*16=128, the mmq.cuh static_ + assert), so mmq_y<128 also needs nwarps-down (a warp-remap through the shared vec_dot/loader), + left as the P2 kernel change; the host/device plumbing is in place and inert. + - GGML_CUDA_FP4_MINBLOCKS (default 1): NVFP4-only __launch_bounds__ min-resident-CTAs lever + (register-cap the FP4-MMA kernel so >1 CTA co-resides) - the bounded occupancy probe. + - GGML_CUDA_FP4_DENSE_MMQ_X (env, default off): dense col-tile re-read occupancy diagnostic. + +Measured GB10 (llama-batched-bench -fa on -npp 128 -ntg 128 -npl 32,128), decode_agg (S_TG): + DENSE q36-27b-nvfp4 @npl128: P0 149.5 -> MINBLOCKS=2 147.9 (-1.1%) -> DENSE_MMQ_X=64 144.3 + (-3.5%) -> =32 141.7 (-5.2%). Every occupancy probe regresses. + MoE q36-35b-a3b-nvfp4 @npl128: stock 336.3, MINBLOCKS=2 337.7 (+0.4%, noise), TILE16 324.0 + (-3.7%), TILE8 316.6 (-5.9%). mmq_x-down regresses (reproduces patch 0015; GDN/BW-bound). + +nsys (kill-gate evidence): the decode FP4 GEMM mul_mat_q went 2.782s -> 3.025s +(avg 608us -> 661us, +8.7% slower) under MINBLOCKS=2 - register-capping spills, so occupancy did +not usefully rise. Verdict: the dense M=128 tile is already weight-read/one-read-optimal at +mmq_x=128, NOT occupancy-starved via the cheap levers; the only untested lever is the structural +mmq_y-down (nwarps=4 warp-remap), deferred to P2. Bit-exact gate holds throughout. 
+ +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + ggml/src/ggml-cuda/mmq.cuh | 85 ++++++++++++++++++++++++++++++++++---- + tests/test-backend-ops.cpp | 16 +++++++ + 2 files changed, 92 insertions(+), 9 deletions(-) + +diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh +index 9718b12..b53e38a 100644 +--- a/ggml/src/ggml-cuda/mmq.cuh ++++ b/ggml/src/ggml-cuda/mmq.cuh +@@ -140,7 +140,24 @@ static constexpr __device__ int get_mmq_x_max_device() { + #endif // defined(AMD_MFMA_AVAILABLE) || defined(TURING_MMA_AVAILABLE) || defined(AMD_WMMA_AVAILABLE) + } + +-static int get_mmq_y_host(const int cc) { ++// [paged patch 0017 / track B] Dense NVFP4 decode mmq_y (weight-row tile) override. ++// mmq_y tiles the N (weight-row) dimension of the FP4-MMA weight GEMM. Lowering it raises the ++// number of resident CTAs (smaller per-CTA shared footprint + smaller per-thread accumulator) to ++// hide LPDDR5x weight-load latency at the M=128 decode tile, WITHOUT re-reading weights: every ++// weight row lives in exactly one row-tile, so total weight traffic is unchanged (bandwidth- ++// neutral) - the dense-decode occupancy lever from FP4_GEMM_SCOPE_B.md s3/s4.1. mmq_y is a PURE ++// N-row tiling knob: the per-output reduction over K is identical for any mmq_y, so the result ++// stays BIT-EXACT (gated by test-backend-ops MUL_MAT NVFP4 decode shapes). Default 128 == exact ++// stock behaviour (a default build is byte-identical to stock); build -DGGML_CUDA_FP4_MMQ_Y=64 ++// (or 96) to enable the tune. Applies ONLY to NVFP4 on Blackwell; every other type/arch untouched. ++#ifndef GGML_CUDA_FP4_MMQ_Y ++#define GGML_CUDA_FP4_MMQ_Y 128 ++#endif ++ ++static int get_mmq_y_host(const int cc, const ggml_type type = GGML_TYPE_COUNT) { ++ if (GGML_CUDA_FP4_MMQ_Y != 128 && type == GGML_TYPE_NVFP4 && blackwell_mma_available(cc)) { ++ return GGML_CUDA_FP4_MMQ_Y; ++ } + return GGML_CUDA_CC_IS_AMD(cc) ? (GGML_CUDA_CC_IS_RDNA1(cc) ? 
64 : 128) : + ((GGML_CUDA_CC_IS_NVIDIA(cc) && ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_VOLTA) ? 128 : 64); + } +@@ -154,7 +171,13 @@ if (type == GGML_TYPE_NVFP4 || type == GGML_TYPE_MXFP4) { + return MMQ_ITER_K; + } + ++template + static constexpr __device__ int get_mmq_y_device() { ++#if defined(BLACKWELL_MMA_AVAILABLE) ++ if (type == GGML_TYPE_NVFP4 && GGML_CUDA_FP4_MMQ_Y != 128) { ++ return GGML_CUDA_FP4_MMQ_Y; ++ } ++#endif // defined(BLACKWELL_MMA_AVAILABLE) + #if defined(GGML_USE_HIP) + #if defined(RDNA1) + return 64; +@@ -170,6 +193,28 @@ static constexpr __device__ int get_mmq_y_device() { + #endif // defined(GGML_USE_HIP) + } + ++// [paged patch 0017 / track B] Dense NVFP4 decode occupancy lever: min resident CTAs per SM. ++// The FP4-MMA mul_mat_q is REGISTER-bound to 1 CTA/SM (__launch_bounds__(256,1) => ~255 regs/thread ++// => one resident block, the under-occupancy that strands the kernel at ~3% of FP4 peak at M=128). ++// Raising the __launch_bounds__ min-blocks operand register-caps the compiler so N CTAs co-reside, ++// hiding LPDDR5x weight-load latency by CTA-parallelism (the scope s4.1 occupancy goal) WITHOUT a ++// structural mmq_y/nwarps change and WITHOUT extra weight reads (each weight tile still read once). ++// Register allocation cannot change results => BIT-EXACT (gated by test-backend-ops MUL_MAT NVFP4). ++// Default 1 == exact stock behaviour (byte-identical); build -DGGML_CUDA_FP4_MINBLOCKS=2 to enable. ++// Applies ONLY to NVFP4 on Blackwell; every other type/arch keeps the stock min-blocks. 
++#ifndef GGML_CUDA_FP4_MINBLOCKS ++#define GGML_CUDA_FP4_MINBLOCKS 1 ++#endif ++template ++static constexpr __device__ int mmq_get_min_blocks_device(const int stock) { ++#if defined(BLACKWELL_MMA_AVAILABLE) ++ if (type == GGML_TYPE_NVFP4 && GGML_CUDA_FP4_MINBLOCKS != 1) { ++ return GGML_CUDA_FP4_MINBLOCKS; ++ } ++#endif // defined(BLACKWELL_MMA_AVAILABLE) ++ return stock; ++} ++ + // Decouple shared memory tile sizes from WARP_SIZE to allow for different warp sizes. + // The K dimension of the tiles has either, + // 1*MMQ_TILE_NE_K==32 (always for TILE_Y_K) or 2*MMQ_TILE_NE_K==64 (typically for TILE_X_K), +@@ -3454,7 +3499,7 @@ static __device__ __forceinline__ void mul_mat_q_process_tile( + constexpr int warp_size = ggml_cuda_get_physical_warp_size(); + constexpr int nwarps = mmq_get_nwarps_device(); + constexpr int qk = ggml_cuda_type_traits::qk; +- constexpr int mmq_y = get_mmq_y_device(); ++ constexpr int mmq_y = get_mmq_y_device(); + constexpr load_tiles_mmq_t load_tiles = mmq_type_traits::load_tiles; + + extern __shared__ int data_mul_mat_q[]; +@@ -3531,13 +3576,13 @@ static __device__ __forceinline__ void mul_mat_q_process_tile( + template + #if defined(GGML_USE_HIP) + #if defined(RDNA4) || defined(RDNA3) || defined(RDNA2) || defined(CDNA) || defined(GCN) +- __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 2) ++ __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device(2)) + #endif // defined(RDNA4) || defined(RDNA3) || defined(RDNA2) || defined(CDNA) || defined(GCN) + #else + #if __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA +- __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 1) ++ __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), mmq_get_min_blocks_device(1)) + #else +- __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 2) ++ __launch_bounds__(ggml_cuda_get_physical_warp_size()*mmq_get_nwarps_device(), 
mmq_get_min_blocks_device(2)) + #endif // __CUDA_ARCH__ >= GGML_CUDA_CC_VOLTA + #endif // defined(GGML_USE_HIP) + static __global__ void mul_mat_q( +@@ -3558,7 +3603,7 @@ static __global__ void mul_mat_q( + constexpr int warp_size = ggml_cuda_get_physical_warp_size(); + + constexpr int qk = ggml_cuda_type_traits::qk; +- constexpr int mmq_y = get_mmq_y_device(); ++ constexpr int mmq_y = get_mmq_y_device(); + + const uint32_t nty = (nrows_x + mmq_y - 1) / mmq_y; // Number of tiles y + +@@ -3790,7 +3835,7 @@ static __global__ void mul_mat_q_stream_k_fixup( + float * __restrict__ tmp_last_tile, const uint3 blocks_per_ne00, const int nrows_x, const int ncols_dst, + const int stride_col_dst, const uint3 nchannels_y, const int stride_channel_dst, const uint3 nsamples_y, + const int stride_sample_dst, const uint3 ntx) { +- constexpr int mmq_y = get_mmq_y_device(); ++ constexpr int mmq_y = get_mmq_y_device(); + constexpr int qk = ggml_cuda_type_traits::qk; + constexpr int ITER_K = get_iter_k(type); + constexpr int blocks_per_iter = ITER_K / qk; +@@ -3947,7 +3992,7 @@ static void launch_mul_mat_q(ggml_backend_cuda_context & ctx, const mmq_args & a + const int nsm = ggml_cuda_info().devices[id].nsm; + const int warp_size = ggml_cuda_info().devices[id].warp_size; + const int nwarps = mmq_get_nwarps_host(cc, warp_size); +- const int mmq_y = get_mmq_y_host(cc); ++ const int mmq_y = get_mmq_y_host(cc, type); + + const dim3 block_dims(warp_size, nwarps, 1); + +@@ -4103,6 +4148,21 @@ static inline int ggml_cuda_moe_density_max() { + return d; + } + ++// [paged patch 0017 / track B] DENSE NVFP4 decode mmq_x re-read occupancy DIAGNOSTIC (env, default off). ++// GGML_CUDA_FP4_DENSE_MMQ_X= caps the dense (non-MoE) NVFP4 col-tile to , splitting the M=128 ++// decode ubatch into ceil(128/n) col-tiles. Each col-tile re-reads the full weight set (fatal cost ++// in the BW-bound regime) but multiplies resident CTAs. 
This is the scope s4.1 A/B probe: if ++// decode_agg RISES with cap=64 despite the 2x weight read, occupancy is badly broken (the kernel is ++// compute/occupancy-bound, so mmq_y-down / min-blocks has large upside); if it FALLS, the tile is ++// already bandwidth-saturated and the occupancy ceiling is lower. Unset/<=0 => stock selection. ++static inline int ggml_cuda_fp4_dense_mmq_x_cap() { ++ static const int c = []() -> int { ++ const char * s = getenv("GGML_CUDA_FP4_DENSE_MMQ_X"); ++ return s ? atoi(s) : 0; ++ }(); ++ return c; ++} ++ + template + void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cudaStream_t stream) { + const int id = ggml_cuda_get_device(); +@@ -4112,7 +4172,7 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda + const int nwarps = mmq_get_nwarps_host(cc, warp_size); + + const int mmq_x_max = get_mmq_x_max_host(cc); +- const int mmq_y = get_mmq_y_host(cc); ++ const int mmq_y = get_mmq_y_host(cc, type); + + // [paged patch 0015] expert-density-aware MoE token-tile (mmq_x) auto-select (DEFAULT-ON). + // On the MUL_MAT_ID grouped-GEMM path (expert_bounds != nullptr) the GEMM columns are tokens +@@ -4145,6 +4205,13 @@ void mul_mat_q_case(ggml_backend_cuda_context & ctx, const mmq_args & args, cuda + // - LLAMA_MOE_AUTO_TILE=0 : disable the auto-select (exact stock selection). + // - LLAMA_MOE_DECODE_TILE=, LLAMA_MOE_DENSITY_MAX= : tune the tile / threshold. + int mmq_x_lim = mmq_x_max; ++ if (args.expert_bounds == nullptr && type == GGML_TYPE_NVFP4) { ++ // dense NVFP4 decode mmq_x re-read occupancy diagnostic (see ggml_cuda_fp4_dense_mmq_x_cap). ++ const int cap = ggml_cuda_fp4_dense_mmq_x_cap(); ++ if (cap > 0 && cap < mmq_x_max) { ++ mmq_x_lim = cap < 8 ? 
8 : cap; ++ } ++ } + if (args.expert_bounds != nullptr) { + const int moe_cap = ggml_cuda_moe_mmq_x_cap(); + if (moe_cap > 0) { +diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp +index f219309..291c275 100644 +--- a/tests/test-backend-ops.cpp ++++ b/tests/test-backend-ops.cpp +@@ -8591,6 +8591,22 @@ static std::vector> make_test_cases_eval() { + } + } + ++ // [paged P0 / track B] NVFP4/MXFP4 dense decode-shape mmq_y-down bit-exact gate. ++ // The dense FP4 weight GEMM is the track-B target; P1 lowers mmq_y (the weight-row tile) on the ++ // NVFP4 decode path to raise resident-CTA occupancy. mmq_y is a pure N-row tiling knob, so a ++ // smaller mmq_y must stay BIT-EXACT (identical per-output reduction over K) - this gate proves ++ // it. m = weight rows (N, tiled by mmq_y): 2048 (exact at mmq_y 64 & 128), 1600 (ragged vs 128), ++ // 2050 (ragged vs both 64 & 128 -> exercises the need_check last-row-tile at both). n = decode ++ // token count M = 32 and 128 (the scope decode shapes, tiled by mmq_x). k = 2048 hidden. Must ++ // pass with the default build (mmq_y=128) AND a mmq_y=64 build, CUDA-vs-CPU oracle, bit-exact. 
++ for (ggml_type type_a : {GGML_TYPE_MXFP4, GGML_TYPE_NVFP4}) { ++ for (int64_t m : {2048, 1600, 2050}) { ++ for (int64_t n : {32, 128}) { ++ test_cases.emplace_back(new test_mul_mat(type_a, GGML_TYPE_F32, m, n, 2048, {1, 1}, {1, 1})); ++ } ++ } ++ } ++ + for (ggml_type type_a : all_types) { + test_cases.emplace_back(new test_mul_mat_id(type_a, GGML_TYPE_F32, 4, 2, false, 64, 16, 3*ggml_blck_size(type_a))); + } +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0018-qwen35-ssm-decode-inplace-state.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0018-qwen35-ssm-decode-inplace-state.patch new file mode 100644 index 000000000000..2db002a6617b --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0018-qwen35-ssm-decode-inplace-state.patch @@ -0,0 +1,349 @@ +From 17f16e8f6d8dbc689d5151c44759792d683c957b Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Thu, 25 Jun 2026 00:44:13 +0200 +Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet in-place SSM state + write-back (patch 0018) + +Decode on the Qwen3.6 hybrid-SSM models (arch qwen35, 48 gated-DeltaNet : +16 full-attention layers) was dominated by recurrent-state plumbing, not the +FP4 GEMM. Per SSM layer per step the fused gated_delta_net op wrote its new +recurrent state into graph scratch, then a separate ggml_cpy persisted it into +the recurrent-state cache. nsys attributed 18.9% of decode GPU time to that +~225 MB/copy D2D memcpy (1584 ops, 356 GB over the A2 decompose window). + +This mirrors vLLM fused_recurrent_gated_delta_rule (state kept in place): +ggml_gated_delta_net_inplace writes the final recurrent state directly into the +active sequences contiguous cache slot (at kv_head), removing the copy-back. The +op output then carries only the attention scores; the SSM arithmetic is +unchanged (bit-identical greedy output vs the copy-back baseline). 
+ +- new op builder ggml_gated_delta_net_inplace (src[6] = state_dst cache view) +- CUDA + CPU honor src[6]; final-state (K==1, keep_rs off) write redirected there +- delta-net-base build_recurrent_attn uses it on the fused decode/prefill path, + dropping the ggml_cpy; rollback (n_rs_seq>0) path unchanged + +Measured (q36-27b-nvfp4, decode_agg S_TG, npp128 ntg128, -fa on, paged on): + npl 32 : 113.74 -> 136.39 t/s (+19.9 percent) + npl 128: 146.23 -> 180.53 t/s (+23.5 percent, = predicted copy-removal ceiling) +MoE q36-35b-a3b-nvfp4: npl128 313.36 -> 372.62 t/s (+18.9 percent). +nsys D2D memcpy bucket 18.9 -> 0.23 percent (356 -> 2.93 GB). vLLM share +(391 @128) 37.4 -> 46.2 percent. get_rows state gather (now 18.8 percent) is the +next lever. + +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + ggml/include/ggml.h | 14 ++++++ + ggml/src/ggml-cpu/ops.cpp | 13 ++++- + ggml/src/ggml-cuda/gated_delta_net.cu | 39 ++++++++++----- + ggml/src/ggml.c | 68 +++++++++++++++++++++++++++ + src/models/delta-net-base.cpp | 30 ++++++++++++ + 5 files changed, 152 insertions(+), 12 deletions(-) + +diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h +index 823f5a9..4e7ab32 100644 +--- a/ggml/include/ggml.h ++++ b/ggml/include/ggml.h +@@ -2579,6 +2579,20 @@ extern "C" { + struct ggml_tensor * state, + int64_t K); + ++ // same recurrence as ggml_gated_delta_net with K == 1, but the final recurrent state is written ++ // in place into state_dst (a view into the recurrent-state cache) instead of being appended to ++ // the op output, eliminating the per-step state copy-back during decode. state_dst must be a ++ // contiguous [S_v*S_v*H, n_seqs] view (per-seq stride == dense state size). 
++ GGML_API struct ggml_tensor * ggml_gated_delta_net_inplace( ++ struct ggml_context * ctx, ++ struct ggml_tensor * q, ++ struct ggml_tensor * k, ++ struct ggml_tensor * v, ++ struct ggml_tensor * g, ++ struct ggml_tensor * beta, ++ struct ggml_tensor * state, ++ struct ggml_tensor * state_dst); ++ + // custom operators + + typedef void (*ggml_custom1_op_t)(struct ggml_tensor * dst , const struct ggml_tensor * a, int ith, int nth, void * userdata); +diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp +index 63c07a2..9457add 100644 +--- a/ggml/src/ggml-cpu/ops.cpp ++++ b/ggml/src/ggml-cpu/ops.cpp +@@ -10600,6 +10600,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk( + ggml_tensor * src_g = dst->src[3]; + ggml_tensor * src_beta = dst->src[4]; + ggml_tensor * src_state = dst->src[5]; ++ ggml_tensor * src_state_dst = dst->src[6]; // optional in-place final-state write-back target + + const int64_t S_v = src_v->ne[0]; + const int64_t H = src_v->ne[1]; +@@ -10660,6 +10661,16 @@ static void ggml_compute_forward_gated_delta_net_one_chunk( + + const float scale = 1.0f / sqrtf((float) S_v); + ++ // when src_state_dst is provided (in-place decode write-back) the final state is written ++ // directly into the persistent cache view, removing the separate state copy-back node. ++ float * inplace_state_base = nullptr; ++ if (src_state_dst != nullptr) { ++ GGML_ASSERT(K == 1); ++ GGML_ASSERT(src_state_dst->nb[0] == sizeof(float)); ++ GGML_ASSERT(src_state_dst->nb[1] == (size_t) S_v * S_v * H * sizeof(float)); ++ inplace_state_base = (float *) src_state_dst->data; ++ } ++ + for (int64_t ir = ir0; ir < ir1; ++ir) { + const int64_t iv1 = ir % H; // head_index + const int64_t iv3 = ir / H; // sequence +@@ -10674,7 +10685,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk( + // For K>1, work in scratch and copy out per-token when the slot is in range. + float * s_out = (K > 1) + ? 
state_work +- : state_out_base + (iv3 * H + iv1) * S_v * S_v; ++ : (inplace_state_base ? inplace_state_base : state_out_base) + (iv3 * H + iv1) * S_v * S_v; + + // copy input state into the working buffer and operate in-place + // state layout [S_v, S_v, H, n_seqs]: seq iv3 starts at iv3 * state_seq_stride. +diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu +index a547360..61a2b91 100644 +--- a/ggml/src/ggml-cuda/gated_delta_net.cu ++++ b/ggml/src/ggml-cuda/gated_delta_net.cu +@@ -25,7 +25,8 @@ gated_delta_net_cuda(const float * q, + const uint3 neqk1_magic, + const uint3 rq3_magic, + float scale, +- int K) { ++ int K, ++ float * state_dst) { + const uint32_t h_idx = blockIdx.x; + const uint32_t sequence = blockIdx.y; + // each warp owns one column, using warp-level primitives to reduce across rows +@@ -37,7 +38,10 @@ gated_delta_net_cuda(const float * q, + + const int64_t attn_score_elems = S_v * H * n_tokens * n_seqs; + float * attn_data = dst; +- float * state = dst + attn_score_elems; ++ // when state_dst is provided (in-place decode write-back) the final recurrent state is written ++ // directly into the persistent cache view instead of being appended to the op output; this ++ // eliminates the per-layer per-step D2D state copy-back. Only used when keep_rs_t == false. ++ float * state = (state_dst != nullptr) ? state_dst : (dst + attn_score_elems); + + // input state holds s0 only: [S_v, S_v, H, n_seqs] — seq stride is D = H * S_v * S_v. + // output state layout (per-slot D * n_seqs) — same per-(seq,head) offset as before. 
+@@ -171,7 +175,7 @@ template + static void launch_gated_delta_net( + const float * q_d, const float * k_d, const float * v_d, + const float * g_d, const float * b_d, const float * s_d, +- float * dst_d, ++ float * dst_d, float * state_dst_d, + int64_t S_v, int64_t H, int64_t n_tokens, int64_t n_seqs, + int64_t sq1, int64_t sq2, int64_t sq3, + int64_t sv1, int64_t sv2, int64_t sv3, +@@ -195,26 +199,26 @@ static void launch_gated_delta_net( + ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params, + q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, + n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, +- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K); ++ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d); + break; + case 32: + ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params, + q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, + n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, +- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K); ++ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d); + break; + case 64: { + ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params, + q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, + n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, +- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K); ++ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d); + break; + } + case 128: { + ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params, + q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, + n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, +- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K); ++ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d); + break; + } + default: +@@ -230,6 +234,7 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor * + ggml_tensor * src_g = dst->src[3]; + ggml_tensor * src_beta = dst->src[4]; + ggml_tensor * src_state = dst->src[5]; ++ ggml_tensor * src_state_dst = dst->src[6]; // optional 
in-place state write-back target + + GGML_TENSOR_LOCALS(int64_t, neq, src_q, ne); + GGML_TENSOR_LOCALS(size_t , nbq, src_q, nb); +@@ -260,6 +265,15 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor * + const float * s_d = (const float *) src_state->data; + float * dst_d = (float *) dst->data; + ++ float * state_dst_d = nullptr; ++ if (src_state_dst != nullptr) { ++ // in-place final-state cache view: per-seq stride must be the dense state size D = S_v*S_v*H ++ GGML_ASSERT(src_state_dst->type == GGML_TYPE_F32); ++ GGML_ASSERT(src_state_dst->nb[0] == sizeof(float)); ++ GGML_ASSERT(src_state_dst->nb[1] == (size_t) S_v * S_v * H * sizeof(float)); ++ state_dst_d = (float *) src_state_dst->data; ++ } ++ + GGML_ASSERT(ggml_is_contiguous_rows(src_q)); + GGML_ASSERT(ggml_is_contiguous_rows(src_k)); + GGML_ASSERT(ggml_is_contiguous_rows(src_v)); +@@ -288,23 +302,26 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor * + const int K = ggml_get_op_params_i32(dst, 0); + const bool keep_rs = K > 1; + ++ // in-place write-back is only valid for the single-snapshot (final-state) case ++ GGML_ASSERT(state_dst_d == nullptr || !keep_rs); ++ + if (kda) { + if (keep_rs) { +- launch_gated_delta_net(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, ++ launch_gated_delta_net(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, + S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, + sb1, sb2, sb3, neqk1, rq3, scale, K, stream); + } else { +- launch_gated_delta_net(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, ++ launch_gated_delta_net(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, + S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, + sb1, sb2, sb3, neqk1, rq3, scale, K, stream); + } + } else { + if (keep_rs) { +- launch_gated_delta_net(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, ++ launch_gated_delta_net(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, + S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, + sb1, sb2, sb3, neqk1, rq3, scale, 
K, stream); + } else { +- launch_gated_delta_net(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, ++ launch_gated_delta_net(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, + S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, + sb1, sb2, sb3, neqk1, rq3, scale, K, stream); + } +diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c +index adbe52b..b8d34bf 100644 +--- a/ggml/src/ggml.c ++++ b/ggml/src/ggml.c +@@ -6285,6 +6285,74 @@ struct ggml_tensor * ggml_gated_delta_net( + return result; + } + ++// ggml_gated_delta_net_inplace ++// ++// Same recurrence as ggml_gated_delta_net with K == 1, but the final recurrent state is written ++// in place into `state_dst` (a view into the persistent recurrent-state cache) instead of being ++// appended to the op output. This removes the per-layer per-step D2D state copy-back during decode. ++// The op output holds ONLY the attention scores; the state region is still allocated (unused) so ++// the attention-output view layout is identical to ggml_gated_delta_net. 
++struct ggml_tensor * ggml_gated_delta_net_inplace( ++ struct ggml_context * ctx, ++ struct ggml_tensor * q, ++ struct ggml_tensor * k, ++ struct ggml_tensor * v, ++ struct ggml_tensor * g, ++ struct ggml_tensor * beta, ++ struct ggml_tensor * state, ++ struct ggml_tensor * state_dst) { ++ GGML_ASSERT(ggml_is_contiguous_rows(q)); ++ GGML_ASSERT(ggml_is_contiguous_rows(k)); ++ GGML_ASSERT(ggml_is_contiguous_rows(v)); ++ GGML_ASSERT(ggml_is_contiguous(g)); ++ GGML_ASSERT(ggml_is_contiguous(beta)); ++ GGML_ASSERT(ggml_is_contiguous(state)); ++ ++ GGML_ASSERT(q->type == GGML_TYPE_F32); ++ GGML_ASSERT(k->type == GGML_TYPE_F32); ++ GGML_ASSERT(v->type == GGML_TYPE_F32); ++ GGML_ASSERT(g->type == GGML_TYPE_F32); ++ GGML_ASSERT(beta->type == GGML_TYPE_F32); ++ GGML_ASSERT(state->type == GGML_TYPE_F32); ++ GGML_ASSERT(state_dst != NULL); ++ GGML_ASSERT(state_dst->type == GGML_TYPE_F32); ++ ++ const int64_t S_v = v->ne[0]; ++ const int64_t H = v->ne[1]; ++ const int64_t n_tokens = v->ne[2]; ++ const int64_t n_seqs = v->ne[3]; ++ ++ GGML_ASSERT(g->ne[0] == 1 || g->ne[0] == S_v); ++ GGML_ASSERT(beta->ne[0] == 1); ++ ++ GGML_ASSERT(state->ne[0] == S_v); ++ GGML_ASSERT(state->ne[1] == S_v); ++ GGML_ASSERT(state->ne[2] == H); ++ GGML_ASSERT(state->ne[3] == n_seqs); ++ ++ // state_dst holds the per-seq final state contiguously: [S_v*S_v*H, >= n_seqs] ++ GGML_ASSERT(state_dst->ne[0] == S_v * S_v * H); ++ GGML_ASSERT(state_dst->ne[1] >= n_seqs); ++ GGML_ASSERT(state_dst->nb[0] == sizeof(float)); ++ ++ const int64_t state_rows = S_v * n_seqs; // K == 1 ++ const int64_t ne[4] = { S_v * H, n_tokens * n_seqs + state_rows, 1, 1 }; ++ struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne); ++ ++ ggml_set_op_params_i32(result, 0, 1); // K == 1 ++ ++ result->op = GGML_OP_GATED_DELTA_NET; ++ result->src[0] = q; ++ result->src[1] = k; ++ result->src[2] = v; ++ result->src[3] = g; ++ result->src[4] = beta; ++ result->src[5] = state; ++ result->src[6] = state_dst; ++ ++ 
return result; ++} ++ + //////////////////////////////////////////////////////////////////////////////// + + struct ggml_hash_set ggml_hash_set_new(size_t size) { +diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp +index ad9ce77..26a718b 100644 +--- a/src/models/delta-net-base.cpp ++++ b/src/models/delta-net-base.cpp +@@ -546,6 +546,36 @@ ggml_tensor * llm_build_delta_net_base::build_recurrent_attn( + const bool keep = cparams.n_rs_seq > 0; + + if (!keep) { ++ const bool fused = (n_seq_tokens == 1) ? cparams.fused_gdn_ar : cparams.fused_gdn_ch; ++ ++ if (fused) { ++ // In-place state write-back: the fused gated-DeltaNet op writes the new recurrent state ++ // directly into the persistent cache slot for the active sequences (a contiguous block ++ // at kv_head), eliminating the per-layer per-step ~full-state D2D copy-back that ++ // dominated decode. The op output then carries only the attention scores. ++ ggml_tensor * state_dst = ggml_view_2d(ctx0, ssm_states_all, hparams.n_embd_s(), n_seqs, ++ ssm_states_all->nb[1], kv_head * hparams.n_embd_s() * ggml_element_size(ssm_states_all)); ++ ++ ggml_tensor * result = ggml_gated_delta_net_inplace(ctx0, q, k, v, g, b, s, state_dst); ++ if (n_seq_tokens == 1) { ++ cb(result, LLAMA_TENSOR_NAME_FGDN_AR, il); ++ } else { ++ cb(result, LLAMA_TENSOR_NAME_FGDN_CH, il); ++ } ++ ++ ggml_tensor * output = ggml_view_4d(ctx0, result, ++ S_v, H_v, n_seq_tokens, n_seqs, ++ ggml_row_size(result->type, S_v), ++ ggml_row_size(result->type, S_v * H_v), ++ ggml_row_size(result->type, S_v * H_v * n_seq_tokens), 0); ++ cb(output, "attn_output", il); ++ ++ // the state write is a side effect of the op; pull the op into the graph via the output ++ ggml_build_forward_expand(gf, output); ++ ++ return output; ++ } ++ + auto attn_out = build_delta_net(q, k, v, g, b, s, il); + ggml_tensor * output = attn_out.first; + ggml_tensor * new_state = attn_out.second; +-- +2.43.0 + diff --git 
a/backend/cpp/llama-cpp-localai-paged/patches/paged/0019-qwen35-ssm-decode-fused-gather.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0019-qwen35-ssm-decode-fused-gather.patch
new file mode 100644
index 000000000000..a7e653d70c6a
--- /dev/null
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0019-qwen35-ssm-decode-fused-gather.patch
@@ -0,0 +1,583 @@
+From 46d7dd80bbce7f3c1dbf9363d6527c8c9b687a6b Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto
+Date: Thu, 25 Jun 2026 01:45:02 +0200
+Subject: [PATCH] feat(paged): qwen35 SSM decode fused recurrent-state gather
+ (patch 0019)
+
+Step 2 of the SSM decode-throughput work. After Step 1 (in-place state
+write-back, patch 0018) the largest non-GEMM decode bucket was the recurrent-
+state get_rows gather (18.8% of decode GPU time): build_rs materialized each
+sequence's prior state into a contiguous scratch via ggml_get_rows before the
+gated-DeltaNet op read it.
+
+This eliminates that materialization, mirroring ggml_ssm_scan's ids source.
+ggml_gated_delta_net_inplace_ids takes the FULL recurrent-state cache plus the
+s_copy ids (src[5] = full cache, src[7] = ids, op_param[1] = rs_head) and reads
+each sequence's prior state directly from cache[ids[seq]]. Combined with Step 1's
+in-place write the op now reads AND writes the cache directly: no recurrent-state
+materialization at all. build_recurrent_attn feeds the full cache + ids through
+the build_rs get_state_rows lambda exactly like mamba-base, keeping the rs_zero
+clear and the extra-states copy around the op.
+
+Race-free by construction on CUDA. In-place write plus an ids read of the same
+cache is only safe when read slot == write slot; s_copy is identity
+(rs_head + s) for stable continuing sequences (the whole AR decode path) but can
+remap on reorder or rs_zero (e.g. multiple new sequences in one prefill ubatch).
+The recurrence kernel handles both per (seq, head) block on device: identity
+sequences read s0 in place from the destination slot (the kernel loads all of s0
+into registers before writing, so reading and writing the same slot is safe),
+and non-identity sequences read from a disjoint scratch that a small gather
+kernel copies from cache[ids[seq]] first, so the recurrence never reads a slot
+another block writes. The CPU op mirrors this (host identity check + a serial
+gather in the dispatcher). ids stays a device pointer (read only in-kernel; it is
+device-resident at op-execute time). Bit-identical to the get_rows path in every
+case.
+
+- new builder ggml_gated_delta_net_inplace_ids; CUDA gather kernel
+  (gdn_gather_nonident) + per-block read-base select in gated_delta_net_cuda;
+  CPU identity guard + serial gather fallback in the dispatcher
+- delta-net-base build_recurrent_attn gains a gather-free overload; qwen35 and
+  qwen35moe drop the pre-gather. qwen3next, kimi-linear, the non-fused path and
+  the rollback (n_rs_seq > 0) path are unchanged.
+
+Measured (decode_agg S_TG, npp128 ntg128, -fa on, paged on, fusion off):
+  dense q36-27b-nvfp4  : npl  32  137.64 -> 170.68 (+24.0 percent)
+                         npl 128  186.25 -> 256.57 (+37.8 percent, 47.6 -> 65.6 percent of vLLM 391)
+  MoE q36-35b-a3b-nvfp4: npl  32  299.68 -> 366.69 (+22.4 percent)
+                         npl 128  409.30 -> 553.63 (+35.3 percent)
+Greedy (--temp 0 --seed 1) llama-completion bit-identical vs the Step-1 build
+(dense model text md5 match, MoE byte-identical, step2 run1 == run2). nsys
+k_get_rows_float bucket 18.8 -> 0.7 percent; the new gdn_gather_nonident kernel
+is 1.7 percent (no-op at decode, median 1.2 us). The residual decode gap to vLLM
+is now the FP4 GEMM (~48 percent of decode), a separate kernel track.
+ +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + ggml/include/ggml.h | 17 ++++++ + ggml/src/ggml-cpu/ops.cpp | 49 ++++++++++++++- + ggml/src/ggml-cuda/gated_delta_net.cu | 85 ++++++++++++++++++++++---- + ggml/src/ggml.c | 76 +++++++++++++++++++++++ + src/models/delta-net-base.cpp | 63 ++++++++++++++++++++ + src/models/models.h | 13 ++++ + src/models/qwen35.cpp | 6 +- + src/models/qwen35moe.cpp | 6 +- + 8 files changed, 292 insertions(+), 23 deletions(-) + +diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h +index 4e7ab32..951dd21 100644 +--- a/ggml/include/ggml.h ++++ b/ggml/include/ggml.h +@@ -2593,6 +2593,23 @@ extern "C" { + struct ggml_tensor * state, + struct ggml_tensor * state_dst); + ++ // Step 2: same recurrence as ggml_gated_delta_net_inplace, but the prior recurrent state is read ++ // directly from the full state cache via per-sequence indices (ids == s_copy), mirroring ++ // ggml_ssm_scan, instead of from a materialized ggml_get_rows gather. `state` is the FULL cache ++ // [S_v, S_v, H, n_rs_slots]; `ids` are the per-seq source slots; `rs_head` is the destination ++ // base slot. Eliminates the recurrent-state gather on the decode path. 
++ GGML_API struct ggml_tensor * ggml_gated_delta_net_inplace_ids( ++ struct ggml_context * ctx, ++ struct ggml_tensor * q, ++ struct ggml_tensor * k, ++ struct ggml_tensor * v, ++ struct ggml_tensor * g, ++ struct ggml_tensor * beta, ++ struct ggml_tensor * state, ++ struct ggml_tensor * state_dst, ++ struct ggml_tensor * ids, ++ int rs_head); ++ + // custom operators + + typedef void (*ggml_custom1_op_t)(struct ggml_tensor * dst , const struct ggml_tensor * a, int ith, int nth, void * userdata); +diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp +index 9457add..b6a1976 100644 +--- a/ggml/src/ggml-cpu/ops.cpp ++++ b/ggml/src/ggml-cpu/ops.cpp +@@ -10633,7 +10633,7 @@ static void ggml_compute_forward_gated_delta_net_one_chunk( + const int64_t K = ggml_get_op_params_i32(dst, 0); + GGML_ASSERT(K >= 1); + // per-seq stride in floats (seq s starts at state + s * seq_stride) +- const int64_t state_seq_stride = src_state->nb[3] / sizeof(float); ++ int64_t state_seq_stride = src_state->nb[3] / sizeof(float); + + const int64_t per_thread = S_v + (K > 1 ? S_v * S_v : 0); + const int ith = params->ith; +@@ -10654,6 +10654,26 @@ static void ggml_compute_forward_gated_delta_net_one_chunk( + + const float * state_in_base = (const float *)src_state->data; + ++ // Step 2: fused recurrent-state gather (ids == s_copy in src[7]). Read the prior state directly ++ // from the full cache at cache[ids[seq]] instead of from a materialized gather. For the identity ++ // decode case the prior state is the in-place destination block [rs_head, rs_head+n_seqs); ++ // otherwise the dispatcher has gathered cache[ids[seq]] into the (unused) output-state scratch ++ // region. Bit-identical to the get_rows path. 
++ ggml_tensor * src_ids = dst->src[7]; ++ if (src_ids != nullptr) { ++ const int64_t D = S_v * S_v * H; ++ const int32_t rs_head = ggml_get_op_params_i32(dst, 1); ++ const int32_t * ids = (const int32_t *) src_ids->data; ++ bool identity = true; ++ for (int64_t s = 0; s < n_seqs; ++s) { ++ if (ids[s] != rs_head + (int32_t) s) { identity = false; break; } ++ } ++ state_seq_stride = D; ++ state_in_base = identity ++ ? (const float *) src_state->data + (int64_t) rs_head * D ++ : (const float *) state_out_base; // gathered by the dispatcher (non-identity) ++ } ++ + //const int64_t rq1 = nev1 / neq1; + //const int64_t rk1 = nev1 / nek1; + const int64_t rq3 = nev3 / neq3; +@@ -10777,6 +10797,33 @@ static void ggml_compute_forward_gated_delta_net_f32( + + if (ith == 0) { + ggml_threadpool_chunk_set(params->threadpool, nth); ++ ++ // Step 2: non-identity ids fallback -- serially gather each sequence's prior state from ++ // cache[ids[seq]] into the (otherwise unused) output-state scratch region before the parallel ++ // recurrence, so the in-place write never aliases another sequence's read. 
++ ggml_tensor * src_ids = dst->src[7]; ++ if (src_ids != nullptr) { ++ const ggml_tensor * src_state = dst->src[5]; ++ const int64_t S_v = V->ne[0]; ++ const int64_t H = V->ne[1]; ++ const int64_t n_tokens = V->ne[2]; ++ const int64_t n_seqs = V->ne[3]; ++ const int64_t D = S_v * S_v * H; ++ const int32_t rs_head = ggml_get_op_params_i32(dst, 1); ++ const int32_t * ids = (const int32_t *) src_ids->data; ++ bool identity = true; ++ for (int64_t s = 0; s < n_seqs; ++s) { ++ if (ids[s] != rs_head + (int32_t) s) { identity = false; break; } ++ } ++ if (!identity) { ++ const int64_t attn_score_elems = S_v * H * n_tokens * n_seqs; ++ const float * cache = (const float *) src_state->data; ++ float * scratch = (float *) dst->data + attn_score_elems; ++ for (int64_t s = 0; s < n_seqs; ++s) { ++ memcpy(scratch + s * D, cache + (int64_t) ids[s] * D, D * sizeof(float)); ++ } ++ } ++ } + } + + ggml_barrier(params->threadpool); +diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu +index 61a2b91..86d5e2a 100644 +--- a/ggml/src/ggml-cuda/gated_delta_net.cu ++++ b/ggml/src/ggml-cuda/gated_delta_net.cu +@@ -1,6 +1,34 @@ + #include "gated_delta_net.cuh" + #include "ggml-cuda/common.cuh" + ++// Step 2: gather only the NON-identity sequences' prior recurrent state from the full cache into a ++// disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the ++// destination slot by the recurrence kernel and are skipped here. One block per sequence. 
++__global__ void gdn_gather_nonident_kernel(const float * cache, const int32_t * ids, int rs_head, ++ float * scratch, int64_t D, int n_seqs) { ++ const int s = blockIdx.x; ++ if (s >= n_seqs) { ++ return; ++ } ++ const int r = ids[s]; ++ if (r == rs_head + s) { ++ return; // identity: prior state already lives in the in-place destination slot ++ } ++ const float * src = cache + (int64_t) r * D; ++ float * dst = scratch + (int64_t) s * D; ++ for (int64_t i = threadIdx.x; i < D; i += blockDim.x) { ++ dst[i] = src[i]; ++ } ++} ++ ++static void ggml_cuda_gdn_gather_nonident(const float * cache, const int32_t * ids, int rs_head, ++ float * scratch, int64_t D, int64_t n_seqs, cudaStream_t stream) { ++ if (n_seqs <= 0) { ++ return; ++ } ++ gdn_gather_nonident_kernel<<<(unsigned) n_seqs, 256, 0, stream>>>(cache, ids, rs_head, scratch, D, (int) n_seqs); ++} ++ + template + __global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * 4, 2) + gated_delta_net_cuda(const float * q, +@@ -26,7 +54,9 @@ gated_delta_net_cuda(const float * q, + const uint3 rq3_magic, + float scale, + int K, +- float * state_dst) { ++ float * state_dst, ++ const int32_t * ids, ++ int rs_head) { + const uint32_t h_idx = blockIdx.x; + const uint32_t sequence = blockIdx.y; + // each warp owns one column, using warp-level primitives to reduce across rows +@@ -48,7 +78,15 @@ gated_delta_net_cuda(const float * q, + const int64_t state_in_offset = sequence * H * S_v * S_v + h_idx * S_v * S_v; + const int64_t state_out_offset = (sequence * H + h_idx) * S_v * S_v; + state += state_out_offset; +- curr_state += state_in_offset + col * S_v; ++ // Step 2: select the prior-state read base per sequence. For the ids variant, identity ++ // sequences (ids[seq] == rs_head + seq) read s0 directly from the in-place destination slot ++ // state_dst (no materialization); non-identity sequences read from the pre-gathered scratch ++ // (curr_state). 
state_in_offset == state_out_offset, so both bases use the same per-(seq,head) ++ // offset. The whole s0 is loaded into registers before the new state is written, so reading and ++ // writing the same slot per block (identity) is race-free. ++ const float * read_state = (ids != nullptr && ids[sequence] == rs_head + (int) sequence) ++ ? state_dst : curr_state; ++ read_state += state_in_offset + col * S_v; + attn_data += (sequence * n_tokens * H + h_idx) * S_v; + + constexpr int warp_size = ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v; +@@ -61,7 +99,7 @@ gated_delta_net_cuda(const float * q, + #pragma unroll + for (int r = 0; r < rows_per_lane; r++) { + const int i = r * warp_size + lane; +- s_shard[r] = curr_state[i]; ++ s_shard[r] = read_state[i]; + } + + for (int t = 0; t < n_tokens; t++) { +@@ -176,6 +214,7 @@ static void launch_gated_delta_net( + const float * q_d, const float * k_d, const float * v_d, + const float * g_d, const float * b_d, const float * s_d, + float * dst_d, float * state_dst_d, ++ const int32_t * ids_d, int rs_head, + int64_t S_v, int64_t H, int64_t n_tokens, int64_t n_seqs, + int64_t sq1, int64_t sq2, int64_t sq3, + int64_t sv1, int64_t sv2, int64_t sv3, +@@ -199,26 +238,26 @@ static void launch_gated_delta_net( + ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params, + q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, + n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, +- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d); ++ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head); + break; + case 32: + ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params, + q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, + n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, +- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d); ++ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head); + break; + case 64: { + 
ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params, + q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, + n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, +- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d); ++ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head); + break; + } + case 128: { + ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params, + q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, + n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, +- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d); ++ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head); + break; + } + default: +@@ -262,7 +301,6 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor * + const float * g_d = (const float *) src_g->data; + const float * b_d = (const float *) src_beta->data; + +- const float * s_d = (const float *) src_state->data; + float * dst_d = (float *) dst->data; + + float * state_dst_d = nullptr; +@@ -274,6 +312,29 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor * + state_dst_d = (float *) src_state_dst->data; + } + ++ // Step 2: fused recurrent-state gather (src[7] = ids == s_copy). Read the prior state directly ++ // from the full cache via ids instead of from a materialized ggml_get_rows gather. The recurrence ++ // kernel reads identity sequences (ids[seq] == rs_head + seq) in place from state_dst (no ++ // materialization at all); any non-identity sequence (reorder / rs_zero remap) is gathered here ++ // into a disjoint scratch that the kernel reads instead. The gather writes a disjoint buffer and ++ // the recurrence never reads a slot another block writes, so it is race-free and bit-identical to ++ // the get_rows path. ids stays a DEVICE pointer (dereferenced only inside the kernels). 
++ ggml_tensor * src_ids = dst->src[7]; ++ const float * s_d = (const float *) src_state->data; ++ const int32_t * ids_d = nullptr; ++ int rs_head = 0; ++ ggml_cuda_pool_alloc ids_state_scratch(ctx.pool()); ++ if (src_ids != nullptr) { ++ GGML_ASSERT(state_dst_d != nullptr); ++ GGML_ASSERT(src_ids->type == GGML_TYPE_I32); ++ rs_head = ggml_get_op_params_i32(dst, 1); ++ ids_d = (const int32_t *) src_ids->data; ++ const int64_t D = S_v * S_v * H; ++ float * scratch = ids_state_scratch.alloc((size_t) D * n_seqs); ++ ggml_cuda_gdn_gather_nonident(s_d, ids_d, rs_head, scratch, D, n_seqs, ctx.stream()); ++ s_d = scratch; ++ } ++ + GGML_ASSERT(ggml_is_contiguous_rows(src_q)); + GGML_ASSERT(ggml_is_contiguous_rows(src_k)); + GGML_ASSERT(ggml_is_contiguous_rows(src_v)); +@@ -307,21 +368,21 @@ void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor * + + if (kda) { + if (keep_rs) { +- launch_gated_delta_net(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ++ launch_gated_delta_net(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head, + S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, + sb1, sb2, sb3, neqk1, rq3, scale, K, stream); + } else { +- launch_gated_delta_net(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ++ launch_gated_delta_net(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head, + S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, + sb1, sb2, sb3, neqk1, rq3, scale, K, stream); + } + } else { + if (keep_rs) { +- launch_gated_delta_net(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ++ launch_gated_delta_net(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head, + S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, + sb1, sb2, sb3, neqk1, rq3, scale, K, stream); + } else { +- launch_gated_delta_net(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ++ launch_gated_delta_net(q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head, + S_v, H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, 
sv2, sv3, + sb1, sb2, sb3, neqk1, rq3, scale, K, stream); + } +diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c +index b8d34bf..1762037 100644 +--- a/ggml/src/ggml.c ++++ b/ggml/src/ggml.c +@@ -6353,6 +6353,82 @@ struct ggml_tensor * ggml_gated_delta_net_inplace( + return result; + } + ++// ggml_gated_delta_net_inplace_ids ++// ++// Same recurrence as ggml_gated_delta_net_inplace, but the prior recurrent state is read directly ++// from the FULL state cache `state` ([S_v, S_v, H, n_rs_slots]) at cache[ids[seq]] (mirroring ++// ggml_ssm_scan's ids source) instead of from a materialized ggml_get_rows gather. `rs_head` is the ++// destination base slot, used by the backend to detect the common identity case (ids[s] == rs_head ++// + s), where the prior state already lives in the in-place destination slots. ++struct ggml_tensor * ggml_gated_delta_net_inplace_ids( ++ struct ggml_context * ctx, ++ struct ggml_tensor * q, ++ struct ggml_tensor * k, ++ struct ggml_tensor * v, ++ struct ggml_tensor * g, ++ struct ggml_tensor * beta, ++ struct ggml_tensor * state, ++ struct ggml_tensor * state_dst, ++ struct ggml_tensor * ids, ++ int rs_head) { ++ GGML_ASSERT(ggml_is_contiguous_rows(q)); ++ GGML_ASSERT(ggml_is_contiguous_rows(k)); ++ GGML_ASSERT(ggml_is_contiguous_rows(v)); ++ GGML_ASSERT(ggml_is_contiguous(g)); ++ GGML_ASSERT(ggml_is_contiguous(beta)); ++ GGML_ASSERT(ggml_is_contiguous(state)); ++ ++ GGML_ASSERT(q->type == GGML_TYPE_F32); ++ GGML_ASSERT(k->type == GGML_TYPE_F32); ++ GGML_ASSERT(v->type == GGML_TYPE_F32); ++ GGML_ASSERT(g->type == GGML_TYPE_F32); ++ GGML_ASSERT(beta->type == GGML_TYPE_F32); ++ GGML_ASSERT(state->type == GGML_TYPE_F32); ++ GGML_ASSERT(state_dst != NULL && state_dst->type == GGML_TYPE_F32); ++ GGML_ASSERT(ids != NULL && ids->type == GGML_TYPE_I32); ++ ++ const int64_t S_v = v->ne[0]; ++ const int64_t H = v->ne[1]; ++ const int64_t n_tokens = v->ne[2]; ++ const int64_t n_seqs = v->ne[3]; ++ ++ GGML_ASSERT(g->ne[0] == 1 || g->ne[0] == S_v); ++ 
GGML_ASSERT(beta->ne[0] == 1); ++ ++ // state is the FULL recurrent-state cache: [S_v, S_v, H, n_rs_slots], n_rs_slots >= n_seqs ++ GGML_ASSERT(state->ne[0] == S_v); ++ GGML_ASSERT(state->ne[1] == S_v); ++ GGML_ASSERT(state->ne[2] == H); ++ GGML_ASSERT(state->ne[3] >= n_seqs); ++ ++ // state_dst holds the per-seq final state contiguously: [S_v*S_v*H, >= n_seqs] ++ GGML_ASSERT(state_dst->ne[0] == S_v * S_v * H); ++ GGML_ASSERT(state_dst->ne[1] >= n_seqs); ++ GGML_ASSERT(state_dst->nb[0] == sizeof(float)); ++ ++ // ids: per-seq source slot into the full cache (s_copy_main) ++ GGML_ASSERT(ids->ne[0] >= n_seqs); ++ ++ const int64_t state_rows = S_v * n_seqs; // K == 1 ++ const int64_t ne[4] = { S_v * H, n_tokens * n_seqs + state_rows, 1, 1 }; ++ struct ggml_tensor * result = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne); ++ ++ ggml_set_op_params_i32(result, 0, 1); // K == 1 ++ ggml_set_op_params_i32(result, 1, rs_head); // destination base slot (for the ids identity check) ++ ++ result->op = GGML_OP_GATED_DELTA_NET; ++ result->src[0] = q; ++ result->src[1] = k; ++ result->src[2] = v; ++ result->src[3] = g; ++ result->src[4] = beta; ++ result->src[5] = state; // FULL cache (read via ids) ++ result->src[6] = state_dst; // in-place final-state write-back target ++ result->src[7] = ids; // per-seq source slots (s_copy) ++ ++ return result; ++} ++ + //////////////////////////////////////////////////////////////////////////////// + + struct ggml_hash_set ggml_hash_set_new(size_t size) { +diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp +index 26a718b..194e611 100644 +--- a/src/models/delta-net-base.cpp ++++ b/src/models/delta-net-base.cpp +@@ -524,6 +524,69 @@ ggml_tensor * llm_build_delta_net_base::build_conv_state( + return conv_input; + } + ++// Step 2: gather-free recurrent attention. 
Mirrors mamba-base's get_ssm_rows pattern: the fused ++// gated-DeltaNet op reads each sequence's prior state directly from the full cache via the s_copy ++// ids (no ggml_get_rows materialization) and writes the new state in place (Step 1). The non-fused ++// and rollback paths fall back to materializing the prior state and delegating below. ++ggml_tensor * llm_build_delta_net_base::build_recurrent_attn( ++ llm_graph_input_rs * inp, ++ ggml_tensor * ssm_states_all, ++ ggml_tensor * q, ++ ggml_tensor * k, ++ ggml_tensor * v, ++ ggml_tensor * g, ++ ggml_tensor * b, ++ int il) { ++ const auto * mctx_cur = inp->mctx; ++ const auto kv_head = mctx_cur->get_head(); ++ ++ const int64_t S_v = v->ne[0]; ++ const int64_t H_v = v->ne[1]; ++ const int64_t n_seqs = v->ne[3]; ++ const int64_t n_seq_tokens = q->ne[2]; ++ ++ const bool keep = cparams.n_rs_seq > 0; ++ const bool fused = (n_seq_tokens == 1) ? cparams.fused_gdn_ar : cparams.fused_gdn_ch; ++ ++ if (!keep && fused) { ++ // build_rs feeds the FULL state cache + the s_copy ids into the op (via the get_state_rows ++ // lambda, exactly like mamba-base's ggml_ssm_scan) and still performs the rs_zero clear and ++ // the extra-states copy around it. The op reads curr_state from cache[ids[seq]] and writes ++ // the final state in place at kv_head; no recurrent-state materialization at all. 
++ auto get_state_op = [&](ggml_context * ctx, ggml_tensor * states, ggml_tensor * ids) -> ggml_tensor * { ++ ggml_tensor * cache4d = ggml_reshape_4d(ctx, states, S_v, S_v, H_v, states->ne[1]); ++ ggml_tensor * state_dst = ggml_view_2d(ctx, ssm_states_all, hparams.n_embd_s(), n_seqs, ++ ssm_states_all->nb[1], kv_head * hparams.n_embd_s() * ggml_element_size(ssm_states_all)); ++ return ggml_gated_delta_net_inplace_ids(ctx, q, k, v, g, b, cache4d, state_dst, ids, (int) kv_head); ++ }; ++ ++ ggml_tensor * result = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs, get_state_op); ++ if (n_seq_tokens == 1) { ++ cb(result, LLAMA_TENSOR_NAME_FGDN_AR, il); ++ } else { ++ cb(result, LLAMA_TENSOR_NAME_FGDN_CH, il); ++ } ++ ++ ggml_tensor * output = ggml_view_4d(ctx0, result, ++ S_v, H_v, n_seq_tokens, n_seqs, ++ ggml_row_size(result->type, S_v), ++ ggml_row_size(result->type, S_v * H_v), ++ ggml_row_size(result->type, S_v * H_v * n_seq_tokens), 0); ++ cb(output, "attn_output", il); ++ ++ // the state write is a side effect of the op; pull the op into the graph via the output ++ ggml_build_forward_expand(gf, output); ++ ++ return output; ++ } ++ ++ // non-fused / rollback: materialize the prior state via gather and delegate to the ++ // state-taking overload (its fused !keep branch performs the Step-1 in-place write). ++ ggml_tensor * s = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs); ++ s = ggml_reshape_4d(ctx0, s, S_v, S_v, H_v, n_seqs); ++ return build_recurrent_attn(inp, ssm_states_all, q, k, v, g, b, s, il); ++} ++ + ggml_tensor * llm_build_delta_net_base::build_recurrent_attn( + llm_graph_input_rs * inp, + ggml_tensor * ssm_states_all, +diff --git a/src/models/models.h b/src/models/models.h +index 2ac8415..98b89e9 100644 +--- a/src/models/models.h ++++ b/src/models/models.h +@@ -88,6 +88,19 @@ struct llm_build_delta_net_base : public llm_graph_context { + ggml_tensor * b, + ggml_tensor * s, + int il); ++ ++ // Step 2: gather-free variant. 
Reads the prior recurrent state directly from the full cache via ++ // the s_copy ids (no ggml_get_rows materialization) on the fused decode/prefill path, and ++ // delegates to the state-taking overload for the non-fused and rollback paths. ++ ggml_tensor * build_recurrent_attn( ++ llm_graph_input_rs * inp, ++ ggml_tensor * ssm_states_all, ++ ggml_tensor * q, ++ ggml_tensor * k, ++ ggml_tensor * v, ++ ggml_tensor * g, ++ ggml_tensor * b, ++ int il); + }; + + struct llm_build_rwkv6_base : public llm_graph_context { +diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp +index 6783d98..0be3247 100644 +--- a/src/models/qwen35.cpp ++++ b/src/models/qwen35.cpp +@@ -385,10 +385,6 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear( + + ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il); + +- ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs); +- state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs); +- cb(state, "state_predelta", il); +- + ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel); + cb(conv_output_proper, "conv_output_raw", il); + +@@ -445,7 +441,7 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear( + cb(k_conv, "k_conv_predelta", il); + cb(v_conv, "v_conv_predelta", il); + +- ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, state, il); ++ ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, il); + + // z: [head_dim, n_heads, n_tokens, n_seqs] -> [n_heads * n_tokens * n_seqs, head_dim] + ggml_tensor * z_2d = ggml_reshape_4d(ctx0, z, head_v_dim, num_v_heads, n_seq_tokens, n_seqs); +diff --git a/src/models/qwen35moe.cpp b/src/models/qwen35moe.cpp +index eb5e9a4..2995f04 100644 +--- a/src/models/qwen35moe.cpp ++++ b/src/models/qwen35moe.cpp +@@ -409,10 +409,6 @@ ggml_tensor * 
llama_model_qwen35moe::graph::build_layer_attn_linear( + + ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il); + +- ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs); +- state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs); +- cb(state, "state_predelta", il); +- + ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel); + cb(conv_output_proper, "conv_output_raw", il); + +@@ -469,7 +465,7 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear( + cb(k_conv, "k_conv_predelta", il); + cb(v_conv, "v_conv_predelta", il); + +- ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, state, il); ++ ggml_tensor * output = build_recurrent_attn(inp, ssm_states_all, q_conv, k_conv, v_conv, gate, beta, il); + + // z: [head_dim, n_heads, n_tokens, n_seqs] -> [n_heads * n_tokens * n_seqs, head_dim] + ggml_tensor * z_2d = ggml_reshape_4d(ctx0, z, head_v_dim, num_v_heads, n_seq_tokens, n_seqs); +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0020-qwen35-gdn-oproj-mmq-reshape.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0020-qwen35-gdn-oproj-mmq-reshape.patch new file mode 100644 index 000000000000..67333913c871 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0020-qwen35-gdn-oproj-mmq-reshape.patch @@ -0,0 +1,140 @@ +From df1cc97b68df048834ab735c944b71c3a2e8737e Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Thu, 25 Jun 2026 12:40:49 +0200 +Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet o_proj MMVQ->MMQ reshape + (patch 0020) + +Lever 1, the single biggest decode-parity lever for the Qwen3.6 hybrid-SSM +models (arch qwen35: 48 gated-DeltaNet + 16 full-attention layers). 
Post-SSM +(patches 0018 + 0019) dense decode sat at 255 t/s = 65% of vLLM 391; profiling +both engines pinned the largest llama-specific overage to the gated-DeltaNet +OUTPUT projection (ssm_out). + +The GDN op left its output in SSM layout and the graph reshaped it to 3D +[value_dim, n_seq_tokens=1, n_seqs=128] before the ssm_out matmul, so +src1->ne[1]=1. That trips the ggml-cuda MMVQ dispatch (ne[1] <= 8) with the 128 +sequences stuck in ne[2]; MMVQ is built for batch <= 8 and does not amortize the +ssm_out weight read across the 128 sequences (one 5120x128 grid, 48 calls/step, +the 40%-vs-62% GPU-utilization gap). vLLM packs the same projection into one +M=128 GEMM. The in-projection was already 2D -> MMQ; only the output was 3D. + +The fix collapses the GDN output to 2D [value_dim, n_seq_tokens * n_seqs] +(= [6144, 128] at decode) before the ssm_out ggml_mul_mat, so src1->ne[1]=128 +routes to the MMQ M=128 tensor-core GEMM (which amortizes the weight read across +all 128 tokens). The result is then already 2D, so the redundant post-matmul +reshape_2d is dropped. Same contiguous data, just a 2D vs 3D view: bit-identical. +Gated to the gated-DeltaNet path (qwen35 / qwen35moe / qwen3next); other archs +untouched. + +Bit-identical greedy (--temp 0 --seed 1) vs the post-SSM baseline on both +q36-27b-nvfp4 (dense) and q36-35b-a3b-nvfp4 (MoE), byte/md5-identical. +test-backend-ops MUL_MAT and MUL_MAT_ID OK. + +decode_agg S_TG (llama-batched-bench, -fa on, npp128 ntg128, npl 32/128): + dense q36-27b: 170.52 / 254.92 -> 200.00 / 335.80 t/s (+17.3% / +31.7%) + MoE q36-35b-a3b: 373.28 / 560.66 -> 420.77 / 691.24 t/s (+12.7% / +23.3%) +Dense @128 = 335.80 t/s = 85.9% of vLLM 391 (up from 65%; target 82-85% hit). + +nsys: the o_proj mul_mat_vec_q bucket (132.8 ms / 48 inst) collapses +to zero; mul_mat_q absorbs it (+1200 inst, +363 ms) with a LOWER +per-call average (620.8 -> 582.7 us). Realized o_proj-as-MMQ cost ~0.30 ms/call +vs 2.77 ms/call for the old GEMV. 
+ +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + src/models/qwen35.cpp | 13 ++++--- + src/models/qwen35moe.cpp | 13 ++++--- + src/models/qwen3next.cpp | 13 ++++--- + 3 files changed, 21 insertions(+), 18 deletions(-) + +diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp +index 0be3247..0874c43 100644 +--- a/src/models/qwen35.cpp ++++ b/src/models/qwen35.cpp +@@ -449,17 +449,18 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear( + // Apply gated normalization: self.norm(core_attn_out, z) + ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il); + +- // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim] +- ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs); ++ // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the ++ // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior ++ // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ ++ // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous ++ // data, just a 2D vs 3D view, so the result is bit-identical. 
++ ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs); + cb(final_output, "final_output", il); + +- // Output projection ++ // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs]) + cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s); + cb(cur, "linear_attn_out", il); + +- // Reshape back to original dimensions +- cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs); +- + return cur; + } + +diff --git a/src/models/qwen35moe.cpp b/src/models/qwen35moe.cpp +index 2995f04..1f6f643 100644 +--- a/src/models/qwen35moe.cpp ++++ b/src/models/qwen35moe.cpp +@@ -473,17 +473,18 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear( + // Apply gated normalization: self.norm(core_attn_out, z) + ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il); + +- // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim] +- ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs); ++ // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the ++ // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior ++ // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ ++ // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous ++ // data, just a 2D vs 3D view, so the result is bit-identical. 
++ ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs); + cb(final_output, "final_output", il); + +- // Output projection ++ // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs]) + cur = build_lora_mm(model.layers[il].ssm_out, final_output, model.layers[il].ssm_out_s); + cb(cur, "linear_attn_out", il); + +- // Reshape back to original dimensions +- cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs); +- + return cur; + } + +diff --git a/src/models/qwen3next.cpp b/src/models/qwen3next.cpp +index 97200a4..bfdf026 100644 +--- a/src/models/qwen3next.cpp ++++ b/src/models/qwen3next.cpp +@@ -519,17 +519,18 @@ ggml_tensor * llama_model_qwen3next::graph::build_layer_attn_linear( + // Apply gated normalization: self.norm(core_attn_out, z) + ggml_tensor * attn_out_norm = build_norm_gated(output, model.layers[il].ssm_norm, z_2d, il); + +- // Final reshape: [head_dim, n_heads, n_tokens, n_seqs] -> [n_tokens, n_seqs, n_heads * head_dim] +- ggml_tensor * final_output = ggml_reshape_3d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens, n_seqs); ++ // Lever 1: collapse the gated-DeltaNet output to 2D [value_dim, n_seq_tokens * n_seqs] so the ++ // ssm_out projection runs as an M = n_seq_tokens*n_seqs MMQ tensor-core GEMM. The prior ++ // reshape_3d to [value_dim, 1, n_seqs] left src1->ne[1]=1, routing decode to the batch-1 MMVQ ++ // GEMV which does not amortize the ssm_out weight read across the sequences. Same contiguous ++ // data, just a 2D vs 3D view, so the result is bit-identical. 
++ ggml_tensor * final_output = ggml_reshape_2d(ctx0, attn_out_norm, head_v_dim * num_v_heads, n_seq_tokens * n_seqs); + cb(final_output, "final_output", il); + +- // Output projection ++ // Output projection (output is already 2D [n_embd, n_seq_tokens * n_seqs]) + cur = build_lora_mm(model.layers[il].ssm_out, final_output); + cb(cur, "linear_attn_out", il); + +- // Reshape back to original dimensions +- cur = ggml_reshape_2d(ctx0, cur, n_embd, n_seq_tokens * n_seqs); +- + return cur; + } + +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0021-qwen35-conv-state-inplace-fusion.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0021-qwen35-conv-state-inplace-fusion.patch new file mode 100644 index 000000000000..f61183cde591 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0021-qwen35-conv-state-inplace-fusion.patch @@ -0,0 +1,655 @@ +From 58426b58aaf5431a59d499d513b2fe2d6ab990d8 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Thu, 25 Jun 2026 18:55:54 +0200 +Subject: [PATCH] feat(paged): qwen35 decode conv-state in-place fusion (patch + 0021) + +The no-regret bit-exact conv-state cleanup from the GDN recurrence byte-gate +design (point 3). After the recurrence verdict (NO-BUILD: the gated-DeltaNet +recurrence is already single-pass at the f32 byte floor), the decode conv path +was the only remaining bit-exact lever. + +New fused op ggml_ssm_conv_update_inplace (reuses GGML_OP_SSM_CONV, discriminated +by a non-null src[3]). 
On the single-token decode path it replaces the four-op +conv chain - qkv transpose + ggml_concat (concat_cont) + ggml_ssm_conv + ggml_silu ++ ggml_cpy of the shifted ring state (cpy_scalar) - with one kernel that, per +(channel, sequence), assembles the width-K window in registers from the K-1 cached +taps plus the current qkv_mixed token, computes the depthwise conv with the SAME +ascending-tap FMA order as ssm_conv_f32 at i==0, folds silu, writes the conv +output, and writes the 1-token-shifted ring state back IN PLACE into the conv +cache slot at kv_head. This is vLLM causal_conv1d_update; it mirrors the 0018 +in-place write-back and 0019 patterns. Read source (the build_rs tap gather) and +write target (the cache view) are disjoint buffers, so it is race-free by +construction with no ids/identity logic. + +- ggml.h/ggml.c: builder (src0=conv_states [K-1,ch,n_seqs], src1=conv_kernel, + src2=x_cur [ch,1,n_seqs], src3=conv_state_dst [(K-1)*ch,n_seqs] in-place ring; + op_params[0]=fuse_silu) +- ggml-cuda/ssm-conv.cu: ssm_conv_update_f32 kernel + + ggml_cuda_op_ssm_conv_update + src[3]-discriminated branch in ggml_cuda_op_ssm_conv +- ggml-cpu/ops.cpp: ggml_compute_forward_ssm_conv_update_f32 (threads over channels) + + branch in ggml_compute_forward_ssm_conv +- delta-net-base.cpp/models.h: build_conv_state_fused (keeps the cheap build_rs + conv-tap gather; fuses conv+silu+shifted write-back) +- qwen35.cpp, qwen35moe.cpp, qwen3next.cpp: route the single-token decode path + (n_seq_tokens==1 && n_rs_seq==0 && fused_gdn_ar); prefill/chunked/rollback keep + the original chain +- tests/test-backend-ops.cpp: test_ssm_conv_update (16 cases) vs the CPU reference + +test-backend-ops: SSM_CONV 45/45, SSM_CONV_UPDATE 16/16, SSM_CONV_BIAS_SILU 90/90. + +Greedy (--temp 0 --seed 1 --ignore-eos -n 256) byte-identical to the Lever-1 +(0019/0020) baseline: q36-27b-nvfp4 md5 675cd522..., q36-35b-a3b-nvfp4 md5 +ac163882... both BYTE-IDENTICAL. 
+ +decode_agg S_TG (npp128 ntg128, -fa on, CUDA-graph), same session: + dense q36-27b-nvfp4 : npl 32 199.76 -> 202.99 (+1.6%) + npl 128 336.35 -> 347.14 (+3.2%, 86.0 -> 88.8 percent of vLLM 391) + MoE q36-35b-a3b : npl 32 421.72 -> 432.39 (+2.5%) + npl 128 689.74 -> 713.54 (+3.5%) +Lift holds in eager too (dense npl128 333.62 -> 342.97). Step -11.9 ms/step +(dense npl128: 380.6 -> 368.7). nsys eager decode: concat_cont (1152 calls) and the +decode cpy_scalar GONE; ssm_conv_f32 at decode replaced by ssm_conv_update (1152); +conv-path ~20.9 -> ~7.6 ms/step. Bit-exact, no regression, de-risks the bf16-state +conv-cache plumbing. + +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + ggml/include/ggml.h | 16 +++++ + ggml/src/ggml-cpu/ops.cpp | 73 ++++++++++++++++++++- + ggml/src/ggml-cuda/ssm-conv.cu | 112 +++++++++++++++++++++++++++++++++ + ggml/src/ggml.c | 54 ++++++++++++++++ + src/models/delta-net-base.cpp | 51 +++++++++++++++ + src/models/models.h | 14 +++++ + src/models/qwen35.cpp | 23 +++++-- + src/models/qwen35moe.cpp | 23 +++++-- + src/models/qwen3next.cpp | 29 ++++++--- + tests/test-backend-ops.cpp | 47 ++++++++++++++ + 10 files changed, 420 insertions(+), 22 deletions(-) + +diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h +index 951dd21..76fa401 100644 +--- a/ggml/include/ggml.h ++++ b/ggml/include/ggml.h +@@ -2447,6 +2447,22 @@ extern "C" { + struct ggml_tensor * sx, + struct ggml_tensor * c); + ++ // Fused decode-time depthwise causal conv1d update (mirrors vLLM causal_conv1d_update). 
Assembles ++ // the width-K conv window in registers from the cached K-1 taps (`conv_states`, [K-1, channels, ++ // n_seqs]) plus the single current token (`x_cur`, [channels, 1, n_seqs]), computes the depthwise ++ // conv with the SAME ascending-tap FMA order as ggml_ssm_conv, optionally folds SiLU, and writes ++ // the 1-token-shifted ring state back IN PLACE into `conv_state_dst` (a [(K-1)*channels, n_seqs] ++ // view into the conv-state cache). This eliminates the concat + transpose + scalar copy-back + ++ // separate silu of the decode conv path. Output: [channels, 1, n_seqs]. Reuses GGML_OP_SSM_CONV; ++ // detected by the backends via a non-null src[3]. n_seq_tokens must be 1 (single-token decode). ++ GGML_API struct ggml_tensor * ggml_ssm_conv_update_inplace( ++ struct ggml_context * ctx, ++ struct ggml_tensor * conv_states, ++ struct ggml_tensor * conv_kernel, ++ struct ggml_tensor * x_cur, ++ struct ggml_tensor * conv_state_dst, ++ bool fuse_silu); ++ + GGML_API struct ggml_tensor * ggml_ssm_scan( + struct ggml_context * ctx, + struct ggml_tensor * s, +diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp +index b6a1976..f9cd850 100644 +--- a/ggml/src/ggml-cpu/ops.cpp ++++ b/ggml/src/ggml-cpu/ops.cpp +@@ -9463,13 +9463,84 @@ static void ggml_compute_forward_ssm_conv_f32( + } + } + ++// Fused decode-time depthwise causal conv1d update (mirror of the CUDA ssm_conv_update_f32). Reads the ++// K-1 cached taps (src[0]) and the single new token (src[2]), computes the depthwise conv with the same ++// ascending-tap FMA order as ggml_compute_forward_ssm_conv_f32, optionally folds silu, writes the conv ++// output to dst, and writes the 1-token-shifted ring state back in place into src[3]. Threads split ++// over channels. 
++static void ggml_compute_forward_ssm_conv_update_f32( ++ const ggml_compute_params * params, ++ ggml_tensor * dst) { ++ const ggml_tensor * conv_states = dst->src[0]; // [K-1, channels, n_seqs] ++ const ggml_tensor * conv_kernel = dst->src[1]; // [K, channels] ++ const ggml_tensor * x_cur = dst->src[2]; // [channels, 1, n_seqs] ++ ggml_tensor * cdst = dst->src[3]; // [(K-1)*channels, n_seqs] in-place ring target ++ ++ const int ith = params->ith; ++ const int nth = params->nth; ++ ++ const int64_t d_conv = conv_kernel->ne[0]; ++ const int64_t channels = conv_kernel->ne[1]; ++ const int64_t n_seqs = conv_states->ne[2]; ++ const bool apply_silu = ggml_get_op_params_i32(dst, 0) != 0; ++ ++ GGML_ASSERT(conv_states->nb[0] == sizeof(float)); ++ GGML_ASSERT(conv_kernel->nb[0] == sizeof(float)); ++ ++ const int64_t states_seq_stride = conv_states->nb[2] / sizeof(float); ++ const int64_t states_ch_stride = conv_states->nb[1] / sizeof(float); ++ const int64_t w_stride = conv_kernel->nb[1] / sizeof(float); ++ const int64_t x_seq_stride = x_cur->nb[2] / sizeof(float); ++ const int64_t dst_seq_stride = dst->nb[2] / sizeof(float); ++ const int64_t cdst_seq_stride = cdst->nb[1] / sizeof(float); ++ ++ const float * states_base = (const float *) conv_states->data; ++ const float * w_base = (const float *) conv_kernel->data; ++ const float * x_base = (const float *) x_cur->data; ++ float * cdst_base = (float *) cdst->data; ++ float * dst_base = (float *) dst->data; ++ ++ const int64_t dc = (channels + nth - 1) / nth; ++ const int64_t c0 = dc * ith; ++ const int64_t c1 = MIN(c0 + dc, channels); ++ ++ for (int64_t s = 0; s < n_seqs; ++s) { ++ for (int64_t c = c0; c < c1; ++c) { ++ const float * states_c = states_base + s * states_seq_stride + c * states_ch_stride; ++ const float * w_c = w_base + c * w_stride; ++ const float xc = x_base[s * x_seq_stride + c]; ++ ++ // ascending-tap FMA: tap0*w0 + ... 
+ tap_{K-2}*w_{K-2} + xc*w_{K-1} (matches ssm_conv) ++ float sumf = 0.0f; ++ for (int64_t j = 0; j < d_conv - 1; ++j) { ++ sumf += states_c[j] * w_c[j]; ++ } ++ sumf += xc * w_c[d_conv - 1]; ++ sumf += 0.0f; // matches ssm_conv `sumf += b` with b == 0 ++ ++ dst_base[s * dst_seq_stride + c] = apply_silu ? (sumf / (1.0f + expf(-sumf))) : sumf; ++ ++ // 1-token-shifted ring write-back: [tap1 .. tap_{K-2}, xc] ++ float * out_state = cdst_base + s * cdst_seq_stride + c * (d_conv - 1); ++ for (int64_t j = 0; j < d_conv - 2; ++j) { ++ out_state[j] = states_c[j + 1]; ++ } ++ out_state[d_conv - 2] = xc; ++ } ++ } ++} ++ + void ggml_compute_forward_ssm_conv( + const ggml_compute_params * params, + ggml_tensor * dst) { + switch (dst->src[0]->type) { + case GGML_TYPE_F32: + { +- ggml_compute_forward_ssm_conv_f32(params, dst); ++ if (dst->src[3] != nullptr) { ++ ggml_compute_forward_ssm_conv_update_f32(params, dst); ++ } else { ++ ggml_compute_forward_ssm_conv_f32(params, dst); ++ } + } break; + default: + { +diff --git a/ggml/src/ggml-cuda/ssm-conv.cu b/ggml/src/ggml-cuda/ssm-conv.cu +index 1463169..e1af1cd 100644 +--- a/ggml/src/ggml-cuda/ssm-conv.cu ++++ b/ggml/src/ggml-cuda/ssm-conv.cu +@@ -123,6 +123,109 @@ static __global__ void ssm_conv_long_token_f32(const float * __restrict__ src0, + } + } + ++// Fused decode-time depthwise causal conv1d update (one new token). Each thread owns one channel of ++// one sequence: it assembles the width-d_conv window from the K-1 cached taps (conv_states) plus the ++// current token (x_cur), computes the depthwise conv with the SAME ascending-tap FMA order as ++// ssm_conv_f32 at i==0, optionally folds silu, writes the conv output, and writes the 1-token-shifted ++// ring state back in place into conv_state_dst. Bit-identical to ssm_conv(concat) + silu + copy-back. 
++template <bool apply_silu, int d_conv> ++static __global__ void ssm_conv_update_f32(const float * __restrict__ conv_states, ++ const float * __restrict__ conv_kernel, ++ const float * __restrict__ x_cur, ++ float * __restrict__ conv_state_dst, ++ float * __restrict__ dst, ++ const int channels, ++ const int states_seq_stride, ++ const int w_stride, ++ const int x_seq_stride, ++ const int dst_seq_stride, ++ const int cdst_seq_stride) { ++ const int c = blockIdx.x * blockDim.x + threadIdx.x; // channel ++ const int s = blockIdx.y; // sequence ++ if (c >= channels) { ++ return; ++ } ++ ++ const float * states_c = conv_states + (int64_t) s * states_seq_stride + (int64_t) c * (d_conv - 1); ++ const float * w_c = conv_kernel + (int64_t) c * w_stride; ++ const float xc = x_cur[(int64_t) s * x_seq_stride + c]; ++ ++ // window = [tap0 .. tap_{K-2}, current-token], same ordering as the concat(conv_states, x) window ++ float window[d_conv]; ++#pragma unroll ++ for (int j = 0; j < d_conv - 1; j++) { ++ window[j] = states_c[j]; ++ } ++ window[d_conv - 1] = xc; ++ ++ float sumf = 0.0f; ++#pragma unroll ++ for (int j = 0; j < d_conv; j++) { ++ sumf += window[j] * w_c[j]; ++ } ++ sumf += 0.0f; // matches ssm_conv_f32 `sumf += b` with b == 0 (qwen35 conv1d has no bias) ++ dst[(int64_t) s * dst_seq_stride + c] = apply_silu ?
ggml_cuda_op_silu_single(sumf) : sumf; ++ ++ // 1-token-shifted ring write-back: drop the oldest tap, append the current token ++ float * out_state = conv_state_dst + (int64_t) s * cdst_seq_stride + (int64_t) c * (d_conv - 1); ++#pragma unroll ++ for (int j = 0; j < d_conv - 1; j++) { ++ out_state[j] = window[j + 1]; ++ } ++} ++ ++static void ggml_cuda_op_ssm_conv_update(ggml_backend_cuda_context & ctx, ggml_tensor * dst) { ++ const ggml_tensor * conv_states = dst->src[0]; // [K-1, channels, n_seqs] ++ const ggml_tensor * conv_kernel = dst->src[1]; // [K, channels] ++ const ggml_tensor * x_cur = dst->src[2]; // [channels, 1, n_seqs] ++ const ggml_tensor * cdst = dst->src[3]; // [(K-1)*channels, n_seqs] in-place ring target ++ ++ const int64_t d_conv = conv_kernel->ne[0]; ++ const int64_t channels = conv_kernel->ne[1]; ++ const int64_t n_seqs = conv_states->ne[2]; ++ const bool apply_silu = ggml_get_op_params_i32(dst, 0) != 0; ++ ++ GGML_ASSERT(conv_states->type == GGML_TYPE_F32 && conv_kernel->type == GGML_TYPE_F32); ++ GGML_ASSERT(x_cur->type == GGML_TYPE_F32 && cdst->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32); ++ GGML_ASSERT(conv_states->nb[0] == sizeof(float)); ++ GGML_ASSERT(conv_states->nb[1] == (size_t) (d_conv - 1) * sizeof(float)); ++ GGML_ASSERT(conv_kernel->nb[0] == sizeof(float)); ++ GGML_ASSERT(dst->ne[0] == channels && dst->ne[1] == 1 && dst->ne[2] == n_seqs); ++ ++ const float * states_d = (const float *) conv_states->data; ++ const float * w_d = (const float *) conv_kernel->data; ++ const float * x_d = (const float *) x_cur->data; ++ float * cdst_d = (float *) cdst->data; ++ float * dst_d = (float *) dst->data; ++ cudaStream_t stream = ctx.stream(); ++ ++ const int states_seq_stride = (int) (conv_states->nb[2] / sizeof(float)); ++ const int w_stride = (int) (conv_kernel->nb[1] / sizeof(float)); ++ const int x_seq_stride = (int) (x_cur->nb[2] / sizeof(float)); ++ const int dst_seq_stride = (int) (dst->nb[2] / sizeof(float)); ++ const int 
cdst_seq_stride = (int) (cdst->nb[1] / sizeof(float)); ++ ++ const int threads = 128; ++ const dim3 blocks((channels + threads - 1) / threads, (unsigned) n_seqs, 1); ++ ++ auto launch = [&](auto NC) { ++ constexpr int kNC = decltype(NC)::value; ++ if (apply_silu) { ++ ssm_conv_update_f32<true, kNC><<<blocks, threads, 0, stream>>>(states_d, w_d, x_d, cdst_d, dst_d, ++ (int) channels, states_seq_stride, w_stride, x_seq_stride, dst_seq_stride, cdst_seq_stride); ++ } else { ++ ssm_conv_update_f32<false, kNC><<<blocks, threads, 0, stream>>>(states_d, w_d, x_d, cdst_d, dst_d, ++ (int) channels, states_seq_stride, w_stride, x_seq_stride, dst_seq_stride, cdst_seq_stride); ++ } ++ }; ++ ++ switch (d_conv) { ++ case 3: launch(std::integral_constant<int, 3>{}); break; ++ case 4: launch(std::integral_constant<int, 4>{}); break; ++ default: GGML_ABORT("ssm_conv_update only supports d_conv 3 or 4"); ++ } ++} ++ + template <bool apply_silu> + static void ssm_conv_f32_cuda(const float * src0, const float * src1, const float * bias, const int src0_nb0, const int src0_nb1, + const int src0_nb2, const int src1_nb1, float * dst, const int dst_nb0, const int dst_nb1, +@@ -158,6 +261,15 @@ static void ssm_conv_f32_cuda(const float * src0, const float * src1, const floa + } + + void ggml_cuda_op_ssm_conv(ggml_backend_cuda_context & ctx, ggml_tensor * dst, ggml_tensor * bias_add_node, ggml_tensor * silu_dst) { ++ // Fused decode conv-update-in-place variant (ggml_ssm_conv_update_inplace): discriminated by a ++ // non-null src[3] (the in-place ring write-back target). It folds the concat/transpose/copy-back/ ++ // silu of the decode conv path into a single kernel.
++ if (dst->src[3] != nullptr) { ++ GGML_ASSERT(bias_add_node == nullptr && silu_dst == nullptr); ++ ggml_cuda_op_ssm_conv_update(ctx, dst); ++ return; ++ } ++ + const struct ggml_tensor * src0 = dst->src[0]; // conv_x + const struct ggml_tensor * src1 = dst->src[1]; // conv1d.weight + const bool fuse_bias = bias_add_node != nullptr; +diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c +index 1762037..b777748 100644 +--- a/ggml/src/ggml.c ++++ b/ggml/src/ggml.c +@@ -5555,6 +5555,60 @@ struct ggml_tensor * ggml_ssm_conv( + return result; + } + ++// ggml_ssm_conv_update_inplace ++// ++// Fused decode-time depthwise causal conv1d update. Reuses GGML_OP_SSM_CONV but is discriminated by a ++// non-null src[3]. The op reads each channel's K-1 cached taps from `conv_states` and the single new ++// token from `x_cur`, computes the depthwise conv (ascending-tap FMA, bit-identical to ggml_ssm_conv), ++// optionally folds SiLU, writes the conv output to dst ([channels, 1, n_seqs]) and writes the ++// 1-token-shifted ring state back in place into `conv_state_dst` (the active sequences' conv-cache ++// slot). op_params[0] carries the fuse_silu flag. Mirrors the 0018/0019 in-place state pattern. 
++struct ggml_tensor * ggml_ssm_conv_update_inplace( ++ struct ggml_context * ctx, ++ struct ggml_tensor * conv_states, ++ struct ggml_tensor * conv_kernel, ++ struct ggml_tensor * x_cur, ++ struct ggml_tensor * conv_state_dst, ++ bool fuse_silu) { ++ GGML_ASSERT(ggml_is_3d(conv_states)); ++ GGML_ASSERT(ggml_is_matrix(conv_kernel)); ++ GGML_ASSERT(ggml_is_3d(x_cur)); ++ ++ const int64_t d_conv = conv_kernel->ne[0]; ++ const int64_t channels = conv_kernel->ne[1]; ++ const int64_t n_seqs = conv_states->ne[2]; ++ ++ GGML_ASSERT(conv_states->type == GGML_TYPE_F32); ++ GGML_ASSERT(conv_kernel->type == GGML_TYPE_F32); ++ GGML_ASSERT(x_cur->type == GGML_TYPE_F32); ++ GGML_ASSERT(conv_state_dst != NULL && conv_state_dst->type == GGML_TYPE_F32); ++ ++ // conv_states: [K-1, channels, n_seqs], contiguous taps per channel ++ GGML_ASSERT(conv_states->ne[0] == d_conv - 1); ++ GGML_ASSERT(conv_states->ne[1] == channels); ++ GGML_ASSERT(conv_states->nb[0] == sizeof(float)); ++ // x_cur: single decode token per sequence ++ GGML_ASSERT(x_cur->ne[0] == channels); ++ GGML_ASSERT(x_cur->ne[1] == 1); ++ GGML_ASSERT(x_cur->ne[2] == n_seqs); ++ // conv_state_dst: [(K-1)*channels, n_seqs] in-place ring write target ++ GGML_ASSERT(conv_state_dst->ne[0] == (d_conv - 1) * channels); ++ GGML_ASSERT(conv_state_dst->ne[1] >= n_seqs); ++ GGML_ASSERT(conv_state_dst->nb[0] == sizeof(float)); ++ ++ struct ggml_tensor * result = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, channels, 1, n_seqs); ++ ++ ggml_set_op_params_i32(result, 0, fuse_silu ? 
1 : 0); ++ ++ result->op = GGML_OP_SSM_CONV; ++ result->src[0] = conv_states; ++ result->src[1] = conv_kernel; ++ result->src[2] = x_cur; ++ result->src[3] = conv_state_dst; ++ ++ return result; ++} ++ + // ggml_ssm_scan + + struct ggml_tensor * ggml_ssm_scan( +diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp +index 194e611..0eee804 100644 +--- a/src/models/delta-net-base.cpp ++++ b/src/models/delta-net-base.cpp +@@ -524,6 +524,57 @@ ggml_tensor * llm_build_delta_net_base::build_conv_state( + return conv_input; + } + ++// Fused decode conv path (patch 0021). Reads the active sequences' prior conv-state taps (the same ++// cheap build_rs gather as build_conv_state), then fuses the depthwise conv + silu + the 1-token- ++// shifted ring write-back into a single ggml_ssm_conv_update_inplace op. This removes the concat ++// (concat_cont), the transpose materialization, the scalar copy-back (cpy_scalar) and the separate ++// silu of the decode conv path. The op reads from the (disjoint) materialized taps and writes the ++// new ring state in place into the cache slot at kv_head -- exactly the slot the baseline ggml_cpy ++// wrote -- so it is bit-identical to build_conv_state + ggml_ssm_conv + ggml_silu. ++ggml_tensor * llm_build_delta_net_base::build_conv_state_fused( ++ llm_graph_input_rs * inp, ++ ggml_tensor * conv_states_all, ++ ggml_tensor * qkv_mixed, ++ ggml_tensor * conv_kernel, ++ int64_t conv_kernel_size, ++ int64_t conv_channels, ++ int il) { ++ const auto * mctx_cur = inp->mctx; ++ const auto kv_head = mctx_cur->get_head(); ++ ++ const int64_t n_seqs = ubatch.n_seqs; ++ const int64_t n_seq_tokens = ubatch.n_seq_tokens; ++ ++ GGML_ASSERT(n_seq_tokens == 1); // single-token decode only ++ GGML_ASSERT(cparams.n_rs_seq == 0); // no rollback splits on this path ++ ++ // Prior conv-state taps for the active sequences: [K-1, conv_channels, n_seqs]. 
Same get_rows ++ // gather as the baseline build_conv_state read (tiny; not one of the eliminated buckets). ++ ggml_tensor * conv_states = build_rs(inp, conv_states_all, hparams.n_embd_r(), n_seqs); ++ conv_states = ggml_reshape_3d(ctx0, conv_states, conv_kernel_size - 1, conv_channels, n_seqs); ++ cb(conv_states, "conv_states_reshaped", il); ++ ++ // Current token, native (non-transposed) qkv_mixed: [conv_channels, 1, n_seqs]. ++ ggml_tensor * x_cur = ggml_reshape_3d(ctx0, qkv_mixed, conv_channels, n_seq_tokens, n_seqs); ++ ++ // In-place ring write-back target = the active sequences' conv-cache slot at kv_head, exactly the ++ // destination the baseline ggml_cpy wrote to (s_slot == 0). ++ const int64_t row_count = (conv_kernel_size - 1) * conv_channels; ++ const size_t row_size = ggml_row_size(conv_states_all->type, row_count); ++ ggml_tensor * conv_state_dst = ++ ggml_view_2d(ctx0, conv_states_all, row_count, n_seqs, conv_states_all->nb[1], kv_head * row_size); ++ cb(conv_state_dst, "conv_state_update", il); ++ ++ ggml_tensor * conv_output = ++ ggml_ssm_conv_update_inplace(ctx0, conv_states, conv_kernel, x_cur, conv_state_dst, /*fuse_silu=*/true); ++ cb(conv_output, "conv_output_silu", il); ++ ++ // the ring write is a side effect of the op; pull the op into the graph via the output ++ ggml_build_forward_expand(gf, conv_output); ++ ++ return conv_output; // [conv_channels, 1, n_seqs], already silu'd ++} ++ + // Step 2: gather-free recurrent attention. Mirrors mamba-base's get_ssm_rows pattern: the fused + // gated-DeltaNet op reads each sequence's prior state directly from the full cache via the s_copy + // ids (no ggml_get_rows materialization) and writes the new state in place (Step 1). 
The non-fused +diff --git a/src/models/models.h b/src/models/models.h +index 98b89e9..da0dd86 100644 +--- a/src/models/models.h ++++ b/src/models/models.h +@@ -76,6 +76,20 @@ struct llm_build_delta_net_base : public llm_graph_context { + int64_t conv_channels, + int il); + ++ // Fused decode-time conv path (patch 0021). Replaces the concat + transpose + ssm_conv + silu + ++ // copy-back chain with a single ggml_ssm_conv_update_inplace op that reads the cached K-1 taps and ++ // the current token, computes the depthwise conv, folds silu, and writes the 1-token-shifted ring ++ // state back in place. Decode-only (n_seq_tokens == 1, n_rs_seq == 0). Returns the silu'd conv ++ // output: (conv_channels, 1, n_seqs). Bit-identical to the build_conv_state + ggml_ssm_conv chain. ++ ggml_tensor * build_conv_state_fused( ++ llm_graph_input_rs * inp, ++ ggml_tensor * conv_states_all, ++ ggml_tensor * qkv_mixed, ++ ggml_tensor * conv_kernel, ++ int64_t conv_kernel_size, ++ int64_t conv_channels, ++ int il); ++ + // run delta-net attention and write the new recurrent state(s) back to ssm_states_all + // s: (head_v_dim, head_v_dim, num_v_heads, n_seqs); returns output: (head_v_dim, num_v_heads, n_seq_tokens, n_seqs) + ggml_tensor * build_recurrent_attn( +diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp +index 0874c43..b6dcc5f 100644 +--- a/src/models/qwen35.cpp ++++ b/src/models/qwen35.cpp +@@ -383,15 +383,26 @@ ggml_tensor * llama_model_qwen35::graph::build_layer_attn_linear( + const int64_t conv_kernel_size = conv_kernel->ne[0]; + const int64_t conv_channels = d_inner + 2 * hparams.ssm_n_group * hparams.ssm_d_state; + +- ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il); ++ // Patch 0021: on the single-token decode path, fuse the conv window assembly + depthwise conv + ++ // silu + the 1-token-shifted ring write-back into one in-place op (removes concat_cont, the ++ // transpose materialization, 
cpy_scalar and the separate silu). Bit-identical to the chain below. ++ const bool conv_decode_fused = (n_seq_tokens == 1) && (cparams.n_rs_seq == 0) && cparams.fused_gdn_ar; ++ ++ ggml_tensor * conv_qkv_mix; ++ if (conv_decode_fused) { ++ conv_qkv_mix = build_conv_state_fused(inp, conv_states_all, qkv_mixed, conv_kernel, ++ conv_kernel_size, conv_channels, il); ++ } else { ++ ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il); + +- ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel); +- cb(conv_output_proper, "conv_output_raw", il); ++ ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel); ++ cb(conv_output_proper, "conv_output_raw", il); + +- ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper); +- cb(conv_output_silu, "conv_output_silu", il); ++ ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper); ++ cb(conv_output_silu, "conv_output_silu", il); + +- ggml_tensor * conv_qkv_mix = conv_output_silu; ++ conv_qkv_mix = conv_output_silu; ++ } + + // Calculate the total conv dimension + int64_t qkv_dim = head_k_dim * num_k_heads * 2 + head_v_dim * num_v_heads; +diff --git a/src/models/qwen35moe.cpp b/src/models/qwen35moe.cpp +index 1f6f643..c7c7c44 100644 +--- a/src/models/qwen35moe.cpp ++++ b/src/models/qwen35moe.cpp +@@ -407,15 +407,26 @@ ggml_tensor * llama_model_qwen35moe::graph::build_layer_attn_linear( + const int64_t conv_kernel_size = conv_kernel->ne[0]; + const int64_t conv_channels = d_inner + 2 * hparams.ssm_n_group * hparams.ssm_d_state; + +- ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il); ++ // Patch 0021: on the single-token decode path, fuse the conv window assembly + depthwise conv + ++ // silu + the 1-token-shifted ring write-back into one in-place op (removes concat_cont, the ++ // transpose materialization, cpy_scalar and the separate 
silu). Bit-identical to the chain below. ++ const bool conv_decode_fused = (n_seq_tokens == 1) && (cparams.n_rs_seq == 0) && cparams.fused_gdn_ar; ++ ++ ggml_tensor * conv_qkv_mix; ++ if (conv_decode_fused) { ++ conv_qkv_mix = build_conv_state_fused(inp, conv_states_all, qkv_mixed, conv_kernel, ++ conv_kernel_size, conv_channels, il); ++ } else { ++ ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il); + +- ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel); +- cb(conv_output_proper, "conv_output_raw", il); ++ ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel); ++ cb(conv_output_proper, "conv_output_raw", il); + +- ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper); +- cb(conv_output_silu, "conv_output_silu", il); ++ ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper); ++ cb(conv_output_silu, "conv_output_silu", il); + +- ggml_tensor * conv_qkv_mix = conv_output_silu; ++ conv_qkv_mix = conv_output_silu; ++ } + + // Calculate the total conv dimension + int64_t qkv_dim = head_k_dim * num_k_heads * 2 + head_v_dim * num_v_heads; +diff --git a/src/models/qwen3next.cpp b/src/models/qwen3next.cpp +index bfdf026..92749d1 100644 +--- a/src/models/qwen3next.cpp ++++ b/src/models/qwen3next.cpp +@@ -434,19 +434,30 @@ ggml_tensor * llama_model_qwen3next::graph::build_layer_attn_linear( + const int64_t conv_kernel_size = conv_kernel->ne[0]; + const int64_t conv_channels = d_inner + 2 * hparams.ssm_n_group * hparams.ssm_d_state; + +- ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il); ++ // Patch 0021: on the single-token decode path, fuse the conv window assembly + depthwise conv + ++ // silu + the 1-token-shifted ring write-back into one in-place op (removes concat_cont, the ++ // transpose materialization, cpy_scalar and the separate silu). 
Bit-identical to the chain below. ++ const bool conv_decode_fused = (n_seq_tokens == 1) && (cparams.n_rs_seq == 0) && cparams.fused_gdn_ar; ++ ++ ggml_tensor * conv_qkv_mix; ++ if (conv_decode_fused) { ++ conv_qkv_mix = build_conv_state_fused(inp, conv_states_all, qkv_mixed, conv_kernel, ++ conv_kernel_size, conv_channels, il); ++ } else { ++ ggml_tensor * conv_input = build_conv_state(inp, conv_states_all, qkv_mixed, conv_kernel_size, conv_channels, il); + +- ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs); +- state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs); +- cb(state, "state_predelta", il); ++ ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel); ++ cb(conv_output_proper, "conv_output_raw", il); + +- ggml_tensor * conv_output_proper = ggml_ssm_conv(ctx0, conv_input, conv_kernel); +- cb(conv_output_proper, "conv_output_raw", il); ++ ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper); ++ cb(conv_output_silu, "conv_output_silu", il); + +- ggml_tensor * conv_output_silu = ggml_silu(ctx0, conv_output_proper); +- cb(conv_output_silu, "conv_output_silu", il); ++ conv_qkv_mix = conv_output_silu; ++ } + +- ggml_tensor * conv_qkv_mix = conv_output_silu; ++ ggml_tensor * state = build_rs(inp, ssm_states_all, hparams.n_embd_s(), n_seqs); ++ state = ggml_reshape_4d(ctx0, state, head_v_dim, head_v_dim, num_v_heads, n_seqs); ++ cb(state, "state_predelta", il); + + // Calculate the total conv dimension + int64_t qkv_dim = head_k_dim * num_k_heads * 2 + head_v_dim * num_v_heads; +diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp +index 291c275..c7348d6 100644 +--- a/tests/test-backend-ops.cpp ++++ b/tests/test-backend-ops.cpp +@@ -3748,6 +3748,43 @@ struct test_ssm_conv_bias_silu : public test_case { + } + }; + ++// GGML_OP_SSM_CONV fused decode conv-update-in-place (ggml_ssm_conv_update_inplace, patch 0021). 
++// Validates the conv + silu output (dst) against the CPU reference across backends. The 1-token- ++// shifted ring write-back to conv_state_dst is a side effect (validated end-to-end by the greedy ++// md5 gate); here it just exercises the in-place write target as an op src. ++struct test_ssm_conv_update : public test_case { ++ const int64_t d_conv; ++ const int64_t channels; ++ const int64_t n_seqs; ++ ++ std::string op_desc(ggml_tensor * t) override { ++ GGML_UNUSED(t); ++ return "SSM_CONV_UPDATE"; ++ } ++ ++ std::string vars() override { ++ return VARS_TO_STR3(d_conv, channels, n_seqs); ++ } ++ ++ test_ssm_conv_update(int64_t d_conv = 4, int64_t channels = 256, int64_t n_seqs = 4) ++ : d_conv(d_conv), channels(channels), n_seqs(n_seqs) {} ++ ++ ggml_tensor * build_graph(ggml_context * ctx) override { ++ ggml_tensor * conv_states = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, d_conv - 1, channels, n_seqs); ++ ggml_tensor * conv_kernel = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, d_conv, channels); ++ ggml_tensor * x_cur = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, channels, 1, n_seqs); ++ ggml_tensor * conv_state_dst = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, (d_conv - 1) * channels, n_seqs); ++ ggml_set_name(conv_states, "conv_states"); ++ ggml_set_name(conv_kernel, "conv_kernel"); ++ ggml_set_name(x_cur, "x_cur"); ++ ggml_set_name(conv_state_dst, "conv_state_dst"); ++ ++ ggml_tensor * out = ggml_ssm_conv_update_inplace(ctx, conv_states, conv_kernel, x_cur, conv_state_dst, true); ++ ggml_set_name(out, "out"); ++ return out; ++ } ++}; ++ + // GGML_OP_SSM_SCAN + struct test_ssm_scan : public test_case { + const ggml_type type; +@@ -8355,6 +8392,16 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() { + } + } + ++ // fused decode conv-update-in-place (ggml_ssm_conv_update_inplace, patch 0021). channels must be ++ // a multiple of 128 for the CUDA SSM_CONV supports_op gate. 
++ for (int64_t d_conv : {3, 4}) { ++ for (int64_t channels : {256, 3328}) { ++ for (int64_t n_seqs : {1, 4, 32, 128}) { ++ test_cases.emplace_back(new test_ssm_conv_update(d_conv, channels, n_seqs)); ++ } ++ } ++ } ++ + test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 16, 1, 1024, 1, 32, 4)); // Mamba-1 + test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 128, 64, 16, 2, 32, 4)); // Mamba-2 + test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 256, 64, 8, 2, 32, 4)); // Falcon-H1 +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0022-qwen35-gdn-recurrence-occupancy-retune.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0022-qwen35-gdn-recurrence-occupancy-retune.patch new file mode 100644 index 000000000000..6b6eae468c8a --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0022-qwen35-gdn-recurrence-occupancy-retune.patch @@ -0,0 +1,403 @@ +From 8a3229f41d5b712e87901796dfae3faee1f2f07d Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Thu, 25 Jun 2026 20:32:55 +0200 +Subject: [PATCH] feat(paged): qwen35 gated-DeltaNet decode + occupancy/coalescing retune (patch 0022) + +Bit-exact occupancy retune of gated_delta_net_cuda, the B=128 decode recurrence +kernel. After the f32 verdict (vLLM carries the gated-DeltaNet temporal state in +float32 and moves the same ~805 MB/call as llama; the gap was pure DRAM bandwidth +efficiency on equal bytes - llama 73.4% vs vLLM 82.4% of the 273 GB/s GB10 peak), +the lever is a latency-coverage retune that keeps the per-column f32 reduction/FMA +order byte-identical (md5-gateable). The bf16-state plan stays shelved. + +Column folding: two new template params NUM_WARPS (default 4) and COLS_PER_WARP +(default 1). Each warp now owns COLS_PER_WARP columns of the 128x128 recurrent +state instead of 1, looping the existing per-column body over col, col+NUM_WARPS, +... 
within a per-block column tile of NUM_WARPS*COLS_PER_WARP columns; +grid.z = S_v / (NUM_WARPS*COLS_PER_WARP). The S_v rows of every column stay sharded +across the lanes by the same strided i = r*warp_size + lane mapping, and every +column's per-lane FMA accumulation and warp_reduce_sum butterfly are byte-for-byte +unchanged; only the (warp,block)->column assignment and visit order differ, which a +column's value provably does not depend on (columns are fully independent). This +raises per-warp memory-level parallelism ~COLS_PER_WARP-fold (independent +state-load bursts before any reduction + interleaved butterfly reductions hiding +each other's shfl latency), covering more DRAM latency on this bandwidth-bound +kernel. Every global access stays identically coalesced, so it is a scheduling / +latency-coverage win, not a coalescing change. The forbidden float4 state load +(which would repartition a lane to 4 contiguous rows and change the reduction +grouping) is NOT done, so the md5 stays invariant. The S_v=128 tile is +env-selectable (GDN_NW / GDN_CPW) for one-build re-tuning; default is the measured +GB10 winner (16, 8). + +GB10 (CUDA 13, sm_121, nsys CUPTI timing - HW counters perm-blocked): +gated_delta_net B=128 decode call (805.3 MB f32 R+W) 4.02 -> 3.49 ms/call, +200.3 -> 230.9 GB/s = 73.4% -> 84.6% of 273 GB/s peak (now above vLLM's 82.4%; +102.6% of vLLM's recurrence bandwidth). decode S_TG t/s (npp128 ntg128, -fa on): +dense 27b npl128 335.9 -> 373.2 (+11.1%), npl32 199.2 -> 207.6 (+4.2%); MoE +35b-a3b npl128 688.4 -> 745.7 (+8.3%), npl32 420.6 -> 440.0 (+4.6%). Prefill +unchanged. + +Bit-exact: greedy --temp 0 --seed 1 md5 byte-identical to the 0021 baseline on +both q36-27b-nvfp4 and q36-35b-a3b-nvfp4 (winner 16x8 and 4x1 control); +test-backend-ops -o GATED_DELTA_NET 36/36 PASS. 
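
The column-folding mapping described in the commit message above can be sanity-checked on the host. The following is a minimal C++ sketch, not part of the patch: it enumerates (blockIdx.z, threadIdx.y, cc) with the same index expression the kernel uses, `col = blockIdx.z * (NUM_WARPS*COLS_PER_WARP) + threadIdx.y + cc*NUM_WARPS`, and counts how often each state column is visited. The helper names `count_column_visits` and `covers_each_column_once` are hypothetical, and the sketch only checks the column assignment, not the per-column arithmetic.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side helper (not in the patch): count how often each of the
// s_v state columns is visited under the patch-0022 column-folding mapping.
std::vector<int> count_column_visits(int s_v, int num_warps, int cols_per_warp) {
    const int tile = num_warps * cols_per_warp;  // columns owned by one block
    assert(s_v % tile == 0);                     // mirrors the static_assert in launch_gdn_variant
    std::vector<int> visits(s_v, 0);
    const int grid_z = s_v / tile;               // grid.z = S_v / (NUM_WARPS*COLS_PER_WARP)
    for (int bz = 0; bz < grid_z; ++bz) {
        for (int warp = 0; warp < num_warps; ++warp) {        // threadIdx.y
            const int col_base = bz * tile + warp;
            for (int cc = 0; cc < cols_per_warp; ++cc) {      // per-warp column loop
                const int col = col_base + cc * num_warps;
                ++visits[col];
            }
        }
    }
    return visits;
}

// True when every column is owned by exactly one (block, warp, cc) triple.
bool covers_each_column_once(int s_v, int num_warps, int cols_per_warp) {
    for (int v : count_column_visits(s_v, num_warps, cols_per_warp)) {
        if (v != 1) {
            return false;
        }
    }
    return true;
}
```

For example, `covers_each_column_once(128, 16, 8)` checks the default tile and `covers_each_column_once(128, 4, 1)` the baseline. Since each column is owned by exactly one warp and the per-column body is unchanged, this is consistent with the claim that only the assignment and visit order differ between variants.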
+ +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + ggml/src/ggml-cuda/gated_delta_net.cu | 236 +++++++++++++++++--------- + 1 file changed, 157 insertions(+), 79 deletions(-) + +diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu +index 86d5e2a..d071d5a 100644 +--- a/ggml/src/ggml-cuda/gated_delta_net.cu ++++ b/ggml/src/ggml-cuda/gated_delta_net.cu +@@ -1,6 +1,8 @@ + #include "gated_delta_net.cuh" + #include "ggml-cuda/common.cuh" + ++#include ++ + // Step 2: gather only the NON-identity sequences' prior recurrent state from the full cache into a + // disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the + // destination slot by the recurrence kernel and are skipped here. One block per sequence. +@@ -29,8 +31,22 @@ static void ggml_cuda_gdn_gather_nonident(const float * cache, const int32_t * i + gdn_gather_nonident_kernel<<<(unsigned) n_seqs, 256, 0, stream>>>(cache, ids, rs_head, scratch, D, (int) n_seqs); + } + +-template +-__global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * 4, 2) ++// Occupancy/coalescing retune (patch 0022). Each warp owns COLS_PER_WARP columns of the recurrent ++// state instead of 1, looping the existing per-column body over col, col+NUM_WARPS, ... within a ++// per-block column tile of size NUM_WARPS*COLS_PER_WARP. The S_v rows of every column stay sharded ++// across the lanes by the SAME strided mapping i = r*warp_size + lane, and every column's per-lane ++// FMA accumulation and warp_reduce_sum butterfly are byte-for-byte unchanged. Only the ++// (warp,block)->column assignment and the order a warp visits its columns differ, and a column's ++// f32 value provably does not depend on either (columns are fully independent: column c reads only ++// its own S_v-float state slice plus the shared per-(token,head,seq) q/k/v/g/beta). 
So the result ++// and the stored final state are bit-identical to the COLS_PER_WARP==1 baseline (md5-gateable), ++// while per-warp memory-level parallelism rises ~COLS_PER_WARP-fold (COLS_PER_WARP independent ++// state-load bursts issued before any reduction, and the independent butterfly reductions interleave ++// to hide each other's shfl latency) which covers more DRAM latency on this bandwidth-bound kernel. ++// Every individual global access stays IDENTICALLY coalesced (32 consecutive lanes -> one 128B ++// sector), so this is a latency-coverage / scheduling win, not a coalescing change. ++template ++__global__ void __launch_bounds__((ggml_cuda_get_physical_warp_size() < S_v ? ggml_cuda_get_physical_warp_size() : S_v) * NUM_WARPS, MIN_BLOCKS) + gated_delta_net_cuda(const float * q, + const float * k, + const float * v, +@@ -59,9 +75,9 @@ gated_delta_net_cuda(const float * q, + int rs_head) { + const uint32_t h_idx = blockIdx.x; + const uint32_t sequence = blockIdx.y; +- // each warp owns one column, using warp-level primitives to reduce across rows ++ // each warp owns COLS_PER_WARP columns, using warp-level primitives to reduce across rows. + const int lane = threadIdx.x; +- const int col = blockIdx.z * blockDim.y + threadIdx.y; ++ const int col_base = blockIdx.z * (NUM_WARPS * COLS_PER_WARP) + threadIdx.y; + + const uint32_t iq1 = fastmodulo(h_idx, neqk1_magic); + const uint32_t iq3 = fastdiv(sequence, rq3_magic); +@@ -86,20 +102,25 @@ gated_delta_net_cuda(const float * q, + // writing the same slot per block (identity) is race-free. + const float * read_state = (ids != nullptr && ids[sequence] == rs_head + (int) sequence) + ? state_dst : curr_state; +- read_state += state_in_offset + col * S_v; ++ read_state += state_in_offset; + attn_data += (sequence * n_tokens * H + h_idx) * S_v; + + constexpr int warp_size = ggml_cuda_get_physical_warp_size() < S_v ? 
ggml_cuda_get_physical_warp_size() : S_v; + static_assert(S_v % warp_size == 0, "S_v must be a multiple of warp_size"); + constexpr int rows_per_lane = (S_v + warp_size - 1) / warp_size; +- float s_shard[rows_per_lane]; +- // state is stored transposed: M[col][i] = S[i][col], row col is contiguous ++ // per-column register shard of the recurrent state; state is stored transposed: M[col][i] = S[i][col]. ++ float s_shard[COLS_PER_WARP][rows_per_lane]; + + ggml_cuda_pdl_sync(); + #pragma unroll +- for (int r = 0; r < rows_per_lane; r++) { +- const int i = r * warp_size + lane; +- s_shard[r] = read_state[i]; ++ for (int cc = 0; cc < COLS_PER_WARP; cc++) { ++ const int col = col_base + cc * NUM_WARPS; ++ const float * rs = read_state + col * S_v; ++#pragma unroll ++ for (int r = 0; r < rows_per_lane; r++) { ++ const int i = r * warp_size + lane; ++ s_shard[cc][r] = rs[i]; ++ } + } + + for (int t = 0; t < n_tokens; t++) { +@@ -113,7 +134,7 @@ gated_delta_net_cuda(const float * q, + + const float beta_val = *beta_t; + +- // Cache k and q in registers ++ // Cache k and q in registers (shared across the COLS_PER_WARP columns of this warp). 
+ float k_reg[rows_per_lane]; + float q_reg[rows_per_lane]; + #pragma unroll +@@ -126,59 +147,69 @@ gated_delta_net_cuda(const float * q, + if constexpr (!KDA) { + const float g_val = expf(*g_t); + +- // kv[col] = (S^T @ k)[col] = sum_i S[i][col] * k[i] +- float kv_shard = 0.0f; + #pragma unroll +- for (int r = 0; r < rows_per_lane; r++) { +- kv_shard += s_shard[r] * k_reg[r]; +- } +- float kv_col = warp_reduce_sum(kv_shard); ++ for (int cc = 0; cc < COLS_PER_WARP; cc++) { ++ const int col = col_base + cc * NUM_WARPS; + +- // delta[col] = (v[col] - g * kv[col]) * beta +- float delta_col = (v_t[col] - g_val * kv_col) * beta_val; ++ // kv[col] = (S^T @ k)[col] = sum_i S[i][col] * k[i] ++ float kv_shard = 0.0f; ++#pragma unroll ++ for (int r = 0; r < rows_per_lane; r++) { ++ kv_shard += s_shard[cc][r] * k_reg[r]; ++ } ++ float kv_col = warp_reduce_sum(kv_shard); + +- // fused: S[i][col] = g * S[i][col] + k[i] * delta[col] +- // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i] +- float attn_partial = 0.0f; ++ // delta[col] = (v[col] - g * kv[col]) * beta ++ float delta_col = (v_t[col] - g_val * kv_col) * beta_val; ++ ++ // fused: S[i][col] = g * S[i][col] + k[i] * delta[col] ++ // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i] ++ float attn_partial = 0.0f; + #pragma unroll +- for (int r = 0; r < rows_per_lane; r++) { +- s_shard[r] = g_val * s_shard[r] + k_reg[r] * delta_col; +- attn_partial += s_shard[r] * q_reg[r]; +- } ++ for (int r = 0; r < rows_per_lane; r++) { ++ s_shard[cc][r] = g_val * s_shard[cc][r] + k_reg[r] * delta_col; ++ attn_partial += s_shard[cc][r] * q_reg[r]; ++ } + +- float attn_col = warp_reduce_sum(attn_partial); ++ float attn_col = warp_reduce_sum(attn_partial); + +- if (lane == 0) { +- attn_data[col] = attn_col * scale; ++ if (lane == 0) { ++ attn_data[col] = attn_col * scale; ++ } + } + } else { +- // kv[col] = sum_i g[i] * S[i][col] * k[i] +- float kv_shard = 0.0f; + #pragma unroll +- for (int r = 0; r < rows_per_lane; r++) { +- const 
int i = r * warp_size + lane; +- kv_shard += expf(g_t[i]) * s_shard[r] * k_reg[r]; +- } ++ for (int cc = 0; cc < COLS_PER_WARP; cc++) { ++ const int col = col_base + cc * NUM_WARPS; ++ ++ // kv[col] = sum_i g[i] * S[i][col] * k[i] ++ float kv_shard = 0.0f; ++#pragma unroll ++ for (int r = 0; r < rows_per_lane; r++) { ++ const int i = r * warp_size + lane; ++ kv_shard += expf(g_t[i]) * s_shard[cc][r] * k_reg[r]; ++ } + +- float kv_col = warp_reduce_sum(kv_shard); ++ float kv_col = warp_reduce_sum(kv_shard); + +- // delta[col] = (v[col] - kv[col]) * beta +- float delta_col = (v_t[col] - kv_col) * beta_val; ++ // delta[col] = (v[col] - kv[col]) * beta ++ float delta_col = (v_t[col] - kv_col) * beta_val; + +- // fused: S[i][col] = g[i] * S[i][col] + k[i] * delta[col] +- // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i] +- float attn_partial = 0.0f; ++ // fused: S[i][col] = g[i] * S[i][col] + k[i] * delta[col] ++ // attn[col] = (S^T @ q)[col] = sum_i S[i][col] * q[i] ++ float attn_partial = 0.0f; + #pragma unroll +- for (int r = 0; r < rows_per_lane; r++) { +- const int i = r * warp_size + lane; +- s_shard[r] = expf(g_t[i]) * s_shard[r] + k_reg[r] * delta_col; +- attn_partial += s_shard[r] * q_reg[r]; +- } ++ for (int r = 0; r < rows_per_lane; r++) { ++ const int i = r * warp_size + lane; ++ s_shard[cc][r] = expf(g_t[i]) * s_shard[cc][r] + k_reg[r] * delta_col; ++ attn_partial += s_shard[cc][r] * q_reg[r]; ++ } + +- float attn_col = warp_reduce_sum(attn_partial); ++ float attn_col = warp_reduce_sum(attn_partial); + +- if (lane == 0) { +- attn_data[col] = attn_col * scale; ++ if (lane == 0) { ++ attn_data[col] = attn_col * scale; ++ } + } + } + +@@ -190,11 +221,15 @@ gated_delta_net_cuda(const float * q, + const int64_t state_size_per_token = S_v * S_v * H * n_seqs; // per-slot stride in output + const int target_slot = (int) n_tokens - 1 - t; + if (target_slot >= 0 && target_slot < K) { +- float * curr_state = (dst + attn_score_elems) + target_slot * 
state_size_per_token + state_out_offset; + #pragma unroll +- for (int r = 0; r < rows_per_lane; r++) { +- const int i = r * warp_size + lane; +- curr_state[col * S_v + i] = s_shard[r]; ++ for (int cc = 0; cc < COLS_PER_WARP; cc++) { ++ const int col = col_base + cc * NUM_WARPS; ++ float * curr_state = (dst + attn_score_elems) + target_slot * state_size_per_token + state_out_offset; ++#pragma unroll ++ for (int r = 0; r < rows_per_lane; r++) { ++ const int i = r * warp_size + lane; ++ curr_state[col * S_v + i] = s_shard[cc][r]; ++ } + } + } + } +@@ -202,13 +237,48 @@ gated_delta_net_cuda(const float * q, + + if constexpr (!keep_rs_t) { + #pragma unroll +- for (int r = 0; r < rows_per_lane; r++) { +- const int i = r * warp_size + lane; +- state[col * S_v + i] = s_shard[r]; ++ for (int cc = 0; cc < COLS_PER_WARP; cc++) { ++ const int col = col_base + cc * NUM_WARPS; ++#pragma unroll ++ for (int r = 0; r < rows_per_lane; r++) { ++ const int i = r * warp_size + lane; ++ state[col * S_v + i] = s_shard[cc][r]; ++ } + } + } + } + ++// Default column-folding tile for the S_v==128 decode/prefill path (the GDN head dim of this model). ++// Measured winner of the bit-exact occupancy sweep (patch 0022). Override at runtime for the sweep ++// via GDN_NW / GDN_CPW; all selectable variants are bit-identical, only %peak differs. 
++#ifndef GDN_DEFAULT_NW ++#define GDN_DEFAULT_NW 16 ++#endif ++#ifndef GDN_DEFAULT_CPW ++#define GDN_DEFAULT_CPW 8 ++#endif ++ ++template ++static void launch_gdn_variant( ++ const float * q_d, const float * k_d, const float * v_d, ++ const float * g_d, const float * b_d, const float * s_d, ++ float * dst_d, float * state_dst_d, const int32_t * ids_d, int rs_head, ++ int64_t H, int64_t n_tokens, int64_t n_seqs, ++ int64_t sq1, int64_t sq2, int64_t sq3, ++ int64_t sv1, int64_t sv2, int64_t sv3, ++ int64_t sb1, int64_t sb2, int64_t sb3, ++ const uint3 neqk1_magic, const uint3 rq3_magic, ++ float scale, int K, int warp_size, cudaStream_t stream) { ++ static_assert(S_v % (NUM_WARPS * COLS_PER_WARP) == 0, "NUM_WARPS*COLS_PER_WARP must divide S_v"); ++ dim3 grid_dims(H, n_seqs, S_v / (NUM_WARPS * COLS_PER_WARP)); ++ dim3 block_dims(warp_size <= S_v ? warp_size : S_v, NUM_WARPS, 1); ++ const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(grid_dims, block_dims, 0, stream); ++ ggml_cuda_kernel_launch(gated_delta_net_cuda, launch_params, ++ q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, ++ n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, ++ sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head); ++} ++ + template + static void launch_gated_delta_net( + const float * q_d, const float * k_d, const float * v_d, +@@ -223,47 +293,55 @@ static void launch_gated_delta_net( + float scale, int K, cudaStream_t stream) { + //TODO: Add chunked kernel for even faster pre-fill + const int warp_size = ggml_cuda_info().devices[ggml_cuda_get_device()].warp_size; +- const int num_warps = 4; +- dim3 grid_dims(H, n_seqs, (S_v + num_warps - 1) / num_warps); +- dim3 block_dims(warp_size <= S_v ? 
warp_size : S_v, num_warps, 1); + + const uint3 neqk1_magic = init_fastdiv_values(neqk1); + const uint3 rq3_magic = init_fastdiv_values(rq3); + +- int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc; ++#define GDN_LAUNCH_ARGS \ ++ q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head, \ ++ H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, sb1, sb2, sb3, \ ++ neqk1_magic, rq3_magic, scale, K, warp_size, stream + +- const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params(grid_dims, block_dims, 0, stream); + switch (S_v) { + case 16: +- ggml_cuda_kernel_launch(gated_delta_net_cuda<16, KDA, keep_rs_t>, launch_params, +- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, +- n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, +- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head); ++ launch_gdn_variant<16, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS); + break; + case 32: +- ggml_cuda_kernel_launch(gated_delta_net_cuda<32, KDA, keep_rs_t>, launch_params, +- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, +- n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, +- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head); ++ launch_gdn_variant<32, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS); + break; +- case 64: { +- ggml_cuda_kernel_launch(gated_delta_net_cuda<64, KDA, keep_rs_t>, launch_params, +- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, +- n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, +- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head); ++ case 64: ++ launch_gdn_variant<64, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS); + break; +- } + case 128: { +- ggml_cuda_kernel_launch(gated_delta_net_cuda<128, KDA, keep_rs_t>, launch_params, +- q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, +- n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, +- sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head); ++ // Bit-exact occupancy/coalescing retune (patch 0022): fold COLS_PER_WARP 
columns per warp ++ // to raise per-warp memory-level parallelism on this bandwidth-bound recurrence. Default is ++ // the measured winner; GDN_NW / GDN_CPW override it for the one-build %peak sweep (every ++ // selectable {num_warps, cols} is bit-identical, so the sweep cannot change the md5). ++ static const int gdn_nw = []{ const char * e = getenv("GDN_NW"); return e ? atoi(e) : GDN_DEFAULT_NW; }(); ++ static const int gdn_cpw = []{ const char * e = getenv("GDN_CPW"); return e ? atoi(e) : GDN_DEFAULT_CPW; }(); ++ // NUM_WARPS in {4,8,16} x COLS_PER_WARP ladder (all <=512 threads/block, no 1024-thread ++ // .minnctapersm warnings). Measured GB10 %peak: (4,1)=73 baseline ... (16,4)=82 ... ++ // (16,8)=84.7 winner ~ tied with (8,8)/(8,16)/(32,4); the plateau is just above vLLM (82.4). ++ if (gdn_nw == 4 && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS); ++ else if (gdn_nw == 4 && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 4, 2, 2>(GDN_LAUNCH_ARGS); ++ else if (gdn_nw == 4 && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 4, 4, 2>(GDN_LAUNCH_ARGS); ++ else if (gdn_nw == 8 && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 8, 1, 2>(GDN_LAUNCH_ARGS); ++ else if (gdn_nw == 8 && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 8, 2, 2>(GDN_LAUNCH_ARGS); ++ else if (gdn_nw == 8 && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 8, 4, 2>(GDN_LAUNCH_ARGS); ++ else if (gdn_nw == 8 && gdn_cpw == 8) launch_gdn_variant<128, KDA, keep_rs_t, 8, 8, 2>(GDN_LAUNCH_ARGS); ++ else if (gdn_nw == 16 && gdn_cpw == 1) launch_gdn_variant<128, KDA, keep_rs_t, 16, 1, 2>(GDN_LAUNCH_ARGS); ++ else if (gdn_nw == 16 && gdn_cpw == 2) launch_gdn_variant<128, KDA, keep_rs_t, 16, 2, 2>(GDN_LAUNCH_ARGS); ++ else if (gdn_nw == 16 && gdn_cpw == 4) launch_gdn_variant<128, KDA, keep_rs_t, 16, 4, 2>(GDN_LAUNCH_ARGS); ++ else if (gdn_nw == 16 && gdn_cpw == 8) launch_gdn_variant<128, KDA, keep_rs_t, 16, 8, 2>(GDN_LAUNCH_ARGS); ++ else 
launch_gdn_variant<128, KDA, keep_rs_t, GDN_DEFAULT_NW, GDN_DEFAULT_CPW, 2>(GDN_LAUNCH_ARGS); + break; + } + default: + GGML_ABORT("fatal error"); + break; + } ++ ++#undef GDN_LAUNCH_ARGS + } + + void ggml_cuda_op_gated_delta_net(ggml_backend_cuda_context & ctx, ggml_tensor * dst) { +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0023-qwen35moe-nvfp4-quant-dedup.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0023-qwen35moe-nvfp4-quant-dedup.patch new file mode 100644 index 000000000000..566baa391658 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0023-qwen35moe-nvfp4-quant-dedup.patch @@ -0,0 +1,144 @@ +From f7409c2de2868a6a048d3c333329468b4cc9e483 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Thu, 25 Jun 2026 23:47:25 +0200 +Subject: [PATCH] feat(paged): qwen35moe NVFP4 activation-quantize de-dup + (patch 0023) + +Bit-exact decode/prefill lever for the MoE (qwen3.5moe) NVFP4 path. ggml's +mul_mat_id quantizes the EXPERT-GATHERED activation rows (ne11_flat = +ne12*n_expert_used). For the broadcast up/gate projections (ne11 == 1) every +expert of a token receives the SAME token activation, so the stock path +re-quantizes each token n_expert_used times. quantize_mmq_nvfp4 produces each +block as a pure per-thread function of its 16 consecutive inputs (no cross-thread +reduction), so the gathered blocks are byte-identical across the experts. + +Lever: when ne11 == 1, quantize the ne12 UNIQUE token activations once, then +gather the resulting block_fp4_mmq rows into the expert-gathered layout keyed by +ids_src1 with a coalesced uint4 copy (block_fp4_mmq == 9 uint4 == 144 B). Pure +byte copy of identical blocks, so the gathered buffer is byte-for-byte identical +to re-quantizing each gathered row; the GEMM is untouched. down_proj +(ne11 == n_expert_used, distinct per expert) keeps the stock path. 
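
The "quantize once, then gather" argument above can be illustrated on the host. This is a minimal C++ sketch under stated assumptions, not the patch's code: `toy_quantize_block` is a hypothetical stand-in for the real per-block quantizer (any pure function of one block's inputs works for the argument), `uint32_t` words stand in for the `uint4` words, and the block layout `kb*n_rows + row` follows the index comment in `gather_mmq_fp4`. All helper names are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Words = std::vector<uint32_t>;

constexpr int W   = 9;   // words per quantized block (stand-in for 9 uint4 per 144 B block)
constexpr int BLK = 16;  // inputs per toy block

// Toy stand-in for the quantizer: each block is a pure function of its own inputs.
static void toy_quantize_block(const float * x, uint32_t * out) {
    uint32_t h = 2166136261u;
    for (int i = 0; i < BLK; ++i) {
        h = (h ^ (uint32_t) (int32_t) (x[i] * 1024.0f)) * 16777619u;
    }
    for (int w = 0; w < W; ++w) {
        out[w] = h + (uint32_t) w;
    }
}

// Quantize n_rows rows; row r reads source row ids[r] (or r when ids is null).
// Output block index = kb*n_rows + r.
Words quantize_rows(const std::vector<std::vector<float>> & src, const std::vector<int> * ids,
                    int n_rows, int n_kb) {
    Words out((size_t) n_kb * n_rows * W);
    for (int kb = 0; kb < n_kb; ++kb) {
        for (int r = 0; r < n_rows; ++r) {
            const int s = ids ? (*ids)[r] : r;
            toy_quantize_block(src[s].data() + kb * BLK, &out[((size_t) kb * n_rows + r) * W]);
        }
    }
    return out;
}

// Same index math as gather_mmq_fp4, one loop iteration per copied word.
Words gather_blocks(const Words & unique, const std::vector<int> & ids,
                    int ne11_flat, int ne12_unique, int n_kb) {
    Words gathered((size_t) n_kb * ne11_flat * W);
    for (size_t t = 0; t < gathered.size(); ++t) {
        const int    w    = (int) (t % W);
        const size_t ib   = t / W;                       // destination block = kb*ne11_flat + j
        const int    j    = (int) (ib % ne11_flat);
        const int    kb   = (int) (ib / ne11_flat);
        const size_t ib_u = (size_t) kb * ne12_unique + ids[j];
        gathered[t] = unique[ib_u * W + w];
    }
    return gathered;
}

// Compare the stock path (re-quantize every gathered row) with the de-dup path.
bool dedup_matches_stock() {
    const int n_kb = 4, ne12_unique = 3;
    std::vector<std::vector<float>> tokens(ne12_unique, std::vector<float>(n_kb * BLK));
    for (int t = 0; t < ne12_unique; ++t) {
        for (int i = 0; i < n_kb * BLK; ++i) {
            tokens[t][i] = 0.25f * (t + 1) + 0.01f * i;
        }
    }
    const std::vector<int> ids = {0, 0, 1, 1, 2, 2};     // two gathered rows per token
    const int ne11_flat = (int) ids.size();
    const Words stock  = quantize_rows(tokens, &ids, ne11_flat, n_kb);
    const Words unique = quantize_rows(tokens, nullptr, ne12_unique, n_kb);
    return stock == gather_blocks(unique, ids, ne11_flat, ne12_unique, n_kb);
}
```

The equality holds for any quantizer that computes each block from that block's inputs alone, which is the property the commit message attributes to `quantize_mmq_nvfp4`; a quantizer with cross-block state would not satisfy it.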
+ +Measured GB10 (sm_121a), on top of HEAD 8a3229f (patch 0022), q36-35b-a3b-nvfp4: +- nsys decode-isolated: quantize_mmq_nvfp4 868 -> 457 ms/run (-411 ms), new + gather_mmq_fp4 +32 ms; net -379 ms of decode GPU-time. +- S_TG npl128 745.2 -> 758.1 t/s (+1.73%), npl32 +0.6%; prefill T_PP -4%. +- Dense q36-27b-nvfp4 byte-flat (no mul_mat_id): 373.24 t/s unchanged. + +Bit-exact gate (greedy --temp 0 --seed 1 md5, byte-identical to 0022): + q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439 (unchanged) + q36-35b-a3b-nvfp4 07db32c2bcb78d17a43ed18bc22705cd (de-dup on == off) + test-backend-ops MUL_MAT 1115/1115, MUL_MAT_ID 805/805. + +On by default; GGML_CUDA_MOE_QUANT_DEDUP=0 restores the stock re-quantize path. + +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + ggml/src/ggml-cuda/mmq.cu | 21 +++++++++++++++++-- + ggml/src/ggml-cuda/quantize.cu | 37 +++++++++++++++++++++++++++++++++ + ggml/src/ggml-cuda/quantize.cuh | 4 ++++ + 3 files changed, 60 insertions(+), 2 deletions(-) + +diff --git a/ggml/src/ggml-cuda/mmq.cu b/ggml/src/ggml-cuda/mmq.cu +index e1add5e..9933fa6 100644 +--- a/ggml/src/ggml-cuda/mmq.cu ++++ b/ggml/src/ggml-cuda/mmq.cu +@@ -1,3 +1,4 @@ ++#include + #include "common.cuh" + #include "mmq.cuh" + #include "quantize.cuh" +@@ -197,8 +198,24 @@ void ggml_cuda_mul_mat_q( + const int64_t s13 = src1->nb[3] / ts_src1; + + if (use_native_fp4) { +- quantize_mmq_fp4_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13, +- ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream); ++ // 0023: de-dup the broadcast (up/gate) quantize. ne11==1 means src1 is shared ++ // across experts, so quantize the ne12 unique tokens once and gather the blocks. ++ static const bool moe_quant_dedup = []{ ++ const char * e = getenv("GGML_CUDA_MOE_QUANT_DEDUP"); ++ return e ? 
atoi(e) != 0 : true; // 0023: on by default; GGML_CUDA_MOE_QUANT_DEDUP=0 disables ++ }(); ++ if (moe_quant_dedup && ne11 == 1) { ++ const size_t nbytes_unique = ne12*ne10_padded * sizeof(block_q8_1)/QK8_1 + ++ get_mmq_x_max_host(cc)*sizeof(block_q8_1_mmq); ++ ggml_cuda_pool_alloc src1_unique(ctx.pool(), nbytes_unique); ++ quantize_mmq_fp4_cuda(src1_d, nullptr, src1_unique.get(), src0->type, ne10, s12, 0, 0, ++ ne10_padded, ne12, 1, 1, stream); ++ gather_mmq_fp4_cuda(src1_unique.get(), ids_src1.get(), src1_q8_1.get(), ++ ne11_flat, ne12, ne10_padded, stream); ++ } else { ++ quantize_mmq_fp4_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13, ++ ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream); ++ } + } else { + quantize_mmq_q8_1_cuda(src1_d, ids_src1.get(), src1_q8_1.get(), src0->type, ne10, s11, s12, s13, + ne10_padded, ne11_flat, ne12_flat, ne13_flat, stream); +diff --git a/ggml/src/ggml-cuda/quantize.cu b/ggml/src/ggml-cuda/quantize.cu +index 39a500a..a7fd86f 100644 +--- a/ggml/src/ggml-cuda/quantize.cu ++++ b/ggml/src/ggml-cuda/quantize.cu +@@ -419,6 +419,43 @@ void quantize_mmq_q8_1_cuda( + } + } + ++// MoE NVFP4 quantize de-dup (0023): for the broadcast (up/gate) expert matmuls every ++// gathered row references one of ne12 unique token activations, so the stock path ++// re-quantizes each token n_expert_used times. Quantize the unique tokens once, then copy ++// the resulting block_fp4_mmq rows into the expert-gathered layout keyed by ids. This is a ++// pure byte copy of identical blocks => the gathered buffer is byte-identical to stock. 
++static __global__ void gather_mmq_fp4( ++ const uint4 * __restrict__ unique, const int32_t * __restrict__ ids, ++ uint4 * __restrict__ gathered, const int ne11_flat, const int ne12_unique, ++ const int64_t total_words) { ++ constexpr int W = (int) (sizeof(block_fp4_mmq) / sizeof(uint4)); // 9 uint4 per 144B block ++ const int64_t t = (int64_t) blockIdx.x * blockDim.x + threadIdx.x; ++ if (t >= total_words) { ++ return; ++ } ++ const int w = (int) (t % W); ++ const int64_t ib = t / W; // destination block index = kb*ne11_flat + j ++ const int j = (int) (ib % ne11_flat); ++ const int kb = (int) (ib / ne11_flat); ++ const int src = ids[j]; ++ const int64_t ib_u = (int64_t) kb * ne12_unique + src; ++ gathered[t] = unique[ib_u * W + w]; ++} ++ ++void gather_mmq_fp4_cuda( ++ const void * unique, const int32_t * ids, void * gathered, ++ int64_t ne11_flat, int64_t ne12_unique, int64_t ne0_padded, cudaStream_t stream) { ++ const int blocks_per_col = (int) ((ne0_padded + QK_K - 1) / QK_K); ++ constexpr int W = (int) (sizeof(block_fp4_mmq) / sizeof(uint4)); ++ const int64_t total_words = ne11_flat * (int64_t) blocks_per_col * W; ++ const int bs = 256; ++ const dim3 block_size(bs, 1, 1); ++ const dim3 num_blocks((unsigned) ((total_words + bs - 1) / bs), 1, 1); ++ gather_mmq_fp4<<<num_blocks, block_size, 0, stream>>>( ++ (const uint4 *) unique, ids, (uint4 *) gathered, ++ (int) ne11_flat, (int) ne12_unique, total_words); ++} ++ + void quantize_mmq_fp4_cuda( + const float * x, const int32_t * ids, void * vy, const ggml_type type_src0, + const int64_t ne00, const int64_t s01, const int64_t s02, const int64_t s03, +diff --git a/ggml/src/ggml-cuda/quantize.cuh b/ggml/src/ggml-cuda/quantize.cuh +index 768a3ae..7f64069 100644 +--- a/ggml/src/ggml-cuda/quantize.cuh ++++ b/ggml/src/ggml-cuda/quantize.cuh +@@ -26,6 +26,10 @@ void quantize_mmq_q8_1_cuda( + ggml_type type_src0, int64_t ne00, int64_t s01, int64_t s02, int64_t s03, + int64_t ne0, int64_t ne1, int64_t ne2, int64_t ne3, cudaStream_t stream); + ++void
gather_mmq_fp4_cuda(const void * unique, const int32_t * ids, void * gathered, ++ int64_t ne11_flat, int64_t ne12_unique, int64_t ne0_padded, ++ cudaStream_t stream); ++ + void quantize_mmq_fp4_cuda(const float * x, + const int32_t * ids, + void * vy, +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0024-paged-pool-burst-reclaim.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0024-paged-pool-burst-reclaim.patch new file mode 100644 index 000000000000..0b1841275bb3 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0024-paged-pool-burst-reclaim.patch @@ -0,0 +1,357 @@ +From a8a9d129ae2226a08a12c30ece697865c0fc85c4 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Fri, 26 Jun 2026 12:41:49 +0200 +Subject: [PATCH] feat(paged): paged-pool burst-reclaim (truncate + defrag + + slot release) (patch 0024) + +Fixes the paged-pool burst-degradation bug (OTHER_PATHS_INVESTIGATION.md section C +Part 2): on a long-lived llama-server with LLAMA_KV_PAGED=1, a high-fan-out prefill +burst strands KV blocks in the host-side paged pool, so a later lower-npl prefill +draws from a depleted/fragmented pool and its throughput collapses (the benchmark's +"restart per npl" crutch). Decode is unaffected. The fix changes only host-side +block accounting and placement, never KV values or compute, and is gated behind +LLAMA_KV_PAGED (LLAMA_PAGED_NO_RECLAIM=1 restores the pre-fix behavior). + +Fix-1 reclaim trailing blocks: PagedKVManager::truncate(seq, n_keep) frees every +block beyond ceil(n_keep/bs) (ref-counted); called from llama_kv_cache::seq_rm for +the p1==MAX && p0>0 partial-tail case so the manager tracks the kv-cache exactly. +Fix-2 defrag on empty: when the pool is fully idle, defrag_free_pool() relinks the +free queue into ascending block-id order (FreeBlockQueue::rebuild), preserving +content-cache hashes. 
+Fix-3 release on slot completion: server_slot::release() issues prompt_clear() +under the paged engine so a finished-idle slot returns its blocks promptly. + +Validation (DGX GB10, q36-27b-nvfp4 = qwen35 hybrid; HEAD f7409c2 = patch 0023): +- Bit-exact: greedy md5 identical across paged off / paged on / paged on+NO_RECLAIM + (5951a5b4d624ce891e22ab5fca9bc439), == the 0023 baseline. test-backend-ops + unaffected (no ggml op touched). +- Host unit test: truncate reclaims exactly 16 trailing blocks; defrag restores + ascending popleft order. UNIT PASS. +- Model A/B (one binary, NO_RECLAIM): fragmentation prefill ratio 0.944 -> 0.998; + 64 idle slots strand 2048 blocks, reclaim returns the pool to fresh (2527). +- Server A/B (FRESH-npl8 -> BURST-npl64 -> POST-npl8): POST-npl8 prefill collapses + 488 -> 44 t/s with NO_RECLAIM (the bug; investigation saw 507 -> 65), restored to + 532 t/s (fresh 525, within 1%) with the fix. Paged release-log count 17 -> 96 + (Fix-3 fires per slot completion). Canary tokens identical fresh-vs-post in both + arms (bit-exact serving). + +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + src/llama-kv-cache.cpp | 13 ++++++++++ + src/paged-alloc.cpp | 31 +++++++++++++++++++++++ + src/paged-alloc.h | 18 +++++++++++++ + src/paged-kv-manager.cpp | 45 +++++++++++++++++++++++++++++++++ + src/paged-kv-manager.h | 24 ++++++++++++++++++ + src/paged-prefix-api.cpp | 8 ++++++ + src/paged-prefix-api.h | 6 +++++ + tools/server/server-context.cpp | 17 +++++++++++++ + 8 files changed, 162 insertions(+) + +diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp +index 0351f86..21b8f1e 100644 +--- a/src/llama-kv-cache.cpp ++++ b/src/llama-kv-cache.cpp +@@ -425,6 +425,19 @@ bool llama_kv_cache::seq_rm(llama_seq_id seq_id, llama_pos p0, llama_pos p1) { + } + } + ++ // [paged 0024 Fix-1] Reclaim trailing blocks on a partial TAIL truncation ++ // (p1 == MAX, p0 > 0). 
llama-server issues seq_rm(slot, n_past, -1) on every ++ // reused slot and before a cross-request prefix splice; the kv-cache frees the ++ // cells [p0, end) but, without this, the paged manager keeps owning those ++ // blocks - the reclamation gap that leaks and fragments the pool across a ++ // burst. truncate() frees the blocks beyond ceil(p0/bs) so the manager's ++ // accounting tracks the kv-cache exactly. Gated so LLAMA_PAGED_NO_RECLAIM ++ // restores the pre-fix behavior for A/B. ++ if (paged_alloc::active() && paged_alloc::reclaim_active() && seq_id >= 0 && ++ p0 > 0 && p1 == std::numeric_limits<llama_pos>::max()) { ++ paged_alloc::truncate(this, (int) seq_to_stream[seq_id], (int) seq_id, (uint32_t) p0); ++ } ++ + if (seq_id >= 0) { + auto & cells = v_cells[seq_to_stream[seq_id]]; + auto & head = v_heads[seq_to_stream[seq_id]]; +diff --git a/src/paged-alloc.cpp b/src/paged-alloc.cpp +index c1027fb..ba98dd5 100644 +--- a/src/paged-alloc.cpp ++++ b/src/paged-alloc.cpp +@@ -14,6 +14,11 @@ bool active() { + return a; + } + ++bool reclaim_active() { ++ static const bool off = (std::getenv("LLAMA_PAGED_NO_RECLAIM") != nullptr); ++ return !off; ++} ++ + static bool debug() { + static const bool d = (std::getenv("LLAMA_KV_PAGED_DEBUG") != nullptr); + return d; +@@ -124,12 +129,28 @@ void commit(const void * cache, int stream, int seq, + } + } + ++void truncate(const void * cache, int stream, int seq, uint32_t n_keep) { ++ paged::PagedKVManager * mgr = find_mgr(cache, stream); ++ if (!mgr) { ++ return; ++ } ++ mgr->truncate(seq, (size_t) n_keep); // Fix-1: reclaim trailing blocks ++ mgr->defrag_free_pool(); // Fix-2: compact iff the pool emptied ++ if (debug()) { ++ fprintf(stderr, "[paged-alloc] truncate cache=%p stream=%d seq=%d keep<=%u (free=%zu)\n", ++ cache, stream, seq, n_keep, mgr->num_free_blocks()); ++ } ++} ++ + void release(const void * cache, int stream, int seq) { + paged::PagedKVManager * mgr = find_mgr(cache, stream); + if (!mgr) { + return; + } +
mgr->free(seq); // ref-counted: shared blocks survive while another seq holds them ++ if (reclaim_active()) { ++ mgr->defrag_free_pool(); // Fix-2: compact iff the pool emptied ++ } + if (debug()) { + fprintf(stderr, "[paged-alloc] released cache=%p stream=%d seq=%d (free=%zu)\n", + cache, stream, seq, mgr->num_free_blocks()); +@@ -163,4 +184,14 @@ size_t num_free(const void * cache, int stream) { + return mgr ? mgr->num_free_blocks() : 0; + } + ++size_t num_free_global() { ++ size_t total = 0; ++ for (auto & kv : g_managers) total += kv.second->num_free_blocks(); ++ return total; ++} ++ ++size_t num_managers() { ++ return g_managers.size(); ++} ++ + } // namespace paged_alloc +diff --git a/src/paged-alloc.h b/src/paged-alloc.h +index 88dedef..bfaf45b 100644 +--- a/src/paged-alloc.h ++++ b/src/paged-alloc.h +@@ -31,6 +31,12 @@ namespace paged_alloc { + // true iff env LLAMA_KV_PAGED is set (evaluated once). + bool active(); + ++// [paged 0024] The burst-reclaim fix (truncate + defrag-on-empty + slot release) ++// is on by default whenever the paged engine is active. LLAMA_PAGED_NO_RECLAIM=1 ++// restores the pre-fix behavior (no trailing-block reclaim, no compaction) for ++// A/B measurement. Evaluated once. ++bool reclaim_active(); ++ + // Place n_tokens logical positions [base, base+n_tokens) of (cache,stream,seq) + // on demand, appending their physical cell indices to `out`. pool_blocks = + // cells.size()/block_size is the stream's block budget. Returns false (leaving +@@ -58,6 +64,12 @@ int64_t slot(const void * cache, int stream, int seq, int pos); + void commit(const void * cache, int stream, int seq, + const std::vector & tokens, uint32_t block_size, uint32_t pool_blocks); + ++// [paged 0024 Fix-1] Reclaim the trailing blocks of (cache,stream,seq) beyond ++// logical position n_keep (ref-counted), mirroring a partial kv-cache seq_rm ++// [n_keep, end). 
When the stream's pool empties as a result, its free queue is ++// defragged to pristine contiguous order (Fix-2). No-op if no manager exists. ++void truncate(const void * cache, int stream, int seq, uint32_t n_keep); ++ + // Return one sequence's blocks to the pool (ref-counted; sequence end). + void release(const void * cache, int stream, int seq); + +@@ -69,4 +81,10 @@ void release_all(const void * cache); + int ref_cnt_at(const void * cache, int stream, int seq, int pos, uint32_t block_size); + size_t num_free(const void * cache, int stream); + ++// [paged 0024] Total free blocks summed across every live manager (all caches / ++// streams). Wrapper-agnostic, so it reports the real pool for hybrid / iSWA ++// models whose outer memory is not a llama_kv_cache. Diagnostics only. ++size_t num_free_global(); ++size_t num_managers(); ++ + } // namespace paged_alloc +diff --git a/src/paged-kv-manager.cpp b/src/paged-kv-manager.cpp +index 4c6ee4c..738b332 100644 +--- a/src/paged-kv-manager.cpp ++++ b/src/paged-kv-manager.cpp +@@ -104,6 +104,22 @@ void FreeBlockQueue::prepend_n(const std::vector& blocks) { + num_free_blocks += blocks.size(); + } + ++void FreeBlockQueue::rebuild(const std::vector& blocks) { ++ // Relink the intrusive list using THIS queue's stable fake head/tail nodes. ++ num_free_blocks = blocks.size(); ++ for (size_t i = 0; i < blocks.size(); ++i) { ++ blocks[i]->prev_free = (i == 0) ? &fake_head : blocks[i - 1]; ++ blocks[i]->next_free = (i + 1 < blocks.size()) ? 
blocks[i + 1] : &fake_tail; ++ } ++ if (!blocks.empty()) { ++ fake_head.next_free = blocks.front(); ++ fake_tail.prev_free = blocks.back(); ++ } else { ++ fake_head.next_free = &fake_tail; ++ fake_tail.prev_free = &fake_head; ++ } ++} ++ + std::vector FreeBlockQueue::get_all_free_blocks() const { + std::vector ret; + const KVCacheBlock* curr = fake_head.next_free; +@@ -199,6 +215,20 @@ void BlockPool::cache_full_blocks(const std::vector& req_blocks, + } + } + ++void BlockPool::defrag_free_queue() { ++ // Pool is fully idle: every non-null block is free (ref_cnt 0). Rebuild the ++ // free list in ascending block_id order so popleft hands out physically ++ // contiguous blocks again. Hashes / the content-cache map are left intact so ++ // a warm committed prefix stays re-hittable. ++ std::vector ordered; ++ ordered.reserve(ptrs_.size()); ++ for (KVCacheBlock* b : ptrs_) { ++ if (b->is_null) continue; ++ ordered.push_back(b); ++ } ++ free_queue_.rebuild(ordered); ++} ++ + // --------------------------------------------------------------------------- + // PagedKVManager (port of SingleTypeKVCacheManager / FullAttentionManager) + // --------------------------------------------------------------------------- +@@ -250,6 +280,21 @@ void PagedKVManager::free(int seq_id) { + req_to_blocks_.erase(it); + } + ++void PagedKVManager::truncate(int seq_id, size_t n_keep) { ++ auto it = req_to_blocks_.find(seq_id); ++ if (it == req_to_blocks_.end()) return; ++ auto & blocks = it->second; ++ const size_t keep = cdiv(n_keep, block_size_); // blocks covering [0, n_keep) ++ if (keep >= blocks.size()) return; // nothing trailing to reclaim ++ // Free the trailing blocks [keep, end) tail-first (vLLM eviction order). Their ++ // cells were just cleared by the partial seq_rm, so they are safe to reuse. 
++ std::vector ordered(blocks.rbegin(), ++ blocks.rbegin() + (blocks.size() - keep)); ++ pool_.free_blocks(ordered); ++ blocks.resize(keep); ++ if (blocks.empty()) req_to_blocks_.erase(it); ++} ++ + // FNV-1a chained block hash. Deterministic and prefix-sensitive; folds the parent + // hash into the seed so each block hash transitively encodes its whole prefix + // (behavioral parity with vLLM hash_block_tokens chaining; vLLM uses sha256 bytes). +diff --git a/src/paged-kv-manager.h b/src/paged-kv-manager.h +index 34decbc..e410d58 100644 +--- a/src/paged-kv-manager.h ++++ b/src/paged-kv-manager.h +@@ -47,6 +47,11 @@ public: + void append_n(const std::vector& blocks); + void prepend_n(const std::vector& blocks); + std::vector get_all_free_blocks() const; ++ // [paged 0024 Fix-2] Relink the intrusive free list to the given order using ++ // THIS queue's fake head/tail (the nodes' addresses are stable; a temporary ++ // FreeBlockQueue would leave dangling fake-node pointers). Used to restore a ++ // pristine, contiguous popleft order after a fragmenting burst drains. ++ void rebuild(const std::vector& blocks); + + private: + KVCacheBlock fake_head{-1}; +@@ -67,6 +72,14 @@ public: + size_t num_cached_blocks, size_t num_full_blocks, + const std::vector& block_hashes); + size_t get_num_free_blocks() const { return free_queue_.num_free_blocks; } ++ // [paged 0024 Fix-2] Total non-null blocks, and whether the pool is fully ++ // idle (every non-null block back in the free queue). defrag_free_queue() ++ // relinks the free queue into pristine ascending-block-id order; only valid ++ // when all_free() so no live request's block table is disturbed. Block hashes ++ // are preserved, so a warm committed prefix stays re-hittable. 
++ size_t total_blocks() const { return blocks_.size(); } ++ bool all_free() const { return free_queue_.num_free_blocks + 1 == blocks_.size(); } ++ void defrag_free_queue(); + + private: + bool maybe_evict_cached_block(KVCacheBlock* block); +@@ -94,6 +107,17 @@ public: + void free(int seq_id); + int block_size() const { return block_size_; } + ++ // [paged 0024 Fix-1] Reclaim the trailing blocks of seq_id beyond logical ++ // position n_keep: free every block at index >= ceil(n_keep/bs) (ref-counted, ++ // mirroring vLLM's free of a truncated block suffix). Called on a partial tail ++ // seq_rm [n_keep, end) so the manager's block accounting tracks the kv-cache ++ // exactly instead of stranding the blocks whose cells were just cleared. ++ void truncate(int seq_id, size_t n_keep); ++ ++ // [paged 0024 Fix-2] When no live request holds a block, relink the free ++ // queue into pristine contiguous order (undo a burst's scrambled free order). ++ void defrag_free_pool() { if (pool_.all_free()) pool_.defrag_free_queue(); } ++ + // Prefix caching (win 3). 
+ static uint64_t hash_block(uint64_t parent_hash, const std::vector& token_ids); + std::vector compute_block_hashes(const std::vector& token_ids) const; +diff --git a/src/paged-prefix-api.cpp b/src/paged-prefix-api.cpp +index 8573cd2..209cee8 100644 +--- a/src/paged-prefix-api.cpp ++++ b/src/paged-prefix-api.cpp +@@ -45,4 +45,12 @@ long num_free(llama_context * ctx) { + return (long) paged_alloc::num_free((const void *) kv, /*stream=*/0); + } + ++long num_free_global() { ++ return (long) paged_alloc::num_free_global(); ++} ++ ++long num_managers() { ++ return (long) paged_alloc::num_managers(); ++} ++ + } // namespace paged_prefix_api +diff --git a/src/paged-prefix-api.h b/src/paged-prefix-api.h +index 78a3864..8dd817e 100644 +--- a/src/paged-prefix-api.h ++++ b/src/paged-prefix-api.h +@@ -24,4 +24,10 @@ int ref_at(llama_context * ctx, llama_seq_id seq, int pos); + // Number of free blocks in the unified stream-0 pool, or 0 if no manager. + long num_free(llama_context * ctx); + ++// [paged 0024] Total free blocks across every live paged manager (all caches / ++// streams). Wrapper-agnostic, so it reports the real pool for hybrid / iSWA ++// models whose outer memory is not a llama_kv_cache. Diagnostics only. ++long num_free_global(); ++long num_managers(); ++ + } // namespace paged_prefix_api +diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp +index f7a114c..8c19cfb 100644 +--- a/tools/server/server-context.cpp ++++ b/tools/server/server-context.cpp +@@ -411,6 +411,23 @@ struct server_slot { + + reset(); + ++ // [paged 0024 Fix-3] Return this finished slot's paged blocks to the ++ // pool promptly. Stock llama-server keeps an idle slot's KV for its own ++ // next-prompt cache, but under the paged engine that strands blocks in ++ // idle slots after a high-fan-out burst, so a later low-npl run sees a ++ // depleted, fragmented pool and its prefill collapses. 
prompt_clear() ++ // issues a full seq_rm (clearing the cells AND, via the paged hook, ++ // releasing + defragging the blocks) and clears the slot-local prompt ++ // cache so the next reuse recomputes from a pristine pool; cross-request ++ // reuse still works through the committed paged content cache. Gated on ++ // LLAMA_KV_PAGED (LLAMA_PAGED_NO_RECLAIM opts out for A/B); stock ++ // (paged off) is byte-identical. ++ static const bool paged_release_on_idle = ++ getenv("LLAMA_KV_PAGED") != nullptr && getenv("LLAMA_PAGED_NO_RECLAIM") == nullptr; ++ if (paged_release_on_idle && prompt.n_tokens() > 0) { ++ prompt_clear(false); ++ } ++ + callback_on_release(id); + } + } +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0025-qwen35moe-nvfp4-moe-decode-regraph.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0025-qwen35moe-nvfp4-moe-decode-regraph.patch new file mode 100644 index 000000000000..dcbd9d800a6d --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0025-qwen35moe-nvfp4-moe-decode-regraph.patch @@ -0,0 +1,56 @@ +From 2f4f5ab7c9050f890ee1137ef9c8ee09dfcd9ae7 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Fri, 26 Jun 2026 16:52:21 +0200 +Subject: [PATCH] feat(paged): qwen35moe NVFP4 MoE-decode re-graph + (should_use_mmq graph-safe id-path) (patch 0025) + +The MUL_MAT_ID CUDA-graph guard (ggml-cuda.cu [TAG_MUL_MAT_ID_CUDA_GRAPHS]) disables CUDA graphs for +the whole decode step whenever a MUL_MAT_ID node has ne[2] > mmvq_mmid_max (8 for NVFP4 on sm_121), +because the per-expert host-loop fallback synchronizes the stream. But on Blackwell NVFP4 the path +actually taken is should_use_mmq()==true -> the grouped stream-k mul_mat_q id-branch, which launches +on one stream with NO host sync (no cudaStreamSynchronize/Memcpy in mmq.cu/mmid.cu). The disable is +therefore conservative; graphs are safe for the grouped path. 
+ +Env-gated (LLAMA_MOE_FORCE_GRAPHS, default-off = byte-identical to stock): when set and the node +takes the grouped MMQ path, keep CUDA graphs on for the MoE decode step. + +Measured (DGX GB10 sm_121, q36-35b-a3b-nvfp4, llama-batched-bench -fa on -npp128 -ntg128, decode_agg): + npl 8 226.0 -> 226.4 +0.2% (noise; ne2<=8 already on the MMVQ-graphed path) + npl 32 433.8 -> 452.7 +4.4% + npl 64 589.0 -> 605.9 +2.9% + npl 128 743.1 -> 757.1 +1.9% + +Bit-exact (graph replay re-issues identical kernels): test-backend-ops MUL_MAT_ID 806/806 CUDA0 OK; +parallel-greedy np16 (ne2=16>8) generated content byte-identical ON==OFF. + +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + ggml/src/ggml-cuda/ggml-cuda.cu | 12 +++++++++++- + 1 file changed, 11 insertions(+), 1 deletion(-) + +diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu +index cca7059..254d2e0 100644 +--- a/ggml/src/ggml-cuda/ggml-cuda.cu ++++ b/ggml/src/ggml-cuda/ggml-cuda.cu +@@ -3275,7 +3275,17 @@ static bool ggml_cuda_graph_check_compability(ggml_cgraph * cgraph) { + if (node->op == GGML_OP_MUL_MAT_ID) { + const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc; + const int mmvq_mmid_max = get_mmvq_mmid_max_batch(node->src[0]->type, cc); +- if (!ggml_is_quantized(node->src[0]->type) || node->ne[2] > mmvq_mmid_max) { ++ bool mmid_needs_sync = !ggml_is_quantized(node->src[0]->type) || node->ne[2] > mmvq_mmid_max; ++ // PROBE (bit-exact, env LLAMA_MOE_FORCE_GRAPHS): the grouped stream-k MMQ id-path is ++ // launched on-stream with no host sync (only the per-expert host-loop fallback syncs); ++ // when should_use_mmq() is true (Blackwell NVFP4 grouped path) the op is graph-safe ++ // even for ne[2] > mmvq_mmid_max, so graphs need not be disabled for the whole step. 
++ if (mmid_needs_sync && ggml_is_quantized(node->src[0]->type) && ++ getenv("LLAMA_MOE_FORCE_GRAPHS") != nullptr && ++ ggml_cuda_should_use_mmq(node->src[0]->type, cc, node->src[1]->ne[2], node->src[0]->ne[2])) { ++ mmid_needs_sync = false; ++ } ++ if (mmid_needs_sync) { + // under these conditions, the mul_mat_id operation will need to synchronize the stream, so we cannot use CUDA graphs + // TODO: figure out a way to enable for larger batch sizes, without hurting performance + // ref: https://github.com/ggml-org/llama.cpp/pull/18958 +-- +2.43.0 diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch new file mode 100644 index 000000000000..a37395f92868 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0028-qwen35-recurrent-state-gather-fusion.patch @@ -0,0 +1,578 @@ +From fafe8785c8595f53a51efec20cf84f9146437e0c Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Fri, 26 Jun 2026 22:58:47 +0200 +Subject: [PATCH] feat(paged): qwen35 recurrent-state gather fusion (patch + 0028) + +The MoE-gap groundtruth found k_get_rows_float to be the single biggest decode +kernel vLLM has no equivalent of (~5.2 ms/step MoE; also dense): vLLM updates its +gated-DeltaNet recurrent state in place, while llama ran a separate ggml_get_rows +gather. Patch 0019 fused the SSM-state gather; patch 0021 fused the conv compute +but kept a build_rs gather for the conv taps. This closes that residual. + +nsys located the residual k_get_rows as the conv-state tap gather in +build_conv_state_fused: a 24576-float (= n_embd_r = (d_conv-1)*(d_inner + +2*n_group*d_state)) row x 128 sequences, once per GDN layer per decode step +(~720 big ~115 us gathers / 24-step window). The SSM-state gather is already +fused by 0019, so this conv gather is the last k_get_rows in the GDN decode path. 
+ +New op ggml_ssm_conv_update_inplace_ids (reuses GGML_OP_SSM_CONV, discriminated +by a non-null src[4] = ids) takes the FULL conv cache + the s_copy ids and reads +each active sequence's prior taps directly from cache[ids[s]] in the kernel (no +ggml_get_rows). Identity sequences (ids[s] == rs_head + s, the AR-decode path) +read in place from the conv_state_dst write slot (the whole window is loaded into +registers before the ring write-back, so read==write is race-free); non-identity +sequences (reorder / rs_zero) are gathered into a disjoint scratch by a small +ssm_conv_gather_nonident_kernel first. Mirrors the 0019 in-place + ids gather +fusion. The read VALUES are unchanged; only the read path (gather -> indexed +in-kernel read) changes, so it is bit-identical to the build_rs gather + 0021 op. + +build_conv_state_fused now feeds the full cache + ids through the build_rs +get_state_rows lambda (rs_zero clear + extra-states copy still run around it). +Helps BOTH dense and MoE (shared GDN conv path). + +GATE test-backend-ops (CUDA0 vs CPU, 2/2 backends): SSM_CONV_UPDATE_IDS OK (new), +SSM_CONV_UPDATE OK, SSM_CONV OK, GATED_DELTA_NET OK, GET_ROWS OK. + +GATE greedy md5 (--temp 0 --seed 1 -n 48) BYTE-IDENTICAL both models: +q36-27b-nvfp4 5951a5b4d624ce891e22ab5fca9bc439, q36-35b-a3b-nvfp4 +07db32c2bcb78d17a43ed18bc22705cd (== baseline). + +nsys: k_get_rows_float float,float 10174 -> 9454 instances (720 fewer = 30 GDN +layers x 24 steps), 186.3 -> 102.8 ms; the 720 ~115 us conv gathers replaced by a +720 x ~1.1 us no-op ssm_conv_gather_nonident (all identity at steady decode). +MoE npl128 783.9 t/s (step 163.3 ms vs MOE_GAP 169.8 ms @0025), dense 377.3 t/s. 
+ +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + ggml/include/ggml.h | 20 ++++ + ggml/src/ggml-cpu/ops.cpp | 90 +++++++++++++++++- + ggml/src/ggml-cuda/ssm-conv.cu | 155 ++++++++++++++++++++++++++++++- + ggml/src/ggml.c | 62 +++++++++++++ + src/models/delta-net-base.cpp | 26 ++++-- + tests/test-backend-ops.cpp | 69 ++++++++++++++ + 6 files changed, 411 insertions(+), 11 deletions(-) + +diff --git a/ggml/include/ggml.h b/ggml/include/ggml.h +index 2a5cbce..5fa220a 100644 +--- a/ggml/include/ggml.h ++++ b/ggml/include/ggml.h +@@ -2463,6 +2463,26 @@ extern "C" { + struct ggml_tensor * conv_state_dst, + bool fuse_silu); + ++ // Gather-free variant of ggml_ssm_conv_update_inplace (patch 0028). Instead of a pre-gathered ++ // per-sequence tap scratch, it takes the FULL conv-state cache (`conv_states` = [K-1, channels, ++ // n_cells]) plus the per-sequence `ids` ([n_seqs], I32, = the recurrent-state s_copy) and reads ++ // each active sequence's prior taps directly from cache[ids[s]] inside the kernel -- no ++ // ggml_get_rows materialization (mirrors ggml_gated_delta_net_inplace_ids). Identity sequences ++ // (ids[s] == rs_head + s) are read in place from `conv_state_dst` (the write slot); any ++ // non-identity sequence (reorder / rs_zero remap) is gathered into a disjoint scratch by the ++ // backend first, so the read never aliases another sequence's in-place ring write -> race-free ++ // and bit-identical to the get_rows + ggml_ssm_conv_update_inplace path. op_params[0]=fuse_silu, ++ // op_params[1]=rs_head. Reuses GGML_OP_SSM_CONV, discriminated by a non-null src[4]. 
++ GGML_API struct ggml_tensor * ggml_ssm_conv_update_inplace_ids( ++ struct ggml_context * ctx, ++ struct ggml_tensor * conv_states, ++ struct ggml_tensor * conv_kernel, ++ struct ggml_tensor * x_cur, ++ struct ggml_tensor * conv_state_dst, ++ struct ggml_tensor * ids, ++ int rs_head, ++ bool fuse_silu); ++ + GGML_API struct ggml_tensor * ggml_ssm_scan( + struct ggml_context * ctx, + struct ggml_tensor * s, +diff --git a/ggml/src/ggml-cpu/ops.cpp b/ggml/src/ggml-cpu/ops.cpp +index 07ab9e5..515aae4 100644 +--- a/ggml/src/ggml-cpu/ops.cpp ++++ b/ggml/src/ggml-cpu/ops.cpp +@@ -9580,6 +9580,90 @@ static void ggml_compute_forward_ssm_conv_update_f32( + } + } + ++// Patch 0028: CPU reference for ggml_ssm_conv_update_inplace_ids (mirror of the CUDA ++// ssm_conv_update_ids_f32). Reads each active sequence's prior K-1 taps directly from the FULL conv ++// cache (src[0]) via ids (src[4]) -- identity sequences (ids[s] == rs_head + s) read in place from the ++// destination slot src[3], non-identity from cache[ids[s]] -- computes the depthwise conv with the ++// same ascending-tap FMA order, optionally folds silu, writes the conv output to dst, and writes the ++// 1-token-shifted ring state back in place into src[3]. The window is copied to a local before the ++// write so the identity (read == write slot) case is correct. Threads split over channels. 
++static void ggml_compute_forward_ssm_conv_update_ids_f32( ++ const ggml_compute_params * params, ++ ggml_tensor * dst) { ++ const ggml_tensor * conv_states = dst->src[0]; // FULL cache [K-1, channels, n_cells] ++ const ggml_tensor * conv_kernel = dst->src[1]; // [K, channels] ++ const ggml_tensor * x_cur = dst->src[2]; // [channels, 1, n_seqs] ++ ggml_tensor * cdst = dst->src[3]; // [(K-1)*channels, n_seqs] in-place ring target ++ const ggml_tensor * ids = dst->src[4]; // [n_seqs] I32 slot indices (s_copy) ++ ++ const int ith = params->ith; ++ const int nth = params->nth; ++ ++ const int64_t d_conv = conv_kernel->ne[0]; ++ const int64_t channels = conv_kernel->ne[1]; ++ const int64_t n_seqs = x_cur->ne[2]; ++ const bool apply_silu = ggml_get_op_params_i32(dst, 0) != 0; ++ const int32_t rs_head = ggml_get_op_params_i32(dst, 1); ++ ++ GGML_ASSERT(conv_states->nb[0] == sizeof(float)); ++ GGML_ASSERT(conv_kernel->nb[0] == sizeof(float)); ++ GGML_ASSERT(ids->type == GGML_TYPE_I32); ++ GGML_ASSERT(d_conv <= 8); ++ ++ const int64_t cache_row_stride = conv_states->nb[2] / sizeof(float); // (K-1)*channels ++ const int64_t w_stride = conv_kernel->nb[1] / sizeof(float); ++ const int64_t x_seq_stride = x_cur->nb[2] / sizeof(float); ++ const int64_t dst_seq_stride = dst->nb[2] / sizeof(float); ++ const int64_t cdst_seq_stride = cdst->nb[1] / sizeof(float); ++ ++ const float * cache_base = (const float *) conv_states->data; ++ const float * w_base = (const float *) conv_kernel->data; ++ const float * x_base = (const float *) x_cur->data; ++ float * cdst_base = (float *) cdst->data; ++ float * dst_base = (float *) dst->data; ++ const int32_t * ids_base = (const int32_t *) ids->data; ++ ++ const int64_t dc = (channels + nth - 1) / nth; ++ const int64_t c0 = dc * ith; ++ const int64_t c1 = MIN(c0 + dc, channels); ++ ++ for (int64_t s = 0; s < n_seqs; ++s) { ++ const int32_t r = ids_base[s]; ++ const bool ident = (r == rs_head + (int32_t) s); ++ // identity reads the K-1 taps in 
place from the destination slot; non-identity from cache[r]. ++ const float * states_seq = ident ++ ? (cdst_base + s * cdst_seq_stride) ++ : (cache_base + (int64_t) r * cache_row_stride); ++ for (int64_t c = c0; c < c1; ++c) { ++ const float * states_c = states_seq + c * (d_conv - 1); ++ const float * w_c = w_base + c * w_stride; ++ const float xc = x_base[s * x_seq_stride + c]; ++ ++ // window = [tap0 .. tap_{K-2}, xc], copied to a local before the (possibly aliasing) write ++ float window[8]; ++ for (int64_t j = 0; j < d_conv - 1; ++j) { ++ window[j] = states_c[j]; ++ } ++ window[d_conv - 1] = xc; ++ ++ // ascending-tap FMA: tap0*w0 + ... + tap_{K-2}*w_{K-2} + xc*w_{K-1} (matches ssm_conv) ++ float sumf = 0.0f; ++ for (int64_t j = 0; j < d_conv; ++j) { ++ sumf += window[j] * w_c[j]; ++ } ++ sumf += 0.0f; // matches ssm_conv `sumf += b` with b == 0 ++ ++ dst_base[s * dst_seq_stride + c] = apply_silu ? (sumf / (1.0f + expf(-sumf))) : sumf; ++ ++ // 1-token-shifted ring write-back: [tap1 .. 
tap_{K-2}, xc] ++ float * out_state = cdst_base + s * cdst_seq_stride + c * (d_conv - 1); ++ for (int64_t j = 0; j < d_conv - 1; ++j) { ++ out_state[j] = window[j + 1]; ++ } ++ } ++ } ++} ++ + void ggml_compute_forward_ssm_conv( + const ggml_compute_params * params, + ggml_tensor * dst) { +@@ -9587,7 +9671,11 @@ void ggml_compute_forward_ssm_conv( + case GGML_TYPE_F32: + { + if (dst->src[3] != nullptr) { +- ggml_compute_forward_ssm_conv_update_f32(params, dst); ++ if (dst->src[4] != nullptr) { ++ ggml_compute_forward_ssm_conv_update_ids_f32(params, dst); ++ } else { ++ ggml_compute_forward_ssm_conv_update_f32(params, dst); ++ } + } else { + ggml_compute_forward_ssm_conv_f32(params, dst); + } +diff --git a/ggml/src/ggml-cuda/ssm-conv.cu b/ggml/src/ggml-cuda/ssm-conv.cu +index e1af1cd..28b3cce 100644 +--- a/ggml/src/ggml-cuda/ssm-conv.cu ++++ b/ggml/src/ggml-cuda/ssm-conv.cu +@@ -226,6 +226,153 @@ static void ggml_cuda_op_ssm_conv_update(ggml_backend_cuda_context & ctx, ggml_t + } + } + ++// Patch 0028: gather only the NON-identity sequences' prior conv taps from the FULL conv cache into a ++// disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the ++// destination slot by the update kernel and are skipped here. One block per sequence. Mirrors ++// gdn_gather_nonident_kernel (the 0019 recurrent-state gather fusion). 
++static __global__ void ssm_conv_gather_nonident_kernel(const float * __restrict__ cache, ++ const int32_t * __restrict__ ids, int rs_head, ++ float * __restrict__ scratch, int row_stride, int n_seqs) { ++ const int s = blockIdx.x; ++ if (s >= n_seqs) { ++ return; ++ } ++ const int r = ids[s]; ++ if (r == rs_head + s) { ++ return; // identity: prior taps already live in the in-place destination slot ++ } ++ const float * src = cache + (int64_t) r * row_stride; ++ float * dst = scratch + (int64_t) s * row_stride; ++ for (int i = threadIdx.x; i < row_stride; i += blockDim.x) { ++ dst[i] = src[i]; ++ } ++} ++ ++// Patch 0028: gather-free fused conv update. Per (channel, sequence), read the K-1 prior taps from the ++// active sequence's cache slot via ids -- identity (ids[s] == rs_head + s) reads in place from ++// conv_state_dst (the same slot it writes; the whole window is loaded into registers before any write, ++// so it is race-free), non-identity reads the pre-gathered disjoint scratch -- then computes the ++// depthwise conv with the SAME ascending-tap FMA order as ssm_conv_update_f32, folds silu, writes the ++// conv output, and writes the 1-token-shifted ring state back in place. Bit-identical to the get_rows + ++// ssm_conv_update_f32 path: the read VALUES are the same; only the read POINTER changes. 
++template <bool apply_silu, int d_conv> ++static __global__ void ssm_conv_update_ids_f32(const float * __restrict__ nonident_scratch, ++ const float * __restrict__ conv_kernel, ++ const float * __restrict__ x_cur, ++ float * __restrict__ conv_state_dst, ++ float * __restrict__ dst, ++ const int32_t * __restrict__ ids, ++ const int rs_head, ++ const int channels, ++ const int scratch_seq_stride, ++ const int w_stride, ++ const int x_seq_stride, ++ const int dst_seq_stride, ++ const int cdst_seq_stride) { ++ const int c = blockIdx.x * blockDim.x + threadIdx.x; // channel ++ const int s = blockIdx.y; // sequence ++ if (c >= channels) { ++ return; ++ } ++ ++ const bool ident = (ids[s] == rs_head + s); ++ const float * states_c = ident ++ ? conv_state_dst + (int64_t) s * cdst_seq_stride + (int64_t) c * (d_conv - 1) ++ : nonident_scratch + (int64_t) s * scratch_seq_stride + (int64_t) c * (d_conv - 1); ++ const float * w_c = conv_kernel + (int64_t) c * w_stride; ++ const float xc = x_cur[(int64_t) s * x_seq_stride + c]; ++ ++ // window = [tap0 .. tap_{K-2}, current-token], same ordering as ssm_conv_update_f32 ++ float window[d_conv]; ++#pragma unroll ++ for (int j = 0; j < d_conv - 1; j++) { ++ window[j] = states_c[j]; ++ } ++ window[d_conv - 1] = xc; ++ ++ float sumf = 0.0f; ++#pragma unroll ++ for (int j = 0; j < d_conv; j++) { ++ sumf += window[j] * w_c[j]; ++ } ++ sumf += 0.0f; // matches ssm_conv_f32 `sumf += b` with b == 0 (qwen35 conv1d has no bias) ++ dst[(int64_t) s * dst_seq_stride + c] = apply_silu ? 
ggml_cuda_op_silu_single(sumf) : sumf; ++ ++ // 1-token-shifted ring write-back: drop the oldest tap, append the current token ++ float * out_state = conv_state_dst + (int64_t) s * cdst_seq_stride + (int64_t) c * (d_conv - 1); ++#pragma unroll ++ for (int j = 0; j < d_conv - 1; j++) { ++ out_state[j] = window[j + 1]; ++ } ++} ++ ++static void ggml_cuda_op_ssm_conv_update_ids(ggml_backend_cuda_context & ctx, ggml_tensor * dst) { ++ const ggml_tensor * conv_states = dst->src[0]; // FULL cache [K-1, channels, n_cells] ++ const ggml_tensor * conv_kernel = dst->src[1]; // [K, channels] ++ const ggml_tensor * x_cur = dst->src[2]; // [channels, 1, n_seqs] ++ const ggml_tensor * cdst = dst->src[3]; // [(K-1)*channels, n_seqs] in-place ring target ++ const ggml_tensor * ids = dst->src[4]; // [n_seqs] I32 slot indices (s_copy) ++ ++ const int64_t d_conv = conv_kernel->ne[0]; ++ const int64_t channels = conv_kernel->ne[1]; ++ const int64_t n_seqs = x_cur->ne[2]; ++ const bool apply_silu = ggml_get_op_params_i32(dst, 0) != 0; ++ const int rs_head = ggml_get_op_params_i32(dst, 1); ++ ++ GGML_ASSERT(conv_states->type == GGML_TYPE_F32 && conv_kernel->type == GGML_TYPE_F32); ++ GGML_ASSERT(x_cur->type == GGML_TYPE_F32 && cdst->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32); ++ GGML_ASSERT(ids->type == GGML_TYPE_I32); ++ GGML_ASSERT(conv_states->nb[0] == sizeof(float)); ++ GGML_ASSERT(conv_states->nb[1] == (size_t) (d_conv - 1) * sizeof(float)); ++ GGML_ASSERT(conv_kernel->nb[0] == sizeof(float)); ++ GGML_ASSERT(dst->ne[0] == channels && dst->ne[1] == 1 && dst->ne[2] == n_seqs); ++ ++ const float * cache_d = (const float *) conv_states->data; ++ const float * w_d = (const float *) conv_kernel->data; ++ const float * x_d = (const float *) x_cur->data; ++ float * cdst_d = (float *) cdst->data; ++ float * dst_d = (float *) dst->data; ++ const int32_t * ids_d = (const int32_t *) ids->data; ++ cudaStream_t stream = ctx.stream(); ++ ++ // n_embd_r = (K-1)*channels: the per-cell row 
stride of the full conv cache. ++ const int cache_row_stride = (int) (conv_states->nb[2] / sizeof(float)); ++ const int w_stride = (int) (conv_kernel->nb[1] / sizeof(float)); ++ const int x_seq_stride = (int) (x_cur->nb[2] / sizeof(float)); ++ const int dst_seq_stride = (int) (dst->nb[2] / sizeof(float)); ++ const int cdst_seq_stride = (int) (cdst->nb[1] / sizeof(float)); ++ ++ // Gather only the non-identity sequences' prior taps into a disjoint scratch (identity sequences ++ // read in place from cdst). The scratch is written here and read-only by the update kernel, so the ++ // update kernel never reads a slot another block writes -> race-free. No-op at steady AR decode. ++ ggml_cuda_pool_alloc<float> nonident_scratch(ctx.pool()); ++ float * scratch = nonident_scratch.alloc((size_t) cache_row_stride * n_seqs); ++ if (n_seqs > 0) { ++ ssm_conv_gather_nonident_kernel<<<(unsigned) n_seqs, 256, 0, stream>>>( ++ cache_d, ids_d, rs_head, scratch, cache_row_stride, (int) n_seqs); ++ } ++ ++ const int threads = 128; ++ const dim3 blocks((channels + threads - 1) / threads, (unsigned) n_seqs, 1); ++ ++ auto launch = [&](auto NC) { ++ constexpr int kNC = decltype(NC)::value; ++ if (apply_silu) { ++ ssm_conv_update_ids_f32<true, kNC><<<blocks, threads, 0, stream>>>(scratch, w_d, x_d, cdst_d, dst_d, ++ ids_d, rs_head, (int) channels, cache_row_stride, w_stride, x_seq_stride, dst_seq_stride, cdst_seq_stride); ++ } else { ++ ssm_conv_update_ids_f32<false, kNC><<<blocks, threads, 0, stream>>>(scratch, w_d, x_d, cdst_d, dst_d, ++ ids_d, rs_head, (int) channels, cache_row_stride, w_stride, x_seq_stride, dst_seq_stride, cdst_seq_stride); ++ } ++ }; ++ ++ switch (d_conv) { ++ case 3: launch(std::integral_constant<int, 3>{}); break; ++ case 4: launch(std::integral_constant<int, 4>{}); break; ++ default: GGML_ABORT("ssm_conv_update_ids only supports d_conv 3 or 4"); ++ } ++} ++ + template + static void ssm_conv_f32_cuda(const float * src0, const float * src1, const float * bias, const int src0_nb0, const int src0_nb1, + const int src0_nb2, const int src1_nb1, float * dst, const int 
dst_nb0, const int dst_nb1, +@@ -266,7 +413,13 @@ void ggml_cuda_op_ssm_conv(ggml_backend_cuda_context & ctx, ggml_tensor * dst, g + // silu of the decode conv path into a single kernel. + if (dst->src[3] != nullptr) { + GGML_ASSERT(bias_add_node == nullptr && silu_dst == nullptr); +- ggml_cuda_op_ssm_conv_update(ctx, dst); ++ // Patch 0028: a non-null src[4] (ids) selects the gather-free variant that reads each ++ // sequence's prior taps directly from the full cache via ids (no get_rows materialization). ++ if (dst->src[4] != nullptr) { ++ ggml_cuda_op_ssm_conv_update_ids(ctx, dst); ++ } else { ++ ggml_cuda_op_ssm_conv_update(ctx, dst); ++ } + return; + } + +diff --git a/ggml/src/ggml.c b/ggml/src/ggml.c +index 16b180f..dcc09bd 100644 +--- a/ggml/src/ggml.c ++++ b/ggml/src/ggml.c +@@ -5606,6 +5606,68 @@ struct ggml_tensor * ggml_ssm_conv_update_inplace( + return result; + } + ++// ggml_ssm_conv_update_inplace_ids ++// ++// Gather-free variant of ggml_ssm_conv_update_inplace (patch 0028). Instead of a pre-gathered ++// per-sequence tap scratch, it takes the FULL conv-state cache (`conv_states` = [K-1, channels, ++// n_cells]) plus the per-sequence `ids` (the recurrent-state s_copy) and reads each active sequence's ++// prior taps directly from cache[ids[s]] inside the kernel (no ggml_get_rows). Identity sequences ++// (ids[s] == rs_head + s) read in place from the `conv_state_dst` write slot; non-identity sequences ++// are gathered into a disjoint scratch by the backend first. Bit-identical to the get_rows + ++// ggml_ssm_conv_update_inplace path. Reuses GGML_OP_SSM_CONV, discriminated by a non-null src[4]. ++// op_params[1] carries rs_head. Mirrors the 0019 ggml_gated_delta_net_inplace_ids gather fusion. 
++struct ggml_tensor * ggml_ssm_conv_update_inplace_ids( ++ struct ggml_context * ctx, ++ struct ggml_tensor * conv_states, ++ struct ggml_tensor * conv_kernel, ++ struct ggml_tensor * x_cur, ++ struct ggml_tensor * conv_state_dst, ++ struct ggml_tensor * ids, ++ int rs_head, ++ bool fuse_silu) { ++ GGML_ASSERT(ggml_is_3d(conv_states)); ++ GGML_ASSERT(ggml_is_matrix(conv_kernel)); ++ GGML_ASSERT(ggml_is_3d(x_cur)); ++ GGML_ASSERT(ids != NULL && ids->type == GGML_TYPE_I32); ++ ++ const int64_t d_conv = conv_kernel->ne[0]; ++ const int64_t channels = conv_kernel->ne[1]; ++ const int64_t n_seqs = x_cur->ne[2]; ++ ++ GGML_ASSERT(conv_states->type == GGML_TYPE_F32); ++ GGML_ASSERT(conv_kernel->type == GGML_TYPE_F32); ++ GGML_ASSERT(x_cur->type == GGML_TYPE_F32); ++ GGML_ASSERT(conv_state_dst != NULL && conv_state_dst->type == GGML_TYPE_F32); ++ ++ // conv_states: FULL cache [K-1, channels, n_cells], contiguous taps per channel ++ GGML_ASSERT(conv_states->ne[0] == d_conv - 1); ++ GGML_ASSERT(conv_states->ne[1] == channels); ++ GGML_ASSERT(conv_states->nb[0] == sizeof(float)); ++ // x_cur: single decode token per sequence ++ GGML_ASSERT(x_cur->ne[0] == channels); ++ GGML_ASSERT(x_cur->ne[1] == 1); ++ // ids: one slot index per active sequence ++ GGML_ASSERT(ids->ne[0] == n_seqs); ++ // conv_state_dst: [(K-1)*channels, n_seqs] in-place ring write target ++ GGML_ASSERT(conv_state_dst->ne[0] == (d_conv - 1) * channels); ++ GGML_ASSERT(conv_state_dst->ne[1] >= n_seqs); ++ GGML_ASSERT(conv_state_dst->nb[0] == sizeof(float)); ++ ++ struct ggml_tensor * result = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, channels, 1, n_seqs); ++ ++ ggml_set_op_params_i32(result, 0, fuse_silu ? 
1 : 0); ++ ggml_set_op_params_i32(result, 1, rs_head); ++ ++ result->op = GGML_OP_SSM_CONV; ++ result->src[0] = conv_states; ++ result->src[1] = conv_kernel; ++ result->src[2] = x_cur; ++ result->src[3] = conv_state_dst; ++ result->src[4] = ids; ++ ++ return result; ++} ++ + // ggml_ssm_scan + + struct ggml_tensor * ggml_ssm_scan( +diff --git a/src/models/delta-net-base.cpp b/src/models/delta-net-base.cpp +index 58f3d0c..962f5eb 100644 +--- a/src/models/delta-net-base.cpp ++++ b/src/models/delta-net-base.cpp +@@ -548,25 +548,33 @@ ggml_tensor * llm_build_delta_net_base::build_conv_state_fused( + GGML_ASSERT(n_seq_tokens == 1); // single-token decode only + GGML_ASSERT(cparams.n_rs_seq == 0); // no rollback splits on this path + +- // Prior conv-state taps for the active sequences: [K-1, conv_channels, n_seqs]. Same get_rows +- // gather as the baseline build_conv_state read (tiny; not one of the eliminated buckets). +- ggml_tensor * conv_states = build_rs(inp, conv_states_all, hparams.n_embd_r(), n_seqs); +- conv_states = ggml_reshape_3d(ctx0, conv_states, conv_kernel_size - 1, conv_channels, n_seqs); +- cb(conv_states, "conv_states_reshaped", il); +- + // Current token, native (non-transposed) qkv_mixed: [conv_channels, 1, n_seqs]. + ggml_tensor * x_cur = ggml_reshape_3d(ctx0, qkv_mixed, conv_channels, n_seq_tokens, n_seqs); + + // In-place ring write-back target = the active sequences' conv-cache slot at kv_head, exactly the + // destination the baseline ggml_cpy wrote to (s_slot == 0). 
+- const int64_t row_count = (conv_kernel_size - 1) * conv_channels; ++ const int64_t row_count = (conv_kernel_size - 1) * conv_channels; // = n_embd_r + const size_t row_size = ggml_row_size(conv_states_all->type, row_count); + ggml_tensor * conv_state_dst = + ggml_view_2d(ctx0, conv_states_all, row_count, n_seqs, conv_states_all->nb[1], kv_head * row_size); + cb(conv_state_dst, "conv_state_update", il); + +- ggml_tensor * conv_output = +- ggml_ssm_conv_update_inplace(ctx0, conv_states, conv_kernel, x_cur, conv_state_dst, /*fuse_silu=*/true); ++ // Patch 0028: fuse the residual conv-state tap gather (the k_get_rows that build_conv_state's ++ // build_rs left firing -- ~the biggest single residual decode kernel, see MOE_GAP_VS_VLLM.md). ++ // Exactly like the 0019 SSM-state gather fusion, build_rs feeds the FULL conv cache + the s_copy ++ // ids into the op (via the get_state_rows lambda) and still performs the rs_zero clear and the ++ // extra-states copy around it; the op reads each active sequence's prior taps directly from ++ // cache[ids[s]] (identity sequences read in place from conv_state_dst), so the separate ++ // ggml_get_rows materialization is eliminated. The read VALUES are unchanged, only the read path ++ // (gather -> indexed in-kernel read) changes, so it is bit-identical to the build_rs gather. 
++ auto get_conv_op = [&](ggml_context * ctx, ggml_tensor * states, ggml_tensor * ids) -> ggml_tensor * { ++ // states = full conv-state cache reshaped 2d [n_embd_r, n_cells] ++ ggml_tensor * cache3d = ggml_reshape_3d(ctx, states, conv_kernel_size - 1, conv_channels, states->ne[1]); ++ return ggml_ssm_conv_update_inplace_ids(ctx, cache3d, conv_kernel, x_cur, conv_state_dst, ++ ids, (int) kv_head, /*fuse_silu=*/true); ++ }; ++ ++ ggml_tensor * conv_output = build_rs(inp, conv_states_all, hparams.n_embd_r(), n_seqs, get_conv_op); + cb(conv_output, "conv_output_silu", il); + + // the ring write is a side effect of the op; pull the op into the graph via the output +diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp +index b5e3048..302975f 100644 +--- a/tests/test-backend-ops.cpp ++++ b/tests/test-backend-ops.cpp +@@ -3793,6 +3793,65 @@ struct test_ssm_conv_update : public test_case { + } + }; + ++// GGML_OP_SSM_CONV gather-free fused decode conv-update via ids (ggml_ssm_conv_update_inplace_ids, ++// patch 0028). conv_states is the FULL cache; ids (a shuffled permutation of [0,n_seqs), rs_head=0) ++// selects each sequence's slot, exercising BOTH the identity in-place read (ids[s]==s) and the ++// non-identity cache read. Validates the conv + silu output (dst) against the CPU reference. 
++struct test_ssm_conv_update_ids : public test_case { ++ const int64_t d_conv; ++ const int64_t channels; ++ const int64_t n_seqs; ++ ++ std::string op_desc(ggml_tensor * t) override { ++ GGML_UNUSED(t); ++ return "SSM_CONV_UPDATE_IDS"; ++ } ++ ++ std::string vars() override { ++ return VARS_TO_STR3(d_conv, channels, n_seqs); ++ } ++ ++ test_ssm_conv_update_ids(int64_t d_conv = 4, int64_t channels = 256, int64_t n_seqs = 4) ++ : d_conv(d_conv), channels(channels), n_seqs(n_seqs) {} ++ ++ ggml_tensor * build_graph(ggml_context * ctx) override { ++ ggml_tensor * conv_states = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, d_conv - 1, channels, n_seqs); ++ ggml_tensor * conv_kernel = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, d_conv, channels); ++ ggml_tensor * x_cur = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, channels, 1, n_seqs); ++ ggml_tensor * conv_state_dst = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, (d_conv - 1) * channels, n_seqs); ++ ggml_tensor * ids = ggml_new_tensor_1d(ctx, GGML_TYPE_I32, n_seqs); ++ ggml_set_name(conv_states, "conv_states"); ++ ggml_set_name(conv_kernel, "conv_kernel"); ++ ggml_set_name(x_cur, "x_cur"); ++ ggml_set_name(conv_state_dst, "conv_state_dst"); ++ ggml_set_name(ids, "ids"); ++ ++ ggml_tensor * out = ggml_ssm_conv_update_inplace_ids(ctx, conv_states, conv_kernel, x_cur, ++ conv_state_dst, ids, /*rs_head=*/0, /*fuse_silu=*/true); ++ ggml_set_name(out, "out"); ++ return out; ++ } ++ ++ void initialize_tensors(ggml_context * ctx) override { ++ std::random_device rd; ++ std::default_random_engine rng(rd()); ++ for (ggml_tensor * t = ggml_get_first_tensor(ctx); t != NULL; t = ggml_get_next_tensor(ctx, t)) { ++ if (t->type == GGML_TYPE_I32) { ++ // ids: shuffled permutation of [0, n_seqs) into the full cache (rs_head == 0), so some ++ // sequences are identity (ids[s] == s, in-place read) and some are not (scratch read). 
++ std::vector<int32_t> data(t->ne[0]); ++ for (int i = 0; i < t->ne[0]; i++) { ++ data[i] = i; ++ } ++ std::shuffle(data.begin(), data.end(), rng); ++ ggml_backend_tensor_set(t, data.data(), 0, t->ne[0] * sizeof(int32_t)); ++ } else { ++ init_tensor_uniform(t); ++ } ++ } ++ } ++}; ++ + // GGML_OP_SSM_SCAN + struct test_ssm_scan : public test_case { + const ggml_type type; +@@ -8504,6 +8563,16 @@ static std::vector<std::unique_ptr<test_case>> make_test_cases_eval() { + } + } + ++ // gather-free fused decode conv-update via ids (ggml_ssm_conv_update_inplace_ids, patch 0028). ++ // channels must be a multiple of 128 for the CUDA SSM_CONV supports_op gate. ++ for (int64_t d_conv : {3, 4}) { ++ for (int64_t channels : {256, 3328}) { ++ for (int64_t n_seqs : {1, 4, 32, 128}) { ++ test_cases.emplace_back(new test_ssm_conv_update_ids(d_conv, channels, n_seqs)); ++ } ++ } ++ } ++ + test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 16, 1, 1024, 1, 32, 4)); // Mamba-1 + test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 128, 64, 16, 2, 32, 4)); // Mamba-2 + test_cases.emplace_back(new test_ssm_scan(GGML_TYPE_F32, 256, 64, 8, 2, 32, 4)); // Falcon-H1 +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0029-qwen35-blocktable-within-step-cache.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0029-qwen35-blocktable-within-step-cache.patch new file mode 100644 index 000000000000..98a085af37af --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0029-qwen35-blocktable-within-step-cache.patch @@ -0,0 +1,176 @@ +From e2acb3bca4d12ecef4964a214d397fc91ecfcebc Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Sat, 27 Jun 2026 03:45:19 +0200 +Subject: [PATCH] feat(paged): block-table within-step host cache (patch 0029) + +Lever 5 (host pipeline). 
get_block_table() is called once per full-attention +layer per decode step, but the KV cell layout (and therefore the block table) +is fixed for the whole step: it only changes in apply() when the ubatch's slots +are committed. The old path recomputed the full table on every layer. + +This caches the table the first time it is built in a step and reuses the bytes +(memcpy) for every subsequent full-attention layer, invalidating the cache in +apply(). The reused bytes are identical to a fresh compute, so the change is +bit-exact. Toggle off with LLAMA_PAGED_NO_BT_CACHE=1. + +Measured host-side get_block_table time (llama-batched-bench, npp128 ntg128 +npl128, cache OFF -> ON): +- MoE q36-35b-a3b-nvfp4: 112.94 -> 14.82 ms (-87%) +- dense q36-27b-nvfp4 : 193.78 -> 16.90 ms (-91%) + +Throughput: dense is partly host-bound and gains (TG 364.8 -> 374.7 t/s, ++2.7%, ~95.8% of the vLLM 391 t/s reference @npl128). MoE decode is compute- +bound (FP4 GEMM dominates) so the saved host time is off the critical path and +TG is flat (752.2 -> 757.0 t/s). The cache is therefore a pure pipeline cleanup, +not a numeric change. + +Bit-exact, per path (llama-completion --temp 0 --seed 1, 48 tok): +- non-paged MoE = 07db32c2bcb78d17a43ed18bc22705cd (unchanged baseline) +- paged MoE = 8cb0ce23777bf55f92f63d0292c756b0 (paged baseline) +- paged MoE cache OFF == cache ON (both 8cb0ce23) +- dense non-paged == dense paged = 5951a5b4d624ce891e22ab5fca9bc439 + +The paged-MoE md5 (8cb0ce23) differs from the non-paged md5 (07db32c2) by a +benign FP-accumulation-order difference of the paged attention reduction, not a +bug: KL-divergence vs the f16 reference (16 chunks, c512) gives KLD(paged||f16) += 0.13600 <= KLD(nonpaged||f16) = 0.13660 and PPL(paged) = 7.4009 ~ +PPL(nonpaged) = 7.3896 (within +/- 0.29). See PAGED_BITEXACT_NOTE.md and +LEVER5_HOSTPIPE_RESULTS.md. + +Includes the [L5INSTR] host-timing instrumentation used to measure the lever. 
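The caching pattern this message describes can be sketched outside llama.cpp. This is a minimal illustration under stated assumptions, not the patch's code: `BlockTableCache`, `compute_table`, `simulate_steps`, and the `n_computes` counter are hypothetical names, and the placeholder table contents stand in for the real KV cell layout. Only the control flow (compute once per step, `memcpy` on reuse, invalidate in `apply()`) mirrors the hunk in `llama-kv-cache.cpp`.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for llama_kv_cache_context; names are illustrative only.
struct BlockTableCache {
    std::vector<int32_t> cache;
    uint32_t cached_n_blk = 0;
    bool     valid        = false;
    int      n_computes   = 0;  // counts full recomputes so the reuse is observable

    // Placeholder for the expensive per-call table build (assumed contents).
    void compute_table(int32_t * dst, uint32_t n_seq, uint32_t n_blk) {
        ++n_computes;
        for (uint32_t i = 0; i < n_seq * n_blk; ++i) {
            dst[i] = (int32_t) i;
        }
    }

    // First call in a step computes and stores; later calls copy the stored bytes.
    void get_block_table(int32_t * dst, uint32_t n_seq, uint32_t n_blk) {
        const size_t total = (size_t) n_seq * n_blk;
        if (valid && cached_n_blk == n_blk && cache.size() == total) {
            std::memcpy(dst, cache.data(), total * sizeof(int32_t));
            return;
        }
        compute_table(dst, n_seq, n_blk);
        cache.assign(dst, dst + total);
        cached_n_blk = n_blk;
        valid        = true;
    }

    // The cell layout changes when a ubatch is committed, so drop the cache.
    void apply() { valid = false; }
};

// Simulates n_steps decode steps with n_layers table requests each and
// returns how many times the table was actually recomputed.
int simulate_steps(int n_steps, int n_layers) {
    BlockTableCache c;
    std::vector<int32_t> dst(4 * 8);
    for (int s = 0; s < n_steps; ++s) {
        c.apply();
        for (int l = 0; l < n_layers; ++l) {
            c.get_block_table(dst.data(), 4, 8);
        }
    }
    return c.n_computes;
}
```

In this sketch the number of recomputes equals the number of steps, independent of the layer count, which is the effect the measured host-side times above reflect.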
+ +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + src/llama-context.cpp | 7 +++++++ + src/llama-kv-cache.cpp | 28 +++++++++++++++++++++++++++- + src/llama-kv-cache.h | 9 +++++++++ + src/paged-attn.cpp | 9 +++++++++ + 4 files changed, 52 insertions(+), 1 deletion(-) + +diff --git a/src/llama-context.cpp b/src/llama-context.cpp +index 5c90c48..ad7939e 100644 +--- a/src/llama-context.cpp ++++ b/src/llama-context.cpp +@@ -1306,7 +1306,11 @@ bool llama_context::set_adapter_cvec( + return res; + } + ++extern "C" void l5_add_setinp(double ns); ++extern "C" void l5_add_hostproc(double ns); ++static inline double l5c_now_ns(){ struct timespec ts; clock_gettime(CLOCK_MONOTONIC,&ts); return (double)ts.tv_sec*1e9+(double)ts.tv_nsec; } + llm_graph_result * llama_context::process_ubatch(const llama_ubatch & ubatch, llm_graph_type gtype, llama_memory_context_i * mctx, ggml_status & ret) { ++ double _l5_t0=l5c_now_ns(); + if (mctx && !mctx->apply()) { + LLAMA_LOG_ERROR("%s: failed to apply memory context\n", __func__); + ret = GGML_STATUS_FAILED; +@@ -1361,11 +1365,14 @@ llm_graph_result * llama_context::process_ubatch(const llama_ubatch & ubatch, ll + //const auto t_start_us = ggml_time_us(); + + // FIXME this call causes a crash if any model inputs were not used in the graph and were therefore not allocated ++ double _l5_si=l5c_now_ns(); + res->set_inputs(&ubatch); ++ l5_add_setinp(l5c_now_ns()-_l5_si); + + //LLAMA_LOG_INFO("graph set inputs time: %.3f ms\n", (ggml_time_us() - t_start_us)/1000.0); + } + ++ l5_add_hostproc(l5c_now_ns()-_l5_t0); + const auto status = graph_compute(res->get_gf(), ubatch.n_tokens > 1); + if (status != GGML_STATUS_SUCCESS) { + LLAMA_LOG_ERROR("%s: failed to compute graph, compute status: %d\n", __func__, status); +diff --git a/src/llama-kv-cache.cpp b/src/llama-kv-cache.cpp +index 21b8f1e..17aaf40 100644 +--- a/src/llama-kv-cache.cpp ++++ b/src/llama-kv-cache.cpp +@@ -2772,6 +2772,9 @@ bool 
llama_kv_cache_context::apply() { + kv->apply_ubatch(sinfos[i_cur], ubatches[i_cur]); + n_kv = kv->get_n_kv(sinfos[i_cur]); + ++ // the cells for this ubatch just changed -> drop the cached block table ++ bt_cache_valid = false; ++ + return true; + } + +@@ -2814,7 +2817,30 @@ void llama_kv_cache_context::get_gather_idxs(int32_t * dst) const { + } + + void llama_kv_cache_context::get_block_table(int32_t * dst, uint32_t n_blk) const { +- kv->get_block_table(dst, n_blk, n_kv, sinfos[i_cur]); ++ const auto & sinfo = sinfos[i_cur]; ++ const uint32_t ns = sinfo.s1 - sinfo.s0 + 1; ++ const size_t total = (size_t) ns * n_blk; ++ ++ // within-step reuse: all full-attention layers of a step request the same ++ // table (same i_cur/n_blk, cells fixed since apply()). The bytes are ++ // identical to a fresh compute, so this is bit-exact. ++ static const bool nocache = (getenv("LLAMA_PAGED_NO_BT_CACHE") != nullptr); ++ if (nocache) { ++ kv->get_block_table(dst, n_blk, n_kv, sinfo); ++ return; ++ } ++ ++ if (bt_cache_valid && bt_cache_n_blk == n_blk && bt_cache.size() == total) { ++ memcpy(dst, bt_cache.data(), total * sizeof(int32_t)); ++ return; ++ } ++ ++ kv->get_block_table(dst, n_blk, n_kv, sinfo); ++ ++ bt_cache.resize(total); ++ memcpy(bt_cache.data(), dst, total * sizeof(int32_t)); ++ bt_cache_n_blk = n_blk; ++ bt_cache_valid = true; + } + + ggml_tensor * llama_kv_cache_context::cpy_k(ggml_context * ctx, ggml_tensor * k_cur, ggml_tensor * k_idxs, int32_t il) const { +diff --git a/src/llama-kv-cache.h b/src/llama-kv-cache.h +index e9980b6..b03de78 100644 +--- a/src/llama-kv-cache.h ++++ b/src/llama-kv-cache.h +@@ -451,4 +451,13 @@ private: + // a heuristic, to avoid attending the full cache if it is not yet utilized + // as the cache gets filled, the benefit from this heuristic disappears + int32_t n_kv; ++ ++ // [paged L5] within-step block-table cache. 
get_block_table() is called once ++ // per full-attention layer per decode step, but the cell layout (and hence ++ // the table) is identical across all layers of a step. Compute it on the ++ // first call and reuse the bytes for the rest; invalidated in apply() when ++ // the ubatch's slots are committed (the only host-side mutation per step). ++ mutable std::vector<int32_t> bt_cache; ++ mutable uint32_t bt_cache_n_blk = 0; ++ mutable bool bt_cache_valid = false; + }; +diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp +index fed8ca9..ebd92be 100644 +--- a/src/paged-attn.cpp ++++ b/src/paged-attn.cpp +@@ -8,6 +8,13 @@ + + #include + #include ++#include ++namespace { static inline double l5_now_ns(){ struct timespec ts; clock_gettime(CLOCK_MONOTONIC,&ts); return (double)ts.tv_sec*1e9+(double)ts.tv_nsec; } } ++double g_l5_t_gbt=0, g_l5_t_setinp=0, g_l5_t_hostproc=0; long g_l5_n_gbt=0, g_l5_n_setinp=0, g_l5_n_hostproc=0; ++extern "C" void l5_add_setinp(double ns){ g_l5_t_setinp+=ns; g_l5_n_setinp++; } ++extern "C" void l5_add_hostproc(double ns){ g_l5_t_hostproc+=ns; g_l5_n_hostproc++; } ++namespace { struct L5Printer { ~L5Printer(){ fprintf(stderr,"[L5INSTR] get_block_table n=%ld sum=%.2fms mean=%.4fms | set_inputs n=%ld sum=%.2fms mean=%.4fms | hostproc n=%ld sum=%.2fms mean=%.4fms\n", g_l5_n_gbt, g_l5_t_gbt/1e6, g_l5_n_gbt? g_l5_t_gbt/1e6/g_l5_n_gbt:0.0, g_l5_n_setinp, g_l5_t_setinp/1e6, g_l5_n_setinp? g_l5_t_setinp/1e6/g_l5_n_setinp:0.0, g_l5_n_hostproc, g_l5_t_hostproc/1e6, g_l5_n_hostproc? 
g_l5_t_hostproc/1e6/g_l5_n_hostproc:0.0 ); } } g_l5_printer; } ++ + + namespace paged_attn { + +@@ -54,7 +61,9 @@ public: + void set_input(const llama_ubatch * ubatch) override { + GGML_UNUSED(ubatch); + GGML_ASSERT(idxs && ggml_backend_buffer_is_host(idxs->buffer)); ++ double _t=l5_now_ns(); + mctx->get_block_table((int32_t *) idxs->data, n_blk); ++ g_l5_t_gbt += l5_now_ns()-_t; g_l5_n_gbt++; + } + + const llama_kv_cache_context * mctx; +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0030-fused-op-backend-gate.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0030-fused-op-backend-gate.patch new file mode 100644 index 000000000000..8d3ad8f432da --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0030-fused-op-backend-gate.patch @@ -0,0 +1,106 @@ +From a095f4ebeefafd16dd54c514eb86148fa46daef3 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Sat, 27 Jun 2026 07:30:43 +0000 +Subject: [PATCH] feat(paged): backend-gate fused GDN/discriminated SSM_CONV + emission (patch 0030) + +Closes the latent silent-miscompute hazard (audit RISKY-1). The fused/in-place +Gated Delta Net op (0018/0019/0026: ggml_gated_delta_net_inplace[_ids][_hybrid]) +and the discriminated SSM_CONV decode op (0021/0028: ggml_ssm_conv_update_inplace +[_ids], which REUSE GGML_OP_SSM_CONV / GGML_OP_GATED_DELTA_NET with extra src +slots - a non-null src[3]/src[4] ring/ids discriminator) are emitted DEFAULT-ON +(cparams.fused_gdn_ar/ch=true, auto_fgdn=true) but are implemented for the +CUDA-family TU (CUDA / HIP "ROCm" / "MUSA", hipified ggml-cuda) and the CPU +reference ONLY. + +The hazard: a compute backend that supports PLAIN GGML_OP_SSM_CONV but ignores +the src[3]/src[4] discriminator (Vulkan/SYCL/Metal) reports supports_op==true for +the node and the scheduler assigns the discriminated conv to it; it then runs the +wrong plain conv => SILENT corruption (not a crash). 
The upstream auto_fgdn +device-mismatch resolution only inspects GATED_DELTA_NET nodes, so the +discriminated-SSM_CONV safety was only incidentally covered (it happened to share +backend coverage with the GDN op); it becomes live the moment a non-CUDA paged +build of a gated-DeltaNet model exists. + +FIX: gate the fused-op emission on the active compute backend type. Before the +auto_fgdn resolution in llama_context::sched_reserve(), if any non-CPU compute +backend is not CUDA-family (reg name != "CUDA"/"ROCm"/"MUSA"), force +fused_gdn_ar = fused_gdn_ch = auto_fgdn = false. Every emission site keys off +these flags (conv_decode_fused = ... && fused_gdn_ar; fused = ... fused_gdn_ar/ch), +so disabling them routes the graph to the upstream non-fused path: a PLAIN +ggml_ssm_conv (no discriminator) + ggml_silu, which every backend handles +correctly. This makes the discriminated-op safety explicit and decoupled from the +GDN-op device-mismatch heuristic. + +INVARIANT (CUDA byte-identical): on a CUDA backend the reg name is "CUDA", so +fgdn_backend_ok stays true, the flags are left untouched, and the emitted decode +graph is unchanged - byte-identical to pre-0030. The fix only changes behavior on +non-CUDA/non-CPU backends. + +GATE compile: CPU-only build (GGML_CUDA=OFF) of the full series (pin 9d5d882d + +0001-0029 + this) links libllama.so and test-backend-ops with 0 errors; the +edited llama-context.cpp compiles clean (uses only already-included + +backend-reg API already used in this TU). test-backend-ops correctness for +SSM_CONV / SSM_CONV_UPDATE / SSM_CONV_UPDATE_IDS / GATED_DELTA_NET is a +CUDA0-vs-CPU comparison (CPU-only run skips CPU-vs-CPU); the test cases are +registered and exercised on the CUDA DGX run. 
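The gate's decision rule can be shown in isolation. The following is a hedged sketch, not the patch itself: `BackendInfo` and `fused_ops_allowed` are hypothetical names, and the registry name and CPU flag are passed in directly instead of being queried through the ggml backend-registry API. It only demonstrates the rule stated above: skip CPU backends, and keep the fused path only if every other compute backend reports "CUDA", "ROCm", or "MUSA".

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical flattened view of one compute backend (illustrative only).
struct BackendInfo {
    const char * reg_name;  // registry name, may be null
    bool         is_cpu;    // true for the CPU reference backend
};

// Returns true when the fused/discriminated ops may stay enabled:
// every non-CPU backend must belong to the CUDA family.
bool fused_ops_allowed(const std::vector<BackendInfo> & backends) {
    for (const auto & b : backends) {
        if (b.is_cpu) {
            continue;  // the CPU reference implements the discriminated ops
        }
        const char * name = b.reg_name ? b.reg_name : "";
        if (std::strcmp(name, "CUDA") != 0 &&
            std::strcmp(name, "ROCm") != 0 &&
            std::strcmp(name, "MUSA") != 0) {
            return false;  // e.g. Vulkan/SYCL/Metal: fall back to the plain path
        }
    }
    return true;
}
```

A caller in this sketch would clear the three flags (`fused_gdn_ar`, `fused_gdn_ch`, `auto_fgdn`) when the function returns false, which routes the graph to the plain `ggml_ssm_conv` + `ggml_silu` path that every backend handles.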
+ +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + src/llama-context.cpp | 39 +++++++++++++++++++++++++++++++++++++++ + 1 file changed, 39 insertions(+) + +diff --git a/src/llama-context.cpp b/src/llama-context.cpp +index ad7939e..c408eef 100644 +--- a/src/llama-context.cpp ++++ b/src/llama-context.cpp +@@ -521,6 +521,45 @@ void llama_context::sched_reserve() { + cparams.auto_fa = false; + } + ++ // RISKY-1 guard: the fused/in-place Gated Delta Net op and the discriminated ++ // SSM_CONV (which reuse GGML_OP_GATED_DELTA_NET / GGML_OP_SSM_CONV with extra ++ // src slots - a non-null src[3]/src[4] ring/ids discriminator) are only ++ // implemented for the CUDA-family backends (CUDA / HIP "ROCm" / "MUSA" - all ++ // built from the hipified ggml-cuda TU) and the CPU reference. Any other ++ // compute backend (Vulkan/SYCL/Metal/...) that supports *plain* SSM_CONV but ++ // ignores the discriminator src would silently run the WRONG conv. The ++ // upstream auto_fgdn device-mismatch check below only inspects ++ // GATED_DELTA_NET nodes, so couple the discriminated-SSM_CONV safety ++ // explicitly to the backend type here: keep the fused path enabled only when ++ // every non-CPU compute backend is CUDA-family. On CUDA this leaves the flags ++ // untouched, so the emitted decode graph is byte-identical. ++ if (cparams.fused_gdn_ar || cparams.fused_gdn_ch) { ++ bool fgdn_backend_ok = true; ++ for (auto & backend : backends) { ++ ggml_backend_dev_t dev = ggml_backend_get_device(backend.get()); ++ if (!dev || ggml_backend_dev_type(dev) == GGML_BACKEND_DEVICE_TYPE_CPU) { ++ // CPU reference handles the fused/discriminated ops ++ continue; ++ } ++ ggml_backend_reg_t reg = ggml_backend_dev_backend_reg(dev); ++ const char * name = reg ? ggml_backend_reg_name(reg) : ""; ++ // GGML_CUDA_NAME is "CUDA" / "ROCm" (HIP) / "MUSA"; all three are the ++ // same ggml-cuda TU that carries the discriminated-op handling. 
++ if (strcmp(name, "CUDA") != 0 && strcmp(name, "ROCm") != 0 && strcmp(name, "MUSA") != 0) { ++ fgdn_backend_ok = false; ++ break; ++ } ++ } ++ ++ if (!fgdn_backend_ok) { ++ cparams.fused_gdn_ar = false; ++ cparams.fused_gdn_ch = false; ++ cparams.auto_fgdn = false; ++ LLAMA_LOG_INFO("%s: fused Gated Delta Net / discriminated SSM_CONV disabled " ++ "(compute backend is not CUDA/HIP/CPU)\n", __func__); ++ } ++ } ++ + if (cparams.auto_fgdn) { + LLAMA_LOG_INFO("%s: resolving fused Gated Delta Net support:\n", __func__); + +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0031-paged-chunked-gdn-prefill-scan-kernel.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0031-paged-chunked-gdn-prefill-scan-kernel.patch new file mode 100644 index 000000000000..777fb5fda2aa --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0031-paged-chunked-gdn-prefill-scan-kernel.patch @@ -0,0 +1,357 @@ +From 37549ecce806130b36012dfd0077ad830989ec71 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Sun, 28 Jun 2026 19:30:01 +0000 +Subject: [PATCH] feat(paged): chunked parallel-scan GDN prefill kernel (patch + 0031) + +Implements the explicit upstream TODO at gated_delta_net.cu's +launch_gated_delta_net ("Add chunked kernel for even faster pre-fill"). The +stock kernel runs a strictly sequential per-token recurrence (one block per +(head,seq) looping over all n_tokens), so prefill cannot use token-level +parallelism - a confirmed gap versus vLLM, which uses an FLA-style chunked +scan. + +What this adds +-------------- +A chunked parallel-scan prefill path for gated DeltaNet, gated to the +compile-time subset that matters for Qwen3.6 prefill: non-KDA (scalar gate), +f32 state, final-state-only (keep_rs == false), homogeneous (non-hybrid, +non-bf16-state). One block per (head,seq); thread j owns the j-th v-column. 
+The sequence is split into chunks of C tokens: the inter-chunk recurrence in +the state S stays sequential (n_tokens/C steps instead of n_tokens), while the +intra-chunk gated delta rule is solved in parallel via the FLA chunked form: + + gamma_t = prod_{i<=t} g_i (<=1), d(j,t) = gamma_t / gamma_j in (0,1] + A = I + tril(beta_t d(j,t) (k_t . k_j), -1) [unit lower-tri, C x C] + U = A^{-1} ( beta_t (v_t - gamma_t S0^T k_t) ) (forward substitution) + O_t = gamma_t (S0^T q_t) + sum_{j<=t} d(j,t)(q_t . k_j) u_j (then * scale) + S_C = gamma_C S0 + sum_t d(t,C) k_t u_t^T + +This uses the bounded/stable de-gating (pairwise decays d <= 1, gamma <= 1), so +strong-decay tokens underflow to the correct zero rather than to inf - it is +numerically robust even for the adversarial g in [-20, -1e-4] of the op test. + +Bit-exactness (NEW per-path) +---------------------------- +The chunked form is mathematically equivalent to the sequential recurrence but +reduces in a different FP order, so it is a NEW path (its md5 will not match the +sequential path), gated exactly like the paged-vs-nonpaged precedent. A numpy +prototype confirms f32 chunked-vs-sequential NMSE ~1e-13 (max abs ~1e-7). +test-backend-ops GATED_DELTA_NET is 91/91 (this patch adds 8 S_v=128 prefill +cases: exact-multiple / tail / multi-seq / GQA / permuted), i.e. within the +default 1e-7 NMSE gate versus the CPU reference. + +Disposition: OPT-IN, default OFF (no regression) +------------------------------------------------ +GB10's max dynamic shared-memory opt-in is 99KB, so the all-shared layout that +keeps the 128x128 state resident forces C=16 (89KB). At C=16, with one block / +SM (the 64KB state dominates shared) and serial per-thread dk-reductions, the +kernel is correct but NOT yet faster than the already-tuned sequential +recurrence: measured S_PP on q36-27b-nvfp4 (llama-batched-bench -npp 512 -ntg 4 +-npl 32) is ~761 t/s chunked vs ~971 t/s sequential (~22% slower, also +grid-starved at low n_seqs). 
It is therefore wired OPT-IN: the default +(no env) keeps the sequential path, and the chunked path is enabled with +GDN_CHUNK_MIN=. The default backend behaviour is unchanged. + +cudaFuncSetAttribute's return is checked (a silent failure when the requested +dynamic smem exceeded the device opt-in left a sticky CUDA error during +bring-up). + +Remaining work to make it a win (recorded for the follow-up): break the 1 +block/SM occupancy ceiling (the 64KB state in shared) and the serial +dk-reductions - either register-resident state with static-unrolled (larger) +chunks, or tensor-core (mma/wgmma) matmuls for the KK/QK/KS/QS/PU products and +the A-inverse, which is what FLA/vLLM use to beat the sequential scan. See +README section 5 (dev notes / rejected-flat levers). + +Assisted-by: Claude:opus-4.8 [Claude Code] +--- + ggml/src/ggml-cuda/gated_delta_net.cu | 237 ++++++++++++++++++++++++++ + tests/test-backend-ops.cpp | 8 + + 2 files changed, 245 insertions(+) + +diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu +index d071d5a..7121d80 100644 +--- a/ggml/src/ggml-cuda/gated_delta_net.cu ++++ b/ggml/src/ggml-cuda/gated_delta_net.cu +@@ -1,7 +1,10 @@ + #include "gated_delta_net.cuh" + #include "ggml-cuda/common.cuh" + ++#include + #include ++#include ++#include + + // Step 2: gather only the NON-identity sequences' prior recurrent state from the full cache into a + // disjoint scratch buffer. Identity sequences (ids[s] == rs_head + s) are read in place from the +@@ -279,6 +282,219 @@ static void launch_gdn_variant( + sb1, sb2, sb3, neqk1_magic, rq3_magic, scale, K, state_dst_d, ids_d, rs_head); + } + ++// ============================================================================ ++// CHUNKED parallel-scan prefill kernel (upstream TODO: "faster pre-fill"). ++// Scope: non-KDA (scalar gate), f32 state, final-state-only (keep_rs==false), ++// homogeneous (non-hybrid) path. 
One block per (head, seq); thread j owns the ++// j-th v-column. The sequence is split into chunks of C tokens; the inter-chunk ++// recurrence in S is sequential (n_tokens/C steps instead of n_tokens), and the ++// intra-chunk gated delta rule is solved in parallel via the FLA chunked form: ++// gamma_t = prod_{i<=t} g_i (<=1), d(j,t) = gamma_t / gamma_j in (0,1] ++// A = I + tril(beta_t d(j,t) (k_t . k_j), -1) [Cc x Cc unit lower-tri] ++// U = A^{-1} ( beta_t (v_t - gamma_t S0^T k_t) ) [Cc x dv] (fwd subst) ++// O_t = gamma_t (S0^T q_t) + sum_{j<=t} d(j,t)(q_t . k_j) u_j (then * scale) ++// S_C = gamma_C S0 + sum_t d(t,C) k_t u_t^T ++// This is the bounded/stable de-gating (pairwise decays d <= 1, gamma <= 1), so ++// strong-decay tokens underflow to the correct zero rather than to inf. The math ++// is equivalent to the sequential recurrence up to FP reduction order (a NEW ++// per-path result, validated benign by test-backend-ops NMSE and greedy output). ++template ++__global__ void gated_delta_net_chunked_cuda( ++ const float * __restrict__ q, const float * __restrict__ k, ++ const float * __restrict__ v, const float * __restrict__ g, ++ const float * __restrict__ beta, const float * __restrict__ curr_state, ++ float * __restrict__ dst, ++ int64_t H, int64_t n_tokens, int64_t n_seqs, ++ int64_t sq1, int64_t sq2, int64_t sq3, ++ int64_t sv1, int64_t sv2, int64_t sv3, ++ int64_t sb1, int64_t sb2, int64_t sb3, ++ uint3 neqk1_magic, uint3 rq3_magic, ++ float scale, float * __restrict__ state_dst, ++ const int32_t * __restrict__ ids, int rs_head) { ++ constexpr int dk = S_v; ++ constexpr int dv = S_v; ++ const int h_idx = blockIdx.x; ++ const int seq = blockIdx.y; ++ const int j = threadIdx.x; // this thread's v-column (0..dv-1) ++ ++ const uint32_t iq1 = fastmodulo((uint32_t) h_idx, neqk1_magic); ++ const uint32_t iq3 = fastdiv((uint32_t) seq, rq3_magic); ++ ++ extern __shared__ float gdn_smem[]; ++ float * Sd = gdn_smem; // [dk*dv] M-layout: Sd[col*dk + i] = 
S[i][col] ++ float * Kc = Sd + (size_t) dk * dv; // [C*dk] Kc[t*dk + i] ++ float * Qc = Kc + (size_t) C * dk; // [C*dk] Qc[t*dk + i] ++ float * Ud = Qc + (size_t) C * dk; // [dv*C] column-major per thread: Ud[col*C + t] ++ float * Amat = Ud + (size_t) dv * C; // [C*C] A / P scratch, row-major Amat[t*C + t'] ++ float * csh = Amat + (size_t) C * C; // [C] cumsum(log-gate) ++ float * gam = csh + C; // [C] gamma_t = exp(cs_t) ++ float * bet = gam + C; // [C] beta_t ++ ++ // S0: thread j owns column j (Sd[j*dk + i]); load is a contiguous per-thread copy from the ++ // M-layout cache view (read_state[j*dk + i] = M[j*S_v + i] = S[i][j]). Same identity/gather ++ // plumbing as the sequential kernel (gather of non-identity seqs done by the dispatcher). ++ const bool identity = (ids != nullptr && ids[seq] == rs_head + seq); ++ const float * read_state = (identity ? state_dst : curr_state) ++ + (int64_t) seq * H * dk * dv + (int64_t) h_idx * dk * dv; ++ for (int i = 0; i < dk; i++) { ++ Sd[j * dk + i] = read_state[j * dk + i]; ++ } ++ ++ const float * q_base = q + iq3 * sq3 + iq1 * sq1; // + t*sq2 + i ++ const float * k_base = k + iq3 * sq3 + iq1 * sq1; ++ const float * v_base = v + seq * sv3 + h_idx * sv1; // + t*sv2 + j ++ const int64_t gb_base = seq * sb3 + h_idx * sb1; // + t*sb2 ++ ++ float * attn_base = dst + (int64_t) (seq * n_tokens * H + h_idx) * S_v; // + tok*S_v*H + j ++ ++ for (int64_t c0 = 0; c0 < n_tokens; c0 += C) { ++ const int Cc = (int) ((n_tokens - c0) < (int64_t) C ? 
(n_tokens - c0) : (int64_t) C); ++ ++ // --- load chunk K,Q (cooperative), beta and the gate prefix (cs, gamma) --- ++ for (int e = j; e < Cc * dk; e += dv) { ++ const int t = e / dk; ++ const int i = e % dk; ++ Kc[t * dk + i] = k_base[(c0 + t) * sq2 + i]; ++ Qc[t * dk + i] = q_base[(c0 + t) * sq2 + i]; ++ } ++ if (j < Cc) { ++ csh[j] = g[gb_base + (c0 + j) * sb2]; // raw log-gate, prefix-summed below ++ bet[j] = beta[gb_base + (c0 + j) * sb2]; ++ } ++ __syncthreads(); ++ if (j == 0) { ++ float run = 0.0f; ++ for (int t = 0; t < Cc; t++) { ++ run += csh[t]; ++ csh[t] = run; // cs_t = sum_{i<=t} g_i (<= 0) ++ gam[t] = expf(run); // gamma_t (<= 1) ++ } ++ } ++ __syncthreads(); ++ ++ // --- A = I + tril(beta_t * d(t',t) * (k_t . k_t'), -1) (cooperative over C*C) --- ++ for (int e = j; e < Cc * Cc; e += dv) { ++ const int t = e / Cc; ++ const int tp = e % Cc; ++ float a = 0.0f; ++ if (tp < t) { ++ float kk = 0.0f; ++ for (int i = 0; i < dk; i++) { ++ kk += Kc[t * dk + i] * Kc[tp * dk + i]; ++ } ++ const float dd = expf(csh[t] - csh[tp]); // d(tp,t) = gamma_t/gamma_tp ++ a = bet[t] * dd * kk; ++ } else if (tp == t) { ++ a = 1.0f; ++ } ++ Amat[t * Cc + tp] = a; ++ } ++ __syncthreads(); ++ ++ // --- RHS[t][j] = beta_t (v_t[j] - gamma_t * (S0^T k_t)[j]) -> Ud[j*C + t] --- ++ for (int t = 0; t < Cc; t++) { ++ float ks = 0.0f; // (S0^T k_t)[j] = sum_i S[i][j] k_t[i] ++ for (int i = 0; i < dk; i++) { ++ ks += Sd[j * dk + i] * Kc[t * dk + i]; ++ } ++ const float vtj = v_base[(c0 + t) * sv2 + j]; ++ Ud[j * C + t] = bet[t] * (vtj - gam[t] * ks); ++ } ++ ++ // --- solve A U = RHS in place (unit lower-tri fwd subst); per-thread, no inter-step sync --- ++ for (int t = 1; t < Cc; t++) { ++ float acc = Ud[j * C + t]; ++ for (int tp = 0; tp < t; tp++) { ++ acc -= Amat[t * Cc + tp] * Ud[j * C + tp]; ++ } ++ Ud[j * C + t] = acc; ++ } ++ __syncthreads(); // U finalized; Amat free for P below (and Ud read across-thread? no, own col) ++ ++ // --- P[t][t'] = d(t',t) * (q_t . 
k_t') for t' <= t (reuse Amat) --- ++ for (int e = j; e < Cc * Cc; e += dv) { ++ const int t = e / Cc; ++ const int tp = e % Cc; ++ float p = 0.0f; ++ if (tp <= t) { ++ float qk = 0.0f; ++ for (int i = 0; i < dk; i++) { ++ qk += Qc[t * dk + i] * Kc[tp * dk + i]; ++ } ++ const float dd = expf(csh[t] - csh[tp]); ++ p = dd * qk; ++ } ++ Amat[t * Cc + tp] = p; ++ } ++ __syncthreads(); ++ ++ // --- O[t][j] = gamma_t (S0^T q_t)[j] + sum_{t'<=t} P[t][t'] U[t'][j] (* scale) --- ++ for (int t = 0; t < Cc; t++) { ++ float qs = 0.0f; // (S0^T q_t)[j] (uses pre-update S) ++ for (int i = 0; i < dk; i++) { ++ qs += Sd[j * dk + i] * Qc[t * dk + i]; ++ } ++ float o = gam[t] * qs; ++ for (int tp = 0; tp <= t; tp++) { ++ o += Amat[t * Cc + tp] * Ud[j * C + tp]; ++ } ++ attn_base[(c0 + t) * S_v * H + j] = o * scale; ++ } ++ ++ // --- S_C[i][j] = gamma_{C-1} S[i][j] + sum_t d(t,C-1) k_t[i] u_t[j] --- ++ const float glast = gam[Cc - 1]; ++ const float cslast = csh[Cc - 1]; ++ for (int i = 0; i < dk; i++) { ++ float s = glast * Sd[j * dk + i]; ++ for (int t = 0; t < Cc; t++) { ++ const float dd = expf(cslast - csh[t]); // d(t, last) ++ s += dd * Kc[t * dk + i] * Ud[j * C + t]; ++ } ++ Sd[j * dk + i] = s; ++ } ++ __syncthreads(); // Sd reused as S0 of next chunk; Kc/Qc/Amat reloaded next chunk ++ } ++ ++ // --- final-state write-back (M-layout): in-place cache view or f32 op-output scratch --- ++ const int64_t state_out_offset = (int64_t) (seq * H + h_idx) * S_v * S_v; ++ const int64_t attn_score_elems = (int64_t) S_v * H * n_tokens * n_seqs; ++ float * st = (state_dst != nullptr) ? 
(state_dst + state_out_offset) ++ : (dst + attn_score_elems + state_out_offset); ++ for (int i = 0; i < dk; i++) { ++ st[j * dk + i] = Sd[j * dk + i]; ++ } ++} ++ ++template ++static void launch_gdn_chunked( ++ const float * q_d, const float * k_d, const float * v_d, ++ const float * g_d, const float * b_d, const float * s_d, ++ float * dst_d, float * state_dst_d, const int32_t * ids_d, int rs_head, ++ int64_t H, int64_t n_tokens, int64_t n_seqs, ++ int64_t sq1, int64_t sq2, int64_t sq3, ++ int64_t sv1, int64_t sv2, int64_t sv3, ++ int64_t sb1, int64_t sb2, int64_t sb3, ++ const uint3 neqk1_magic, const uint3 rq3_magic, ++ float scale, cudaStream_t stream) { ++ const size_t smem = ((size_t) S_v * S_v + (size_t) 2 * C * S_v + (size_t) S_v * C ++ + (size_t) C * C + (size_t) 3 * C) * sizeof(float); ++ static bool attr_set = false; ++ if (!attr_set) { ++ const cudaError_t e = cudaFuncSetAttribute(gated_delta_net_chunked_cuda, ++ cudaFuncAttributeMaxDynamicSharedMemorySize, (int) smem); ++ if (e != cudaSuccess) { ++ GGML_ABORT("gdn chunked: cudaFuncSetAttribute(maxDynSmem=%zu) failed: %s\n", smem, cudaGetErrorString(e)); ++ } ++ attr_set = true; ++ } ++ dim3 grid_dims(H, n_seqs, 1); ++ dim3 block_dims(S_v, 1, 1); ++ gated_delta_net_chunked_cuda<<>>( ++ q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, n_tokens, n_seqs, ++ sq1, sq2, sq3, sv1, sv2, sv3, sb1, sb2, sb3, ++ neqk1_magic, rq3_magic, scale, state_dst_d, ids_d, rs_head); ++} ++ + template + static void launch_gated_delta_net( + const float * q_d, const float * k_d, const float * v_d, +@@ -297,6 +513,27 @@ static void launch_gated_delta_net( + const uint3 neqk1_magic = init_fastdiv_values(neqk1); + const uint3 rq3_magic = init_fastdiv_values(rq3); + ++ // Chunked parallel-scan prefill path (upstream TODO at this site). Compile-time subset: ++ // non-KDA scalar gate, f32 state, final-state-only, homogeneous. 
Gated at runtime on the GDN ++ // head dim (S_v==128) and a prefill token threshold; decode (n_tokens small) keeps the tuned ++ // sequential recurrence. Mathematically equivalent up to FP reduction order (NEW per-path md5; ++ // validated benign by test-backend-ops NMSE + greedy output). Toggle: GDN_CHUNK_OFF / GDN_CHUNK_MIN. ++ if constexpr (!KDA && !keep_rs_t) { ++ // OPT-IN: this chunked path is bit-exact-benign (test-backend-ops green) but, at C=16 ++ // (forced by GB10 99KB dyn-smem opt-in, all-shared), it is NOT yet faster than the tuned ++ // sequential recurrence on this model (measured ~22%% slower S_PP, grid-starved at low ++ // n_seqs + 1 block/SM occupancy). Default OFF so the backend default is regression-free; ++ // enable for experiments / tuning with GDN_CHUNK_MIN=. See README section 5 (dev notes / rejected-flat levers). ++ static const int gdn_chunk_min = []{ const char * e = getenv("GDN_CHUNK_MIN"); return e ? atoi(e) : INT_MAX; }(); ++ if (S_v == 128 && n_tokens >= gdn_chunk_min) { ++ launch_gdn_chunked<128, 16>( ++ q_d, k_d, v_d, g_d, b_d, (const float *) s_d, dst_d, (float *) state_dst_d, ids_d, rs_head, ++ H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, sb1, sb2, sb3, ++ neqk1_magic, rq3_magic, scale, stream); ++ return; ++ } ++ } ++ + #define GDN_LAUNCH_ARGS \ + q_d, k_d, v_d, g_d, b_d, s_d, dst_d, state_dst_d, ids_d, rs_head, \ + H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, sb1, sb2, sb3, \ +diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp +index ac30e47..4e40d23 100644 +--- a/tests/test-backend-ops.cpp ++++ b/tests/test-backend-ops.cpp +@@ -9398,6 +9398,14 @@ static std::vector> make_test_cases_eval() { + test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 100, 1)); + test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 200, 1)); + test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 127, 2)); ++ // chunked parallel-scan prefill path (S_v==128, 
n_tokens>=64): exact-multiple, tail, multi-seq, perm ++ test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 128, 64, 1)); ++ test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 128, 128, 1)); ++ test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 128, 127, 1)); ++ test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 128, 256, 1)); ++ test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 128, 100, 2)); ++ test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 2, 128, 200, 3)); ++ test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 128, 130, 1, 1, true)); + test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 64, 1, 1, false, true)); + test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 33, 1, 1, false, true)); + test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 4, 64, 100, 1, 1, false, true)); +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0033-fp4-prefill-large-m-bf16-cublas-scaffold.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0033-fp4-prefill-large-m-bf16-cublas-scaffold.patch new file mode 100644 index 000000000000..30060d30a72d --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0033-fp4-prefill-large-m-bf16-cublas-scaffold.patch @@ -0,0 +1,174 @@ +From 0033003300330033003300330033003300330033 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Sun, 28 Jun 2026 19:35:00 +0200 +Subject: [PATCH] feat(paged): FP4 prefill large-M dequant->bf16 cuBLAS scaffold + (default-off, rejected on GB10) (patch 0033) + +Option (a) of docs/PREFILL_GEMM_SCOPE.md: route large-M (prefill) NVFP4 dense +weight GEMMs OFF the decode-tuned FP4-MMQ kernel and through the dequant->bf16 +cuBLAS (nvjet) tensor-core path. 
This lands the validated, bit-exact-gated +mechanism and records the honest result: on GB10 (sm_121) the lever is a +REGRESSION, so it is kept default-OFF (byte-identical to stock), mirroring the +patch-0017 default-off discipline. + +Mechanism (all three edits are the integration scaffold, no new kernel): + - ggml/src/ggml-cuda/mmq.cu (ggml_cuda_should_use_mmq): NVFP4 + Blackwell + + dense (n_experts==0) + M > LLAMA_FP4_PREFILL_M returns false, so the dense + dispatch falls through to ggml_cuda_op_mul_mat_cublas. -D / env + LLAMA_FP4_PREFILL_M tunable; default 0 == disabled == stock. Decode and + small batches (M <= threshold) stay on FP4-MMQ. + - ggml/src/ggml-cuda/ggml-cuda.cu (ggml_cuda_op_mul_mat_cublas): new NVFP4 + branch dequants the FP4 weights to a TRANSIENT bf16 pool buffer (not cached, + so the model stays FP4-resident) and runs cublasGemmEx CUDA_R_16BF / + COMPUTE_32F (tensor cores) instead of the f32 cublasSgemm fallback (no + tensor cores) that NVFP4 would otherwise hit. + - ggml/src/ggml-cuda/convert.cu (ggml_get_to_bf16_cuda): add the NVFP4 case + (the dequant kernel is dst-type generic; bf16 preserves the model's native + activation range vs f16). nullptr-by-default for other types is unchanged. + +Bit-exact / numeric gate (PASS, divergence benign): + - test-backend-ops MUL_MAT 1146/1146, MUL_MAT_ID 806/806 at default; and with + the path FORCED (LLAMA_FP4_PREFILL_M=64) the NVFP4 large-M cases are green + CUDA-vs-CPU (the bf16 path is numerically within the project tolerance). + - greedy md5 (q36-27b dense, "The capital of France is", -n 48, temp 0): + lever == base == 5951a5b4d624ce891e22ab5fca9bc439 (the documented dense + reference) for short prefill (decode byte-untouched), AND identical for a + >threshold prefill that exercises the new bf16 path (5f3967df...): the new + FP path does not flip a single greedy argmax. 
As predicted by the scope, + bf16 activations are strictly more precise than the FP4-MMQ Q8_1 path, so + this is precision-neutral-to-better, not a regression. + +Honest performance result (S_PP t/s, q36-27b dense, llama-batched-bench +-fa on -ngl 99, A/B via env), see docs/PREFILL_GEMM_RESULTS.md: + -npp 512 -npl 32 : base(MMQ) 958.99 -> lever 486.65 (-49%) + -npp 1024 -npl 8 : base(MMQ)1013.65 -> lever 587.27 (-42%) + -npp 2048 -npl 8 : base(MMQ) 918.46 -> lever 649.42 (-29%) +The scope premise (FP4-MMQ ~3% of FP4 peak at large M) is FALSE on GB10: +FP4-MMQ at M=512..2048 beats dequant->bf16 cuBLAS, because bf16 tensor-core peak +is ~half FP4 peak AND the per-step weight dequant + 4x bf16 weight traffic +(~8x total vs the FP4 read) dominate, only partially amortizing as M grows +(gap shrinks 49%->29%, never crosses). Default-off keeps stock S_PP (966.98, +within noise of base). + +Phase 2 (MoE grouped large-M) is NOT implemented: it inherits the same +bf16-peak < FP4-peak ceiling plus a per-expert dequant, so grouped bf16-cuBLAS +would regress for the same reason. The only route to a real prefill GEMM win is +option (b) - a native FP4-MMA large-M kernel (multi-week). This patch is the +validated, env-gated scaffold that option (b) / non-GB10 hardware can reuse for +the M-threshold routing + bit-exact gate. + +Assisted-by: Claude:opus-4.8 [Claude Code] +--- +diff --git a/ggml/src/ggml-cuda/convert.cu b/ggml/src/ggml-cuda/convert.cu +index 61630a3..f0273c1 100644 +--- a/ggml/src/ggml-cuda/convert.cu ++++ b/ggml/src/ggml-cuda/convert.cu +@@ -704,6 +704,15 @@ to_bf16_cuda_t ggml_get_to_bf16_cuda(ggml_type type) { + return convert_unary_cont_cuda; + case GGML_TYPE_F16: + return convert_unary_cont_cuda; ++ // Paged prefill lever (patch 0033): NVFP4 -> bf16 dequant for the large-M ++ // dequant->bf16 cuBLAS (nvjet) prefill GEMM path in ++ // ggml_cuda_op_mul_mat_cublas. 
The dequant kernel is dst-type generic, so ++ // this instantiates the bf16 variant; bf16 (not f16) preserves the model's ++ // native bf16 activation range and avoids f16 overflow on large prefill ++ // activations. Only the new prefill path consumes this; nullptr-by-default ++ // for all other types is unchanged. ++ case GGML_TYPE_NVFP4: ++ return dequantize_row_nvfp4_cuda; + default: + return nullptr; + } +diff --git a/ggml/src/ggml-cuda/mmq.cu b/ggml/src/ggml-cuda/mmq.cu +index 9933fa6..2dcaaab 100644 +--- a/ggml/src/ggml-cuda/mmq.cu ++++ b/ggml/src/ggml-cuda/mmq.cu +@@ -321,6 +321,33 @@ bool ggml_cuda_should_use_mmq(enum ggml_type type, int cc, int64_t ne11, int64_t + return false; + } + ++ // Paged prefill lever (patch 0033): OPTION-(a) route large-M NVFP4 dense GEMMs ++ // OFF the FP4-MMQ kernel and through the dequant->bf16 cuBLAS (nvjet) ++ // tensor-core path (ggml_cuda_op_mul_mat_cublas, NVFP4 bf16 branch). The ++ // scope premise was that FP4-MMQ is register-bound to ~3% of FP4 peak at ++ // large M. MEASURED ON GB10 THIS IS FALSE: FP4-MMQ at M=512..2048 beats ++ // dequant->bf16 cuBLAS by 29-49% (S_PP A/B in docs/PREFILL_GEMM_RESULTS.md), ++ // because bf16 tensor-core peak is ~half FP4 peak AND the per-step weight ++ // dequant + 4x bf16 weight traffic (~8x total vs the FP4 read) dominate and ++ // only partially amortize as M grows. The path is NUMERICALLY VALID and ++ // benign (greedy md5 byte-identical to FP4-MMQ; test-backend-ops passes), so ++ // it is kept as a validated, env-gated scaffold (for option-(b) native FP4 ++ // large-M kernels and non-GB10 hardware), but DEFAULT-DISABLED (== stock). ++ // Set -D LLAMA_FP4_PREFILL_M= or env LLAMA_FP4_PREFILL_M= to A/B it; ++ // 0 (default) disables. Dense only (n_experts == 0). 
++#ifndef LLAMA_FP4_PREFILL_M ++#define LLAMA_FP4_PREFILL_M 0 ++#endif // LLAMA_FP4_PREFILL_M ++ if (type == GGML_TYPE_NVFP4 && n_experts == 0 && blackwell_mma_available(cc)) { ++ static const int64_t fp4_prefill_m = [] { ++ const char * e = getenv("LLAMA_FP4_PREFILL_M"); ++ return e != nullptr ? (int64_t) atoll(e) : (int64_t) LLAMA_FP4_PREFILL_M; ++ }(); ++ if (fp4_prefill_m > 0 && ne11 > fp4_prefill_m) { ++ return false; ++ } ++ } ++ + if (turing_mma_available(cc)) { + return true; + } +diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu +index 0dad6e1..6476d46 100644 +--- a/ggml/src/ggml-cuda/ggml-cuda.cu ++++ b/ggml/src/ggml-cuda/ggml-cuda.cu +@@ -1660,7 +1660,47 @@ static void ggml_cuda_op_mul_mat_cublas( + row_diff == src0->ne[1] && + dst->op_params[0] == GGML_PREC_DEFAULT; + +- if (supports_bf16 && src0->type == GGML_TYPE_BF16 && ggml_is_contiguous(src0) && row_diff == src0->ne[1]) { ++ if (supports_bf16 && src0->type == GGML_TYPE_NVFP4 && ggml_is_contiguous(src0) && row_diff == src0->ne[1]) { ++ // Paged prefill lever (patch 0033): NVFP4 only reaches cuBLAS when ++ // ggml_cuda_should_use_mmq() returned false (large-M dense prefill). ++ // Dequant the FP4 weights to a TRANSIENT bf16 pool buffer and run a ++ // tensor-core bf16 GEMM (nvjet) instead of the f32 cublasSgemm fallback ++ // (no tensor cores) that the final else-branch would otherwise use. The ++ // weights are NOT cached as bf16 (pool scratch, freed at step end) so the ++ // model stays FP4-resident and the backend keeps its memory advantage. 
++ ggml_cuda_pool_alloc src0_as_bf16(ctx.pool(id), row_diff*ne00); ++ const to_bf16_cuda_t to_bf16_cuda_src0 = ggml_get_to_bf16_cuda(GGML_TYPE_NVFP4); ++ GGML_ASSERT(to_bf16_cuda_src0 != nullptr); ++ to_bf16_cuda_src0(src0_dd_i, src0_as_bf16.get(), row_diff*ne00, stream); ++ ++ ggml_cuda_pool_alloc src1_as_bf16(ctx.pool(id)); ++ if (src1->type != GGML_TYPE_BF16) { ++ const to_bf16_cuda_t to_bf16_cuda = ggml_get_to_bf16_cuda(src1->type); ++ GGML_ASSERT(to_bf16_cuda != nullptr); ++ size_t ne = src1_ncols*ne10; ++ src1_as_bf16.alloc(ne); ++ to_bf16_cuda(src1_ddf_i, src1_as_bf16.get(), ne, stream); ++ } ++ const nv_bfloat16 * src1_ptr = src1->type == GGML_TYPE_BF16 ? (const nv_bfloat16 *) src1_ddf_i : src1_as_bf16.get(); ++ const nv_bfloat16 * src0_ptr = src0_as_bf16.get(); ++ ggml_cuda_pool_alloc dst_bf16(ctx.pool(id), row_diff*src1_ncols); ++ ++ const float alpha_f32 = 1.0f; ++ const float beta_f32 = 0.0f; ++ ++ CUBLAS_CHECK(cublasSetStream(ctx.cublas_handle(id), stream)); ++ CUBLAS_CHECK( ++ cublasGemmEx(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, ++ row_diff, src1_ncols, ne10, ++ &alpha_f32, src0_ptr, CUDA_R_16BF, ne00, ++ src1_ptr, CUDA_R_16BF, ne10, ++ &beta_f32, dst_bf16.get(), CUDA_R_16BF, ldc, ++ CUBLAS_COMPUTE_32F, ++ CUBLAS_GEMM_DEFAULT_TENSOR_OP)); ++ ++ const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(GGML_TYPE_BF16); ++ to_fp32_cuda(dst_bf16.get(), dst_dd_i, row_diff*src1_ncols, stream); ++ } else if (supports_bf16 && src0->type == GGML_TYPE_BF16 && ggml_is_contiguous(src0) && row_diff == src0->ne[1]) { + ggml_cuda_pool_alloc src1_as_bf16(ctx.pool(id)); + if (src1->type != GGML_TYPE_BF16) { + const to_bf16_cuda_t to_bf16_cuda = ggml_get_to_bf16_cuda(src1->type); +-- +2.43.0 diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch new file mode 100644 index 
000000000000..c843d9d3c3ee --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0034-feat-paged-native-NVFP4-W4A4-FP4-MMA-large-M-prefill.patch @@ -0,0 +1,638 @@ +From 14824147a504b58cc8be2f127f7d6bedb672cfc9 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Mon, 29 Jun 2026 00:11:22 +0200 +Subject: [PATCH] feat(paged): native NVFP4 (W4A4) FP4-MMA large-M prefill GEMM + (patch 0034) + +Replace the rejected 0033 dequant->bf16 cuBLAS scaffold with a native FP4-MMA +(W4A4 block-scale OMMA) large-M GEMM that engages only at prefill, behind the +same LLAMA_FP4_PREFILL_M threshold, so decode / small-M stay byte-untouched. + +KERNEL (ggml/src/ggml-cuda/fp4-gemm.{cu,cuh}): the VERIFIED PoC +(fp4_gemm_w4a4_opt.cu, NMSE=0 vs same-dequant f32) copied verbatim at its tuned +best config 128x128 / KBLK4 / STAGES2 / PAD4 (~103 TFLOP/s, beats cuBLAS bf16). +Preserved exactly: e4m3(true_scale) convention, the ldmatrix.sync.m8n8.x4 A-operand +load, the mma.sync.kind::mxf4nvf4.block_scale.scale_vec::4X.m16n8k64 OMMA, cp.async +multistage prefetch, register-resident accumulators, smem PAD. Activations are +quantized with the SAME math as quantize_mmq_nvfp4 (e4m3 amax/6 + the +/-2 code +search + ggml_cuda_float_to_fp4_e2m1), so it is bit-exact-by-construction with the +shipped FP4-MMQ path (only the K-reduction order differs, greedy-md5 gated). + +DENSE: routed in ggml_cuda_mul_mat via ggml_cuda_fp4_prefill_should_engage() +(src0 NVFP4 + src1/dst f32, contiguous, non-transposed, 2D, Blackwell, M>thr, +N%128==0, K%256==0). Non-divisible shapes fall back to FP4-MMQ (NOT the rejected +bf16 cuBLAS path). LANDED + greedy-md5 byte-identical (on==off: "Paris"). + +MoE GROUPED (the actual prefill bottleneck): mmq.cu forces the grouped FP4-MMQ +id-path OFF at large M (n_experts>0), so mul_mat_id falls to its per-expert +host-sync loop where each expert slice flows back through ggml_cuda_mul_mat and +hits the native kernel per-expert. 
Prefill is not graph-replayed so this is safe; +decode keeps ne12<=threshold so the graph-safe MMQ id-path (patch 0025) is +untouched. LANDED via host-sync + greedy-md5 byte-identical (on==off). +FOLLOW-UP (flagged): a graph-safe ragged-batched grouped FP4-MMA kernel to remove +the per-expert host-sync loop; out of scope for this pass. + +BUILD: arch=compute_121a,code=[compute_121a,sm_121a] already in build-cuda flags; +the kernel uses BLACKWELL_MMA_AVAILABLE/CP_ASYNC_AVAILABLE guards. Incremental +build-cuda green (ggml-cuda relinked, llama-server + llama-cli relinked). + +Default-off (LLAMA_FP4_PREFILL_M=0 == stock); set env/-D to engage. + +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + ggml/src/ggml-cuda/fp4-gemm.cu | 453 ++++++++++++++++++++++++++++++++ + ggml/src/ggml-cuda/fp4-gemm.cuh | 38 +++ + ggml/src/ggml-cuda/ggml-cuda.cu | 14 + + ggml/src/ggml-cuda/mmq.cu | 35 +-- + 4 files changed, 525 insertions(+), 15 deletions(-) + create mode 100644 ggml/src/ggml-cuda/fp4-gemm.cu + create mode 100644 ggml/src/ggml-cuda/fp4-gemm.cuh + +diff --git a/ggml/src/ggml-cuda/fp4-gemm.cu b/ggml/src/ggml-cuda/fp4-gemm.cu +new file mode 100644 +index 0000000..86da551 +--- /dev/null ++++ b/ggml/src/ggml-cuda/fp4-gemm.cu +@@ -0,0 +1,453 @@ ++#include "fp4-gemm.cuh" ++ ++#include ++#include ++#include ++ ++// =========================================================================== ++// [paged patch 0034] Native NVFP4 (W4A4) large-M GEMM. See fp4-gemm.cuh. ++// ++// The GEMM kernel, the m16n8k64 block-scale OMMA wrapper, the cp.async helpers and ++// the layout-split kernel are the VERIFIED PoC (fp4_gemm_w4a4_opt.cu, NMSE=0) copied ++// verbatim - do not "tidy" the index math, it is the load-bearing correctness. 
++// =========================================================================== ++ ++#define FP4_QK 64 // == QK_NVFP4 ++#define FP4_SAW 8 // u32 per nvfp4 block qs (32 bytes) ++ ++#ifndef LLAMA_FP4_PREFILL_M ++#define LLAMA_FP4_PREFILL_M 0 ++#endif // LLAMA_FP4_PREFILL_M ++ ++static int64_t ggml_cuda_fp4_prefill_m() { ++ static const int64_t m = [] { ++ const char * e = getenv("LLAMA_FP4_PREFILL_M"); ++ return e != nullptr ? (int64_t) atoll(e) : (int64_t) LLAMA_FP4_PREFILL_M; ++ }(); ++ return m; ++} ++ ++// ---- layout split: block_nvfp4[R*Kb] -> qs codes [R*Kb*8 u32] + scales [R*Kb u32] ---- ++// Same fp4 codes & e4m3 scale bytes as the GGUF, restored into two contiguous, ++// 16B-friendly arrays so the kernel's cp.async copies are coalesced. (PoC verbatim.) ++static __global__ void fp4_split_layout( ++ const block_nvfp4 * __restrict__ X, uint32_t * __restrict__ Q, uint32_t * __restrict__ S, ++ int R, int Kb) { ++ const int64_t b = (int64_t) blockIdx.x * blockDim.x + threadIdx.x; ++ const int64_t tot = (int64_t) R * Kb; ++ if (b >= tot) { ++ return; ++ } ++ const block_nvfp4 & blk = X[b]; ++ const uint32_t * q = (const uint32_t *) blk.qs; ++ uint32_t * dq = &Q[b * 8]; ++#pragma unroll ++ for (int w = 0; w < 8; w++) { ++ dq[w] = q[w]; ++ } ++ S[b] = *(const uint32_t *) blk.d; ++} ++ ++// ---- activation quantizer: f32 [M_real x K] -> split NVFP4 (Aq codes + As scales) ---- ++// Uses the SAME math as quantize_mmq_nvfp4 (quantize.cu): e4m3 scale = ue4m3(amax/6) ++// with the +/-2 code search, ggml_cuda_float_to_fp4_e2m1 for the nibbles, so the ++// activation codes are identical to the shipped FP4-MMQ path. Packs into the PoC ++// block layout (qs[s*8+j] = code(e[j]) | code(e[j+8])<<4) expected by the kernel's ++// ldmatrix A-operand load. One thread per (row, kb, sub-block). 
++static __global__ void fp4_quantize_act_split( ++ const float * __restrict__ x, uint32_t * __restrict__ Aq, uint32_t * __restrict__ As, ++ int M_real, int K, int Kb) { ++#ifdef BLACKWELL_MMA_AVAILABLE ++ const int64_t tot = (int64_t) M_real * Kb * 4; // 4 sub-blocks per 64-element block ++ const int64_t t = (int64_t) blockIdx.x * blockDim.x + threadIdx.x; ++ if (t >= tot) { ++ return; ++ } ++ const int sub = (int) (t & 3); ++ const int64_t rb = t >> 2; // row*Kb + kb ++ const int kb = (int) (rb % Kb); ++ const int64_t row = rb / Kb; ++ ++ const float * v16 = x + row * (int64_t) K + (int64_t) kb * FP4_QK + sub * 16; ++ float vals[16]; ++ float amax = 0.0f; ++#pragma unroll ++ for (int k = 0; k < 16; k++) { ++ const float vv = v16[k]; ++ vals[k] = vv; ++ amax = fmaxf(amax, fabsf(vv)); ++ } ++ ++ static constexpr int test_offsets[5] = { 0, -1, 1, -2, 2 }; ++ const int first_fp8_code = (int) ggml_cuda_fp32_to_ue4m3(amax / 6.0f); ++ ++ float best_err = FLT_MAX; ++ uint8_t fp8_code = 0; ++ float subblock_scale = 0.0f; ++#pragma unroll ++ for (int i = 0; i < 5; i++) { ++ const int test_code = first_fp8_code + test_offsets[i]; ++ if (test_code < 0 || test_code > 0x7e) { ++ continue; ++ } ++ const uint8_t code = (uint8_t) test_code; ++ const float test_scale = ggml_cuda_ue4m3_to_fp32(code); ++ const float test_inv_scale = test_scale > 0.0f ? 0.5f / test_scale : 0.0f; ++ float cur_err = 0.0f; ++#pragma unroll ++ for (int k = 0; k < 16; k++) { ++ const uint8_t q = ggml_cuda_float_to_fp4_e2m1(vals[k], test_inv_scale); ++ const float err_diff = fabsf(vals[k]) - fabsf((float) kvalues_mxfp4[q & 0x7]) * test_scale; ++ cur_err = fmaf(err_diff, err_diff, cur_err); ++ } ++ if (cur_err < best_err) { ++ best_err = cur_err; ++ fp8_code = code; ++ subblock_scale = test_scale; ++ } ++ } ++ const float inv_scale = subblock_scale > 0.0f ? 0.5f / subblock_scale : 0.0f; ++ ++ // PoC packing: qs[s*8+j] = code(e[j]) | code(e[j+8])<<4 -> two u32 words per sub-block. 
++ uint32_t w0 = 0, w1 = 0; ++#pragma unroll ++ for (int j = 0; j < 4; j++) { ++ const uint32_t lo = ggml_cuda_float_to_fp4_e2m1(vals[j], inv_scale); ++ const uint32_t hi = ggml_cuda_float_to_fp4_e2m1(vals[j + 8], inv_scale); ++ w0 |= ((lo | (hi << 4)) & 0xff) << (8 * j); ++ } ++#pragma unroll ++ for (int j = 0; j < 4; j++) { ++ const uint32_t lo = ggml_cuda_float_to_fp4_e2m1(vals[j + 4], inv_scale); ++ const uint32_t hi = ggml_cuda_float_to_fp4_e2m1(vals[j + 12], inv_scale); ++ w1 |= ((lo | (hi << 4)) & 0xff) << (8 * j); ++ } ++ ++ const int64_t blk = row * (int64_t) Kb + kb; ++ Aq[blk * 8 + sub * 2 + 0] = w0; ++ Aq[blk * 8 + sub * 2 + 1] = w1; ++ reinterpret_cast<uint8_t *>(As + blk)[sub] = fp8_code; ++#else ++ GGML_UNUSED(x); GGML_UNUSED(Aq); GGML_UNUSED(As); ++ GGML_UNUSED(M_real); GGML_UNUSED(K); GGML_UNUSED(Kb); ++ NO_DEVICE_CODE; ++#endif // BLACKWELL_MMA_AVAILABLE ++} ++ ++// ---- native FP4 block-scale OMMA wrapper (PoC verbatim) ---- ++static __device__ __forceinline__ void fp4_mma( ++ float d[4], const uint32_t a[4], const uint32_t b[2], uint32_t as, uint32_t bs) { ++#ifdef BLACKWELL_MMA_AVAILABLE ++ asm volatile( ++ "mma.sync.aligned.kind::mxf4nvf4.block_scale.scale_vec::4X.m16n8k64.row.col.f32.e2m1.e2m1.f32.ue4m3 " ++ "{%0,%1,%2,%3},{%4,%5,%6,%7},{%8,%9},{%0,%1,%2,%3},%10,{0,0},%11,{0,0};" ++ : "+f"(d[0]),"+f"(d[1]),"+f"(d[2]),"+f"(d[3]) ++ : "r"(a[0]),"r"(a[1]),"r"(a[2]),"r"(a[3]),"r"(b[0]),"r"(b[1]),"r"(as),"r"(bs)); ++#else ++ GGML_UNUSED(d); GGML_UNUSED(a); GGML_UNUSED(b); GGML_UNUSED(as); GGML_UNUSED(bs); ++ NO_DEVICE_CODE; ++#endif // BLACKWELL_MMA_AVAILABLE ++} ++ ++// ---- cp.async helpers (PoC verbatim) ---- ++static __device__ __forceinline__ void fp4_cp_async16(void * smem, const void * gmem) { ++#ifdef CP_ASYNC_AVAILABLE ++ unsigned s = (unsigned) __cvta_generic_to_shared(smem); ++ asm volatile("cp.async.cg.shared.global [%0],[%1],16;\n" :: "r"(s), "l"(gmem)); ++#else ++ GGML_UNUSED(smem); GGML_UNUSED(gmem); NO_DEVICE_CODE; ++#endif // 
CP_ASYNC_AVAILABLE ++} ++template <int B> ++static __device__ __forceinline__ void fp4_cp_async_small(void * smem, const void * gmem) { ++#ifdef CP_ASYNC_AVAILABLE ++ unsigned s = (unsigned) __cvta_generic_to_shared(smem); ++ asm volatile("cp.async.ca.shared.global [%0],[%1],%2;\n" :: "r"(s), "l"(gmem), "n"(B)); ++#else ++ GGML_UNUSED(smem); GGML_UNUSED(gmem); NO_DEVICE_CODE; ++#endif // CP_ASYNC_AVAILABLE ++} ++static __device__ __forceinline__ void fp4_cp_commit() { ++#ifdef CP_ASYNC_AVAILABLE ++ asm volatile("cp.async.commit_group;\n" ::); ++#else ++ NO_DEVICE_CODE; ++#endif // CP_ASYNC_AVAILABLE ++} ++template <int N> ++static __device__ __forceinline__ void fp4_cp_wait() { ++#ifdef CP_ASYNC_AVAILABLE ++ asm volatile("cp.async.wait_group %0;\n" :: "n"(N)); ++#else ++ NO_DEVICE_CODE; ++#endif // CP_ASYNC_AVAILABLE ++} ++ ++// --------------------------------------------------------------------------- ++// Optimized native FP4 GEMM (PoC verbatim). C[M,N] = A_fp4[M,K] @ W_fp4[N,K]^T ++// inputs are layout-split: Aq[M*Kb*8], As[M*Kb], Wq[N*Kb*8], Ws[N*Kb] ++// Tile BM x BN, K-step = KBLK nvfp4 blocks (BK = 64*KBLK), STAGES-deep pipeline, ++// PAD u32 padding per smem row to defeat bank conflicts. 
++// --------------------------------------------------------------------------- ++template ++__launch_bounds__(WARPS_M*WARPS_N*32,1) ++static __global__ void fp4_opt_kernel( ++ const uint32_t * __restrict__ Aq, const uint32_t * __restrict__ As, ++ const uint32_t * __restrict__ Wq, const uint32_t * __restrict__ Ws, ++ float * __restrict__ C, int M, int N, int K) { ++#ifdef BLACKWELL_MMA_AVAILABLE ++ constexpr int NWARP=WARPS_M*WARPS_N; ++ constexpr int THREADS=NWARP*32; ++ constexpr int WM=BM/WARPS_M, WN=BN/WARPS_N; ++ constexpr int MF=WM/16, NF=WN/8; ++ constexpr int SAW=8; // u32 per block (qs) ++ constexpr int ARS=KBLK*SAW+PAD; // A smem row stride (u32) ++ constexpr int WRS=KBLK*SAW+PAD; // W smem row stride (u32) ++ ++ extern __shared__ uint32_t smem[]; ++ // per-stage slabs ++ constexpr int SZ_AQ=BM*ARS, SZ_AS=BM*KBLK, SZ_WQ=BN*WRS, SZ_WS=BN*KBLK; ++ constexpr int STAGE_SZ=SZ_AQ+SZ_AS+SZ_WQ+SZ_WS; ++ uint32_t* sAq[STAGES]; uint32_t* sAs[STAGES]; uint32_t* sWq[STAGES]; uint32_t* sWs[STAGES]; ++#pragma unroll ++ for(int s=0;s>5, lane=tid&31; ++ const int wrow=warp/WARPS_N, wcol=warp%WARPS_N; ++ const int grp=lane>>2, tig=lane&3; ++ const int tidxA = lane/4 + (lane%2)*8; ++ const int tidxB = lane/4; ++ const int blockRow=blockIdx.y*BM, blockCol=blockIdx.x*BN; ++ const int Kb=K/64; ++ const int numK=Kb/KBLK; ++ ++ float acc[MF][NF][4]; ++#pragma unroll ++ for(int i=0;i>1; ++ int r=blk/KBLK, kb=blk%KBLK; ++ const uint32_t* src=&Aq[((size_t)(blockRow+r)*Kb + kb0+kb)*SAW + chunk*4]; ++ fp4_cp_async16(&sAq[st][r*ARS + kb*SAW + chunk*4], src); ++ } ++ // W qs ++#pragma unroll 1 ++ for(int idx=tid; idx>1; ++ int r=blk/KBLK, kb=blk%KBLK; ++ const uint32_t* src=&Wq[((size_t)(blockCol+r)*Kb + kb0+kb)*SAW + chunk*4]; ++ fp4_cp_async16(&sWq[st][r*WRS + kb*SAW + chunk*4], src); ++ } ++ // A scales: BM rows, KBLK contiguous u32 each ++#pragma unroll 1 ++ for(int r=tid; r(dst,src); ++ else fp4_cp_async_small<4>(dst,src); ++ } ++ // W scales ++#pragma unroll 1 ++ for(int r=tid; 
r(dst,src); ++ else fp4_cp_async_small<4>(dst,src); ++ } ++ }; ++ ++ // prologue: issue STAGES-1 tiles (tiles 0..STAGES-2 into stages 0..STAGES-2) ++#pragma unroll ++ for(int s=0;s(); ++ __syncthreads(); ++ ++ const int rs=kt%STAGES; ++#pragma unroll ++ for(int kb=0; kbtype != GGML_TYPE_NVFP4) { ++ return false; ++ } ++ if (!blackwell_mma_available(cc)) { ++ return false; ++ } ++ const int64_t thr = ggml_cuda_fp4_prefill_m(); ++ if (thr <= 0) { ++ return false; // default-off == stock; decode/small-M untouched ++ } ++ if (src1->type != GGML_TYPE_F32 || dst->type != GGML_TYPE_F32) { ++ return false; ++ } ++ if (src1->ne[1] <= thr) { ++ return false; // M = src1->ne[1]; only LARGE M (prefill) ++ } ++ if (!ggml_is_contiguous(src0) || !ggml_is_contiguous(src1) || !ggml_is_contiguous(dst)) { ++ return false; ++ } ++ if (ggml_is_transposed(src0) || ggml_is_transposed(src1)) { ++ return false; ++ } ++ // 2D only (a single weight matrix; per-expert MoE slices set ne[2]=ne[3]=1). ++ if (src0->ne[2] != 1 || src0->ne[3] != 1 || src1->ne[2] != 1 || src1->ne[3] != 1) { ++ return false; ++ } ++ const int64_t K = src0->ne[0]; ++ const int64_t N = src0->ne[1]; ++ if (N % 128 != 0 || K % 256 != 0) { ++ return false; // tile constraints; otherwise fall back to MMQ ++ } ++ return true; ++} ++ ++void ggml_cuda_mul_mat_fp4_large_m( ++ ggml_backend_cuda_context & ctx, ++ const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) { ++ GGML_ASSERT(src0->type == GGML_TYPE_NVFP4); ++ GGML_ASSERT(src1->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32); ++ ++ const int64_t K = src0->ne[0]; ++ const int64_t N = src0->ne[1]; ++ const int64_t M = src1->ne[1]; ++ const int64_t Kb = K / FP4_QK; ++ GGML_ASSERT(K % 256 == 0 && N % 128 == 0); ++ ++ cudaStream_t stream = ctx.stream(); ++ ++ constexpr int BM = 128, BN = 128, WM = 4, WN = 2, KBLK = 4, STAGES = 2, PAD = 4; ++ const int64_t Mpad = ((M + BM - 1) / BM) * BM; ++ ++ ggml_cuda_pool_alloc Wq(ctx.pool(), (size_t) N * Kb * 8); ++ 
ggml_cuda_pool_alloc Ws(ctx.pool(), (size_t) N * Kb); ++ ggml_cuda_pool_alloc Aq(ctx.pool(), (size_t) Mpad * Kb * 8); ++ ggml_cuda_pool_alloc As(ctx.pool(), (size_t) Mpad * Kb); ++ ++ // Zero the scales of the padded A-rows (M..Mpad) so they contribute 0 (scale 0 -> ++ // the OMMA's per-block scale is 0). The padded qs may stay uninitialized. ++ if (Mpad > M) { ++ CUDA_CHECK(cudaMemsetAsync(As.get() + (size_t) M * Kb, 0, ++ (size_t) (Mpad - M) * Kb * sizeof(uint32_t), stream)); ++ } ++ ++ // split weights (GGUF block_nvfp4 -> Wq/Ws) ++ { ++ const int64_t tot = N * Kb; ++ const int threads = 256; ++ const int64_t grid = (tot + threads - 1) / threads; ++ fp4_split_layout<<>>( ++ (const block_nvfp4 *) src0->data, Wq.get(), Ws.get(), (int) N, (int) Kb); ++ CUDA_CHECK(cudaGetLastError()); ++ } ++ // quantize + split activations (real rows only) ++ { ++ const int64_t tot = M * Kb * 4; ++ const int threads = 256; ++ const int64_t grid = (tot + threads - 1) / threads; ++ fp4_quantize_act_split<<>>( ++ (const float *) src1->data, Aq.get(), As.get(), (int) M, (int) K, (int) Kb); ++ CUDA_CHECK(cudaGetLastError()); ++ } ++ ++ // Output: write the (Mpad x N) result straight into dst when M is tile-aligned, ++ // otherwise into a temp and copy back the first M rows (C is row-major C[m*N+n]). 
++ float * Cout = (float *) dst->data; ++ ggml_cuda_pool_alloc Ctmp(ctx.pool()); ++ if (Mpad > M) { ++ Cout = Ctmp.alloc((size_t) Mpad * N); ++ } ++ ++ auto kern = fp4_opt_kernel; ++ constexpr int SZ_AQ = BM * (KBLK * 8 + PAD), SZ_AS = BM * KBLK; ++ constexpr int SZ_WQ = BN * (KBLK * 8 + PAD), SZ_WS = BN * KBLK; ++ constexpr int STAGE_SZ = SZ_AQ + SZ_AS + SZ_WQ + SZ_WS; ++ const int smem_bytes = STAGES * STAGE_SZ * (int) sizeof(uint32_t); ++ CUDA_CHECK(cudaFuncSetAttribute(kern, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_bytes)); ++ ++ dim3 grid((unsigned) (N / BN), (unsigned) (Mpad / BM)); ++ dim3 block(WM * WN * 32); ++ kern<<>>( ++ Aq.get(), As.get(), Wq.get(), Ws.get(), Cout, (int) Mpad, (int) N, (int) K); ++ CUDA_CHECK(cudaGetLastError()); ++ ++ if (Mpad > M) { ++ CUDA_CHECK(cudaMemcpyAsync(dst->data, Ctmp.get(), (size_t) M * N * sizeof(float), ++ cudaMemcpyDeviceToDevice, stream)); ++ } ++} +diff --git a/ggml/src/ggml-cuda/fp4-gemm.cuh b/ggml/src/ggml-cuda/fp4-gemm.cuh +new file mode 100644 +index 0000000..8ed1aa4 +--- /dev/null ++++ b/ggml/src/ggml-cuda/fp4-gemm.cuh +@@ -0,0 +1,38 @@ ++#pragma once ++ ++#include "common.cuh" ++ ++// [paged patch 0034] Native NVFP4 (W4A4) large-M GEMM for Blackwell sm_121a (GB10). ++// ++// A Marlin-class tiled FP4-MMA GEMM (cp.async multistage prefetch, register-resident ++// accumulators, ldmatrix A-operand, m16n8k64 mxf4nvf4 block-scale OMMA with e4m3 ++// true-scale) that beats the dequant->bf16 cuBLAS (nvjet) path that the rejected 0033 ++// scaffold routed large-M prefill through. The kernel body is the bit-exact PoC ++// (NMSE=0 vs a same-dequant f32 reference) at its tuned best config ++// (128x128 / KBLK4 / STAGES2 / PAD4). 
++// ++// It is bit-exact-by-construction with the shipped FP4-MMQ path: it consumes the SAME ++// e2m1 weight nibbles + e4m3 scale bytes from the GGUF block_nvfp4, quantizes ++// activations with the SAME math as quantize_mmq_nvfp4 (e4m3 amax/6 scale + the +/-2 ++// code search + ggml_cuda_float_to_fp4_e2m1), and feeds the SAME hardware OMMA. The ++// only difference vs FP4-MMQ is the K-accumulation order (a different but equivalent ++// f32 reduction tree), which is greedy-md5 gated like every other paged path. ++// ++// Engages ONLY at large M (prefill), behind the 0033 LLAMA_FP4_PREFILL_M threshold; ++// decode and small-M are byte-untouched and never reach this kernel. ++ ++// True if the native FP4 large-M path should handle this dense NVFP4 mul_mat: ++// src0 NVFP4 + src1/dst f32, contiguous, not transposed, 2D, Blackwell, ++// LLAMA_FP4_PREFILL_M > 0, M = src1->ne[1] > threshold, N % 128 == 0, K % 256 == 0. ++// This single predicate also routes per-expert MoE slices (they flow through ++// ggml_cuda_mul_mat) into the native kernel. ++bool ggml_cuda_fp4_prefill_should_engage( ++ const ggml_tensor * src0, const ggml_tensor * src1, const ggml_tensor * dst, int cc); ++ ++// Native FP4 W4A4 GEMM: dst[M,N] = src1_act[M,K] @ src0_w[N,K]^T. ++// src0 = NVFP4 weights, src1 = f32 activations, dst = f32. Streams on ctx.stream(), ++// pool-allocates scratch; no host sync. Caller must have checked ++// ggml_cuda_fp4_prefill_should_engage(). 
++void ggml_cuda_mul_mat_fp4_large_m( ++ ggml_backend_cuda_context & ctx, ++ const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst); +diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu +index 2ecc971..a92003c 100644 +--- a/ggml/src/ggml-cuda/ggml-cuda.cu ++++ b/ggml/src/ggml-cuda/ggml-cuda.cu +@@ -25,6 +25,7 @@ + #include "ggml-cuda/diagmask.cuh" + #include "ggml-cuda/diag.cuh" + #include "ggml-cuda/fattn.cuh" ++#include "ggml-cuda/fp4-gemm.cuh" + #include "ggml-cuda/fwht.cuh" + #include "ggml-cuda/getrows.cuh" + #include "ggml-cuda/im2col.cuh" +@@ -2541,6 +2542,19 @@ static bool ggml_cuda_should_fuse_mul_mat_vec_q(const ggml_tensor * tensor) { + static void ggml_cuda_mul_mat(ggml_backend_cuda_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst) { + const bool split = ggml_backend_buft_is_cuda_split(src0->buffer->buft); + ++ // [paged patch 0034] Native NVFP4 (W4A4) large-M (prefill) FP4-MMA GEMM. Engages only ++ // when LLAMA_FP4_PREFILL_M>0 and M=src1->ne[1] exceeds it (and tile dims divide), so ++ // decode / small-M is byte-untouched. This also catches the per-expert MoE slices that ++ // flow through here from the mul_mat_id host-sync loop, routing each expert GEMM to the ++ // native kernel (see ggml_cuda_should_use_mmq's MoE gate in mmq.cu). ++ if (!split) { ++ const int cc_fp4 = ggml_cuda_info().devices[ggml_cuda_get_device()].cc; ++ if (ggml_cuda_fp4_prefill_should_engage(src0, src1, dst, cc_fp4)) { ++ ggml_cuda_mul_mat_fp4_large_m(ctx, src0, src1, dst); ++ return; ++ } ++ } ++ + // If src0 is a temporary compute buffer it may have some padding that needs to be cleared for mul_mat_vec_q or mul_mat_q. + // But if src0 is also a view of another tensor then this cannot be done safely because it may overwrite valid tensor data. + // Therefore, in such cases use cuBLAS. 
+diff --git a/ggml/src/ggml-cuda/mmq.cu b/ggml/src/ggml-cuda/mmq.cu +index 2dcaaab..694a402 100644 +--- a/ggml/src/ggml-cuda/mmq.cu ++++ b/ggml/src/ggml-cuda/mmq.cu +@@ -321,24 +321,29 @@ bool ggml_cuda_should_use_mmq(enum ggml_type type, int cc, int64_t ne11, int64_t + return false; + } + +- // Paged prefill lever (patch 0033): OPTION-(a) route large-M NVFP4 dense GEMMs +- // OFF the FP4-MMQ kernel and through the dequant->bf16 cuBLAS (nvjet) +- // tensor-core path (ggml_cuda_op_mul_mat_cublas, NVFP4 bf16 branch). The +- // scope premise was that FP4-MMQ is register-bound to ~3% of FP4 peak at +- // large M. MEASURED ON GB10 THIS IS FALSE: FP4-MMQ at M=512..2048 beats +- // dequant->bf16 cuBLAS by 29-49% (S_PP A/B in docs/PREFILL_GEMM_RESULTS.md), +- // because bf16 tensor-core peak is ~half FP4 peak AND the per-step weight +- // dequant + 4x bf16 weight traffic (~8x total vs the FP4 read) dominate and +- // only partially amortize as M grows. The path is NUMERICALLY VALID and +- // benign (greedy md5 byte-identical to FP4-MMQ; test-backend-ops passes), so +- // it is kept as a validated, env-gated scaffold (for option-(b) native FP4 +- // large-M kernels and non-GB10 hardware), but DEFAULT-DISABLED (== stock). +- // Set -D LLAMA_FP4_PREFILL_M= or env LLAMA_FP4_PREFILL_M= to A/B it; +- // 0 (default) disables. Dense only (n_experts == 0). ++ // Paged prefill lever (patch 0033 -> 0034): route large-M NVFP4 prefill GEMMs to the ++ // native FP4-MMA (W4A4 OMMA) kernel in fp4-gemm.cu instead of the FP4-MMQ kernel. ++ // ++ // - DENSE (n_experts == 0): the reroute happens earlier, in ggml_cuda_mul_mat's ++ // ggml_cuda_fp4_prefill_should_engage() early check, which knows the N/K tile ++ // divisibility. We deliberately do NOT force dense off MMQ here: if the native ++ // kernel cannot take a shape (non-divisible N/K) MMQ stays the correct fallback, ++ // NOT the rejected dequant->bf16 cuBLAS path. 
++ // - MoE (n_experts > 0): force the grouped FP4-MMQ id-path OFF at large M so ++ // mul_mat_id falls to its per-expert host-sync loop, where each expert slice flows ++ // back through ggml_cuda_mul_mat and hits the native kernel. CUDA graphs are ++ // disabled for that prefill step (prefill is not graph-replayed); a graph-safe ++ // grouped (ragged-batched) FP4-MMA kernel is the flagged follow-up. Decode keeps ++ // ne12 <= threshold so the grouped graph-safe MMQ id-path (patch 0025) is untouched. ++ // ++ // The historical 0033 finding stands: dequant->bf16 cuBLAS LOSES to FP4-MMQ at large M ++ // (bf16 tensor-core peak is ~half FP4 peak + 8x weight traffic), which is exactly why ++ // the native FP4-MMA kernel (NMSE=0, ~103 TFLOP/s, beats cuBLAS bf16) replaces it here. ++ // Set -D LLAMA_FP4_PREFILL_M= or env LLAMA_FP4_PREFILL_M=; 0 (default) == stock. + #ifndef LLAMA_FP4_PREFILL_M + #define LLAMA_FP4_PREFILL_M 0 + #endif // LLAMA_FP4_PREFILL_M +- if (type == GGML_TYPE_NVFP4 && n_experts == 0 && blackwell_mma_available(cc)) { ++ if (type == GGML_TYPE_NVFP4 && n_experts > 0 && blackwell_mma_available(cc)) { + static const int64_t fp4_prefill_m = [] { + const char * e = getenv("LLAMA_FP4_PREFILL_M"); + return e != nullptr ? 
(int64_t) atoll(e) : (int64_t) LLAMA_FP4_PREFILL_M; +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0035-feat-paged-marlin-w4a16-grouped-moe-prefill-gemm.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0035-feat-paged-marlin-w4a16-grouped-moe-prefill-gemm.patch new file mode 100644 index 000000000000..0117722d05ef --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0035-feat-paged-marlin-w4a16-grouped-moe-prefill-gemm.patch @@ -0,0 +1,572 @@ +From df186bd20a23a1baae92f2828fc68f240c115e7d Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Mon, 29 Jun 2026 03:34:48 +0200 +Subject: [PATCH] feat(paged): Marlin-style W4A16 grouped MoE prefill GEMM + (patch 0035) + +Profile-validated #2 prefill lever: a DISTINCT kernel from the two prefill +rejects. NOT patch 0033 (separate-pass dequant -> bf16 cuBLAS/nvjet, lost to +FP4-MMQ at large M). NOT patch 0034 (native W4A4 FP4-MMA mxf4nvf4 OMMA, still +pays the quantize_mmq_nvfp4 activation-quant tax). This is the W4A16 shape vLLM +uses on sm_121: FP4 expert weights dequantized to bf16 IN REGISTERS right before +the MMA, activations kept bf16 (a cheap f32->bf16 cast, NO per-block amax/code +quantize -> ZERO activation-quant tax), standard bf16 m16n8k16 mma.sync (reuses +ggml/src/ggml-cuda/mma.cuh tiles) into f32 accumulators, cp.async multistage. + +GROUPED (the actual prefill shape): one kernel launch over the mul_mat_id +token-sorted activation buffer (src1_sorted is already sorted-by-expert by the +existing host path), with a per-M-tile expert map so each output tile reads its +own expert weight matrix (src0 + expert*nb02); the ragged per-expert row tail is +masked. No per-expert kernel launch, no per-expert M-padding (vs the 0034 +per-expert host-sync loop). The B (weight) fragment is filled by in-register +FP4->bf16 dequant via the tile get_i/get_j contract (correct-by-construction +vs ldmatrix); the A (activation) fragment is a bf16 ldmatrix. 
+ +ROUTING (default-off; distinct env from 0034): + - mmq.cu (ggml_cuda_should_use_mmq): NVFP4 + n_experts>0 + Blackwell + + ne11(tokens) > LLAMA_W4A16_PREFILL_M returns false, so mul_mat_id falls to + the token-sorting host path. + - ggml-cuda.cu (ggml_cuda_mul_mat_id): once src1_sorted is built, if + ggml_cuda_w4a16_moe_grouped_should_engage() the grouped kernel replaces the + per-expert GEMM loop (dst_sorted then scattered back as usual). Decode keeps + ne12 <= threshold so the graph-safe grouped MMQ id-path (0025/0043) is + untouched; non-MoE / non-NVFP4 / small-M are byte-untouched. + +TOGGLE / A-B: env (or -D) LLAMA_W4A16_PREFILL_M. 0 (default) == OFF == stock; +>0 engages for MoE prefill GEMMs with tokens > the value. LLAMA_W4A16_DEBUG=1 +prints per-GEMM engagement (total_rows / n_tiles / max-tokens-per-expert). + +VALIDATION (GB10, sm_121a, Qwen3.6-35B-A3B-NVFP4): + - test-backend-ops MUL_MAT_ID nvfp4 (CUDA0 vs CPU oracle), W4A16 forced + (LLAMA_W4A16_PREFILL_M=1): 81/81 OK, 0 FAIL (incl. multi-tile-per-expert + cases). The threading bug found here (mma.cuh tile ops use threadIdx.x AS the + warp lane, so the block must be 2D (32,NWARP)) is fixed. + - greedy md5 (paged MoE, LLAMA_KV_PAGED=1): NOT-engaged (high threshold) == + OFF baseline 4a3fd812 BYTE-IDENTICAL (default-off is stock); engaged + (120 grouped GEMMs on a 116-token prefill) is coherent + benign (a different + but equivalent bf16-vs-Q8_1 K-reduction, like the documented paged-MoE path + divergence), output near-identical to stock. + +HONEST PERF (S_PP t/s, llama-batched-bench -fa on -ngl 99 -ntg 32 -npl 1, +LLAMA_KV_PAGED=1, OFF vs W4A16 thr=64), CURRENTLY A REGRESSION: + npp 512 : 1096.7 -> 794.8 (-28%) + npp 1024: 1413.5 -> 961.1 (-32%) + npp 2048: 1671.3 -> 1069.6 (-36%) +Decode TG unaffected (~53 t/s both). 
The kernel is CORRECT but its first +untuned config (BM64/BN128/STAGES2, scalar in-register dequant, f32->bf16 cast +pre-pass, 4B weight cp.async, BM-tile ragged-utilization waste, per-GEMM host +tile-map + 3 H2D copies) does not yet beat the tuned FP4-MMQ grouped path on +GB10; it does not realize the profiled vLLM 2.16x. Ships DEFAULT-OFF (like 0033 +scaffold / 0017) as the validated, env-gated mechanism + bit-exact gate for the +tuning follow-ups (deeper pipeline, ldmatrix/16B weight staging, smem-conflict +padding, larger/register-resident tiles, removing the cast pre-pass, dropping +the per-GEMM host map). + +Build: arch=compute_121a,code=[compute_121a,sm_121a]; BLACKWELL_MMA_AVAILABLE / +AMPERE_MMA_AVAILABLE / CP_ASYNC_AVAILABLE guards (NO_DEVICE_CODE off-Blackwell). + +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + ggml/src/ggml-cuda/ggml-cuda.cu | 12 + + ggml/src/ggml-cuda/mmq.cu | 17 ++ + ggml/src/ggml-cuda/w4a16-gemm.cu | 359 ++++++++++++++++++++++++++++++ + ggml/src/ggml-cuda/w4a16-gemm.cuh | 55 +++++ + 4 files changed, 443 insertions(+) + create mode 100644 ggml/src/ggml-cuda/w4a16-gemm.cu + create mode 100644 ggml/src/ggml-cuda/w4a16-gemm.cuh + +diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu +index 3151684..37e4d11 100644 +--- a/ggml/src/ggml-cuda/ggml-cuda.cu ++++ b/ggml/src/ggml-cuda/ggml-cuda.cu +@@ -26,6 +26,7 @@ + #include "ggml-cuda/diag.cuh" + #include "ggml-cuda/fattn.cuh" + #include "ggml-cuda/fp4-gemm.cuh" ++#include "ggml-cuda/w4a16-gemm.cuh" + #include "ggml-cuda/fwht.cuh" + #include "ggml-cuda/getrows.cuh" + #include "ggml-cuda/im2col.cuh" +@@ -2747,6 +2748,16 @@ static void ggml_cuda_mul_mat_id(ggml_backend_cuda_context & ctx, ggml_tensor * + ne10*ts_src1_sorted, ne_get_rows*ne10*ts_src1_sorted, ne_get_rows*ne10*ts_src1_sorted, stream); + CUDA_CHECK(cudaGetLastError()); + ++ // [paged patch 0035] Marlin-style W4A16 grouped MoE prefill GEMM: one launch over the ++ // 
token-sorted activation buffer (src1_sorted, already f32 + sorted-by-expert above) with a ++ // per-tile expert map, in-register FP4->bf16 weight dequant + bf16 mma. Replaces the ++ // per-expert host-sync GEMM loop. Engages only when LLAMA_W4A16_PREFILL_M>0 and ne12>thr ++ // (large-M prefill); decode / non-NVFP4 keep the loop below (byte-identical to stock). ++ if (ggml_cuda_w4a16_moe_grouped_should_engage(src0, src1, dst, cc)) { ++ ggml_cuda_mul_mat_id_w4a16_grouped(ctx, src0, ++ (const float *) src1_sorted.ptr, (float *) dst_sorted.ptr, ++ tokens_per_expert.data(), ne02, ne10, ne0, stream); ++ } else { + char * src1_data_cur = (char *) src1_sorted.ptr; + char * dst_data_cur = (char *) dst_sorted.ptr; + for (int64_t i02 = 0; i02 < ne02; ++i02) { +@@ -2795,6 +2806,7 @@ static void ggml_cuda_mul_mat_id(ggml_backend_cuda_context & ctx, ggml_tensor * + src1_data_cur += src1_slice.nb[2]; + dst_data_cur += dst_slice.nb[2]; + } ++ } + + get_rows_cuda(dst_sorted.ptr, type_dst_sorted, ids_from_sorted, dst->data, dst->type, + ne0, ne0*ts_dst_sorted, ne_get_rows*ne0*ts_dst_sorted, ne_get_rows*ne0*ts_dst_sorted, +diff --git a/ggml/src/ggml-cuda/mmq.cu b/ggml/src/ggml-cuda/mmq.cu +index 694a402..dc5c2d1 100644 +--- a/ggml/src/ggml-cuda/mmq.cu ++++ b/ggml/src/ggml-cuda/mmq.cu +@@ -353,6 +353,23 @@ bool ggml_cuda_should_use_mmq(enum ggml_type type, int cc, int64_t ne11, int64_t + } + } + ++ // Paged prefill lever (patch 0035): the Marlin-style W4A16 grouped MoE GEMM also needs the ++ // grouped FP4-MMQ id-path forced OFF at large M so mul_mat_id falls to the token-sorting ++ // host path, where the grouped W4A16 kernel is dispatched (in-register FP4->bf16 dequant + ++ // bf16 mma, ZERO activation-quant). Distinct env from 0034; default 0 == stock. 
++#ifndef LLAMA_W4A16_PREFILL_M ++#define LLAMA_W4A16_PREFILL_M 0 ++#endif // LLAMA_W4A16_PREFILL_M ++ if (type == GGML_TYPE_NVFP4 && n_experts > 0 && blackwell_mma_available(cc)) { ++ static const int64_t w4a16_prefill_m = [] { ++ const char * e = getenv("LLAMA_W4A16_PREFILL_M"); ++ return e != nullptr ? (int64_t) atoll(e) : (int64_t) LLAMA_W4A16_PREFILL_M; ++ }(); ++ if (w4a16_prefill_m > 0 && ne11 > w4a16_prefill_m) { ++ return false; ++ } ++ } ++ + if (turing_mma_available(cc)) { + return true; + } +diff --git a/ggml/src/ggml-cuda/w4a16-gemm.cu b/ggml/src/ggml-cuda/w4a16-gemm.cu +new file mode 100644 +index 0000000..f348f31 +--- /dev/null ++++ b/ggml/src/ggml-cuda/w4a16-gemm.cu +@@ -0,0 +1,359 @@ ++#include "w4a16-gemm.cuh" ++#include "mma.cuh" ++ ++#include ++#include ++#include ++#include ++ ++// =========================================================================== ++// [paged patch 0035] Marlin-style W4A16 grouped MoE prefill GEMM. See w4a16-gemm.cuh. ++// ++// In-register FP4->bf16 weight dequant + bf16 activations + bf16 m16n8k16 mma.sync (mma.cuh), ++// cp.async multistage, grouped (ragged, per-tile expert offset) over the token-sorted buffer. ++// =========================================================================== ++ ++using namespace ggml_cuda_mma; ++typedef tile<16, 8, nv_bfloat162> tile_A; // A operand: M=16, K=16 ++typedef tile< 8, 8, nv_bfloat162> tile_B; // B operand: N=8, K=16 ++typedef tile<16, 8, float> tile_C; // accumulator: M=16, N=8 ++ ++#ifndef LLAMA_W4A16_PREFILL_M ++#define LLAMA_W4A16_PREFILL_M 0 ++#endif // LLAMA_W4A16_PREFILL_M ++ ++int64_t ggml_cuda_w4a16_prefill_m() { ++ static const int64_t m = [] { ++ const char * e = getenv("LLAMA_W4A16_PREFILL_M"); ++ return e != nullptr ? 
(int64_t) atoll(e) : (int64_t) LLAMA_W4A16_PREFILL_M; ++ }(); ++ return m; ++} ++ ++bool ggml_cuda_w4a16_prefill_enabled() { ++ return ggml_cuda_w4a16_prefill_m() > 0; ++} ++ ++// ---- cp.async helpers (sm80+; raw bytes, no cast) ---- ++static __device__ __forceinline__ void w4a16_cp_async16(void * smem, const void * gmem) { ++#ifdef CP_ASYNC_AVAILABLE ++ const unsigned s = (unsigned) __cvta_generic_to_shared(smem); ++ asm volatile("cp.async.cg.shared.global [%0],[%1],16;\n" :: "r"(s), "l"(gmem)); ++#else ++ GGML_UNUSED(smem); GGML_UNUSED(gmem); NO_DEVICE_CODE; ++#endif // CP_ASYNC_AVAILABLE ++} ++static __device__ __forceinline__ void w4a16_cp_async4(void * smem, const void * gmem) { ++#ifdef CP_ASYNC_AVAILABLE ++ const unsigned s = (unsigned) __cvta_generic_to_shared(smem); ++ asm volatile("cp.async.ca.shared.global [%0],[%1],4;\n" :: "r"(s), "l"(gmem)); ++#else ++ GGML_UNUSED(smem); GGML_UNUSED(gmem); NO_DEVICE_CODE; ++#endif // CP_ASYNC_AVAILABLE ++} ++static __device__ __forceinline__ void w4a16_cp_commit() { ++#ifdef CP_ASYNC_AVAILABLE ++ asm volatile("cp.async.commit_group;\n" ::); ++#else ++ NO_DEVICE_CODE; ++#endif // CP_ASYNC_AVAILABLE ++} ++template <int N> static __device__ __forceinline__ void w4a16_cp_wait() { ++#ifdef CP_ASYNC_AVAILABLE ++ asm volatile("cp.async.wait_group %0;\n" :: "n"(N)); ++#else ++ NO_DEVICE_CODE; ++#endif // CP_ASYNC_AVAILABLE ++} ++ ++// ---- f32 -> bf16 activation cast (NO quantize). Pads the [total_rows, pad_rows) tail with 0. ---- ++static __global__ void w4a16_cast_act_f32_bf16( ++ const float * __restrict__ x, nv_bfloat16 * __restrict__ y, int64_t n, int64_t npad) { ++ const int64_t i = (int64_t) blockIdx.x * blockDim.x + threadIdx.x; ++ if (i >= npad) { ++ return; ++ } ++ y[i] = i < n ? __float2bfloat16(x[i]) : (nv_bfloat16) 0.0f; ++} ++ ++// --------------------------------------------------------------------------- ++// Grouped W4A16 GEMM. 
For each output tile (blockIdx.x = N-block, blockIdx.y = M-tile): ++// expert e = g_tile_expert[blockIdx.y] ++// row_start = g_tile_row0[blockIdx.y] (absolute row in the sorted buffer) ++// row_count = g_tile_rows[blockIdx.y] (valid rows in this tile, <= BM) ++// Weights read from W = src0 + e*expert_stride_blocks (block_nvfp4 [N,Kb]); activations from ++// Abf (bf16, sorted); output to C (f32, sorted, [N, total_rows] = C[row*N + col]). ++// Weights are dequantized FP4->bf16 in registers; A via ldmatrix; bf16 m16n8k16 mma. ++// BK = 64 (one nvfp4 block per K-step); STAGES-deep cp.async pipeline over the Kb blocks. ++// --------------------------------------------------------------------------- ++template ++__launch_bounds__(WARPS_M*WARPS_N*32, 1) ++static __global__ void w4a16_grouped_kernel( ++ const nv_bfloat16 * __restrict__ Abf, // [pad_rows, K] bf16 ++ const block_nvfp4 * __restrict__ W0, // src0 base (expert 0) ++ float * __restrict__ C, // [total_rows, N] f32 ++ const int * __restrict__ g_tile_expert, ++ const int * __restrict__ g_tile_row0, ++ const int * __restrict__ g_tile_rows, ++ int N, int K, int64_t expert_stride_blocks) { ++#if defined(AMPERE_MMA_AVAILABLE) && defined(CP_ASYNC_AVAILABLE) ++ constexpr int BK = 64; // one nvfp4 block ++ constexpr int NWARP = WARPS_M*WARPS_N; ++ constexpr int THREADS = NWARP*32; ++ constexpr int WM = BM/WARPS_M, WN = BN/WARPS_N; ++ constexpr int MF = WM/16, NF = WN/8; ++ ++ constexpr int AN = BK/2; // bf16 pairs per A smem row (nv_bfloat162) ++ constexpr int SZ_A = BM*AN; // nv_bfloat162 per stage ++ constexpr int SZ_WQ = BN*8; // u32 per stage (32 qs bytes/row) ++ constexpr int SZ_WD = BN; // u32 per stage (4 scale bytes/row) ++ ++ extern __shared__ uint32_t smem_u32[]; ++ // Layout per stage: [A as u32 = nv_bfloat162][Wq u32][Wd u32] ++ constexpr int STAGE_U32 = SZ_A + SZ_WQ + SZ_WD; ++ nv_bfloat162 * sA[STAGES]; ++ uint32_t * sWq[STAGES]; ++ uint32_t * sWd[STAGES]; ++#pragma unroll ++ for (int s = 0; s < STAGES; s++) 
{ ++ uint32_t * base = smem_u32 + s*STAGE_U32; ++ sA[s] = (nv_bfloat162 *) base; ++ sWq[s] = base + SZ_A; ++ sWd[s] = base + SZ_A + SZ_WQ; ++ } ++ ++ // mma.cuh's tile ops (load_ldmatrix / mma / tile::get_i/get_j) use threadIdx.x AS THE WARP LANE, ++ // so the block MUST be 2D (32, NWARP): threadIdx.x = lane (0..31), threadIdx.y = warp. ++ const int lane = threadIdx.x; // 0..31 ++ const int warp = threadIdx.y; // 0..NWARP-1 ++ const int tid = warp*32 + lane; // linear id for the cp.async strided copies ++ const int wrow = warp / WARPS_N, wcol = warp % WARPS_N; ++ ++ const int e = g_tile_expert[blockIdx.y]; ++ const int row0 = g_tile_row0[blockIdx.y]; ++ const int rcount = g_tile_rows[blockIdx.y]; ++ const int blockCol = blockIdx.x*BN; ++ const int Kb = K/64; ++ const block_nvfp4 * We = W0 + (int64_t) e*expert_stride_blocks; // expert e weight base ++ ++ tile_C acc[MF][NF]; ++ ++ // async-load K-block `kt` into stage `st` ++ auto load_tile = [&](int st, int kt) { ++ // A: BM rows x BK bf16 = BM x AN nv_bfloat162 = BM x (BK/8) 16B chunks ++ const int A_chunks = BM*(BK/8); ++#pragma unroll 1 ++ for (int idx = tid; idx < A_chunks; idx += THREADS) { ++ const int c = idx % (BK/8); // 16B chunk in the row ++ const int r = idx / (BK/8); // row in tile ++ const nv_bfloat16 * src = Abf + (int64_t)(row0 + r)*K + (int64_t)kt*BK + c*8; ++ w4a16_cp_async16(((char *) sA[st]) + (r*AN + c*4)*sizeof(uint32_t), src); ++ } ++ // W qs: BN rows x 32 bytes = BN x 8 u32 (each block's qs at byte offset 4) ++#pragma unroll 1 ++ for (int idx = tid; idx < BN*8; idx += THREADS) { ++ const int w = idx & 7; // u32 word in the 32-byte qs ++ const int r = idx >> 3; // row in tile ++ const block_nvfp4 * blk = We + (int64_t)(blockCol + r)*Kb + kt; ++ const char * src = ((const char *) blk) + 4 /*d[4]*/ + w*4; ++ w4a16_cp_async4(&sWq[st][r*8 + w], src); ++ } ++ // W scales: BN rows x 4 bytes (one u32 each, the block's d[4] at byte offset 0) ++#pragma unroll 1 ++ for (int r = tid; r < BN; r += 
THREADS) { ++ const block_nvfp4 * blk = We + (int64_t)(blockCol + r)*Kb + kt; ++ w4a16_cp_async4(&sWd[st][r], (const char *) blk); ++ } ++ }; ++ ++ // prologue ++#pragma unroll ++ for (int s = 0; s < STAGES-1; s++) { if (s < Kb) load_tile(s, s); w4a16_cp_commit(); } ++ ++ for (int kt = 0; kt < Kb; kt++) { ++ const int ld = kt + (STAGES-1); ++ if (ld < Kb) load_tile(ld % STAGES, ld); ++ w4a16_cp_commit(); ++ w4a16_cp_wait(); ++ __syncthreads(); ++ ++ const int rs = kt % STAGES; ++ const nv_bfloat162 * sAcur = sA[rs]; ++ const uint8_t * sWqb = (const uint8_t *) sWq[rs]; // BN rows x 32 bytes ++ const uint32_t * sWdw = sWd[rs]; // BN rows x 1 u32 (4 scale bytes) ++ ++#pragma unroll ++ for (int kk = 0; kk < BK/16; kk++) { // 4 m16n8k16 sub-steps per 64-block ++ const int sub = kk; // sub-block (0..3): selects scale + nibble half ++ // A fragments via ldmatrix (bf16) ++ tile_A A_frag[MF]; ++#pragma unroll ++ for (int mi = 0; mi < MF; mi++) { ++ const int rb = wrow*WM + mi*16; ++ load_ldmatrix(A_frag[mi], sAcur + rb*AN + kk*8, AN); ++ } ++ // B fragments: in-register FP4->bf16 dequant (correct-by-construction via tile get_i/get_j) ++ tile_B B_frag[NF]; ++ const int n_local = lane >> 2; // tile_B::get_i (row N, 0..7) ++ const int jc = lane & 3; // lane%4 ++ const int qbyte = sub*8 + 2*jc; // qs byte index for this lane within the block ++#pragma unroll ++ for (int ni = 0; ni < NF; ni++) { ++ const int nrow = wcol*WN + ni*8 + n_local; // col within BN tile [0,BN) ++ const uint8_t * qsb = sWqb + nrow*32; // this row's 32 qs bytes ++ const uint8_t b0 = qsb[qbyte]; ++ const uint8_t b1 = qsb[qbyte + 1]; ++ const float sc = ggml_cuda_ue4m3_to_fp32(((const uint8_t *) &sWdw[nrow])[sub]); ++ // x[0]: low nibbles (k = 2jc, 2jc+1) ++ B_frag[ni].x[0].x = __float2bfloat16(sc * (float) kvalues_mxfp4[b0 & 0x0F]); ++ B_frag[ni].x[0].y = __float2bfloat16(sc * (float) kvalues_mxfp4[b1 & 0x0F]); ++ // x[1]: high nibbles (k = 8+2jc, 9+2jc) ++ B_frag[ni].x[1].x = __float2bfloat16(sc * (float) 
kvalues_mxfp4[b0 >> 4]); ++ B_frag[ni].x[1].y = __float2bfloat16(sc * (float) kvalues_mxfp4[b1 >> 4]); ++ } ++#pragma unroll ++ for (int mi = 0; mi < MF; mi++) ++#pragma unroll ++ for (int ni = 0; ni < NF; ni++) ++ mma(acc[mi][ni], A_frag[mi], B_frag[ni]); ++ } ++ __syncthreads(); ++ } ++ ++ // write back (mask the ragged per-expert row tail) ++#pragma unroll ++ for (int mi = 0; mi < MF; mi++) ++#pragma unroll ++ for (int ni = 0; ni < NF; ni++) { ++ const int orow = wrow*WM + mi*16; ++ const int ocol = blockCol + wcol*WN + ni*8; ++#pragma unroll ++ for (int l = 0; l < acc[mi][ni].ne; l++) { ++ const int lr = orow + acc[mi][ni].get_i(l); // local row within tile ++ const int nc = ocol + acc[mi][ni].get_j(l); // global col ++ if (lr < rcount && nc < N) { ++ C[(int64_t)(row0 + lr)*N + nc] = acc[mi][ni].x[l]; ++ } ++ } ++ } ++#else ++ GGML_UNUSED(Abf); GGML_UNUSED(W0); GGML_UNUSED(C); ++ GGML_UNUSED(g_tile_expert); GGML_UNUSED(g_tile_row0); GGML_UNUSED(g_tile_rows); ++ GGML_UNUSED(N); GGML_UNUSED(K); GGML_UNUSED(expert_stride_blocks); ++ NO_DEVICE_CODE; ++#endif // AMPERE_MMA_AVAILABLE && CP_ASYNC_AVAILABLE ++} ++ ++// =========================================================================== ++// host integration ++// =========================================================================== ++ ++bool ggml_cuda_w4a16_moe_grouped_should_engage( ++ const ggml_tensor * src0, const ggml_tensor * src1, const ggml_tensor * dst, int cc) { ++ if (src0->type != GGML_TYPE_NVFP4) { ++ return false; ++ } ++ if (!blackwell_mma_available(cc)) { ++ return false; ++ } ++ if (!ggml_cuda_w4a16_prefill_enabled()) { ++ return false; // default-off == stock ++ } ++ if (src1->type != GGML_TYPE_F32 || dst->type != GGML_TYPE_F32) { ++ return false; ++ } ++ // ne12 = total tokens (aggregate prefill M); only LARGE M (prefill), never decode. 
++ if (src1->ne[2] <= ggml_cuda_w4a16_prefill_m()) { ++ return false; ++ } ++ const int64_t K = src0->ne[0]; ++ const int64_t N = src0->ne[1]; ++ if (N % 128 != 0 || K % 64 != 0) { ++ return false; // tile constraints; else fall back to per-expert/MMQ ++ } ++ return true; ++} ++ ++void ggml_cuda_mul_mat_id_w4a16_grouped( ++ ggml_backend_cuda_context & ctx, ++ const ggml_tensor * src0, ++ const float * src1_sorted, ++ float * dst_sorted, ++ const int * tokens_per_expert, ++ int64_t n_experts, int64_t K, int64_t N, ++ cudaStream_t stream) { ++ GGML_ASSERT(src0->type == GGML_TYPE_NVFP4); ++ GGML_ASSERT(N % 128 == 0 && K % 64 == 0); ++ ++ constexpr int BM = 64, BN = 128, WARPS_M = 2, WARPS_N = 4, STAGES = 2; ++ ++ // host: build the per-M-tile expert map (ragged, no tile crosses an expert boundary) ++ int64_t total_rows = 0; ++ for (int64_t e = 0; e < n_experts; e++) { ++ total_rows += tokens_per_expert[e]; ++ } ++ if (total_rows == 0) { ++ return; ++ } ++ ++ std::vector<int32_t> h_tile_expert, h_tile_row0, h_tile_rows; ++ int64_t row = 0; ++ for (int64_t e = 0; e < n_experts; e++) { ++ const int t = tokens_per_expert[e]; ++ for (int off = 0; off < t; off += BM) { ++ h_tile_expert.push_back((int32_t) e); ++ h_tile_row0.push_back((int32_t) (row + off)); ++ h_tile_rows.push_back((int32_t) std::min(BM, t - off)); ++ } ++ row += t; ++ } ++ const int n_tiles = (int) h_tile_expert.size(); ++ ++ if (getenv("LLAMA_W4A16_DEBUG")) { ++ int max_tpe = 0, multi = 0; ++ for (int64_t e = 0; e < n_experts; e++) { ++ if (tokens_per_expert[e] > max_tpe) max_tpe = tokens_per_expert[e]; ++ if (tokens_per_expert[e] > BM) multi++; ++ } ++ fprintf(stderr, "[w4a16] engaged: total_rows=%lld n_experts=%lld K=%lld N=%lld n_tiles=%d max_tpe=%d multi_tile_experts=%d\n", ++ (long long) total_rows, (long long) n_experts, (long long) K, (long long) N, n_tiles, max_tpe, multi); ++ } ++ ++ // device: tile map ++ ggml_cuda_pool_alloc<int32_t> d_tile_expert(ctx.pool(), n_tiles); ++ ggml_cuda_pool_alloc<int32_t> d_tile_row0 
(ctx.pool(), n_tiles); ++ ggml_cuda_pool_alloc<int32_t> d_tile_rows (ctx.pool(), n_tiles); ++ CUDA_CHECK(cudaMemcpyAsync(d_tile_expert.ptr, h_tile_expert.data(), n_tiles*sizeof(int32_t), cudaMemcpyHostToDevice, stream)); ++ CUDA_CHECK(cudaMemcpyAsync(d_tile_row0.ptr, h_tile_row0.data(), n_tiles*sizeof(int32_t), cudaMemcpyHostToDevice, stream)); ++ CUDA_CHECK(cudaMemcpyAsync(d_tile_rows.ptr, h_tile_rows.data(), n_tiles*sizeof(int32_t), cudaMemcpyHostToDevice, stream)); ++ ++ // activations: f32 -> bf16 (cheap cast, NO act-quant), zero-padded so every tile's BM-row read ++ // stays in-bounds. A tile's row0 is generally NOT BM-aligned (experts start mid-buffer), and a ++ // tile can begin as late as total_rows-1, so it can read up to total_rows-1+BM; add a full BM of ++ // zero headroom on top of the BM-rounded length to cover that worst case. ++ const int64_t pad_rows = (((total_rows + BM - 1) / BM) + 1) * BM; ++ ggml_cuda_pool_alloc<nv_bfloat16> Abf(ctx.pool(), (size_t) pad_rows * K); ++ { ++ const int64_t n = total_rows * K; ++ const int64_t npad = pad_rows * K; ++ const int threads = 256; ++ const int64_t grid = (npad + threads - 1) / threads; ++ w4a16_cast_act_f32_bf16<<<grid, threads, 0, stream>>>(src1_sorted, Abf.get(), n, npad); ++ CUDA_CHECK(cudaGetLastError()); ++ } ++ ++ const int64_t expert_stride_blocks = (int64_t) (src0->nb[2] / sizeof(block_nvfp4)); ++ ++ auto kern = w4a16_grouped_kernel; ++ constexpr int STAGE_U32 = BM*(64/2) + BN*8 + BN; ++ const int smem_bytes = STAGES * STAGE_U32 * (int) sizeof(uint32_t); ++ CUDA_CHECK(cudaFuncSetAttribute(kern, cudaFuncAttributeMaxDynamicSharedMemorySize, smem_bytes)); ++ ++ dim3 grid((unsigned) (N / BN), (unsigned) n_tiles); ++ dim3 block(32, WARPS_M*WARPS_N); // 2D: threadIdx.x = warp lane, threadIdx.y = warp ++ kern<<<grid, block, smem_bytes, stream>>>( ++ Abf.get(), (const block_nvfp4 *) src0->data, dst_sorted, ++ d_tile_expert.ptr, d_tile_row0.ptr, d_tile_rows.ptr, ++ (int) N, (int) K, expert_stride_blocks); ++ CUDA_CHECK(cudaGetLastError()); ++} +diff --git 
a/ggml/src/ggml-cuda/w4a16-gemm.cuh b/ggml/src/ggml-cuda/w4a16-gemm.cuh +new file mode 100644 +index 0000000..2287d6f +--- /dev/null ++++ b/ggml/src/ggml-cuda/w4a16-gemm.cuh +@@ -0,0 +1,55 @@ ++#pragma once ++ ++#include "common.cuh" ++ ++// [paged patch 0035] Marlin-style W4A16 GROUPED MoE prefill GEMM for Blackwell sm_121a (GB10). ++// ++// This is the profile-validated #2 prefill lever and a DISTINCT kernel from the two prefill ++// rejects: ++// - NOT patch 0033 (separate-pass dequant -> bf16 cuBLAS / nvjet): that pays a full per-step ++// weight dequant + 4x bf16 weight traffic and lost to FP4-MMQ at large M. ++// - NOT patch 0034 (native W4A4 FP4-MMA, mxf4nvf4 block-scale OMMA): that quantizes the ++// activations to FP4 and so still pays the quantize_mmq_nvfp4 activation-quant tax. ++// ++// The winning shape vLLM uses on this silicon (Marlin W4A16): the FP4 expert weights are ++// dequantized to bf16 IN REGISTERS right before the MMA (never materialized to global/smem as ++// bf16), the activations stay bf16 (a cheap f32->bf16 cast, NO per-block FP4 amax/code-search ++// quantize), and the product is a standard bf16 m16n8k16 mma.sync feeding f32 accumulators, ++// cp.async multistage-pipelined over the K loop. So W4A16 pays ZERO activation-quant (the paged ++// FP4-MMQ path's quantize_mmq_nvfp4 is +15 us/tok) and the GEMM runs as a bf16 tensor-core GEMM ++// with the weight read at 4 bits. ++// ++// GROUPED: the kernel is launched ONCE over the whole mul_mat_id token-sorted activation buffer ++// (src1_sorted is already sorted-by-expert by the existing host-loop), with a per-M-tile expert ++// map so each output tile reads its expert's weight matrix (src0 + expert*nb02) and the ragged ++// per-expert row tail is masked. No per-expert kernel launch, no per-expert M-padding waste. 
++// ++// Engages ONLY at large aggregate-M (prefill), behind LLAMA_W4A16_PREFILL_M (default 0 == OFF ++// == stock); decode (small ne12) and the non-MoE / non-NVFP4 paths are byte-untouched. The bf16 ++// tiles are mma.cuh's (mma.sync.aligned.m16n8k16.row.col.f32.bf16.bf16.f32). ++ ++// True if the grouped W4A16 path should handle this mul_mat_id: ++// src0 NVFP4, src1 f32, dst f32, Blackwell, LLAMA_W4A16_PREFILL_M>0, ++// ne12 (total tokens / aggregate prefill M) > threshold, N=ne0 % 128 == 0, K=ne10 % 64 == 0. ++bool ggml_cuda_w4a16_moe_grouped_should_engage( ++ const ggml_tensor * src0, const ggml_tensor * src1, const ggml_tensor * dst, int cc); ++ ++// True iff LLAMA_W4A16_PREFILL_M > 0 (the master on/off for the mmq.cu grouped-MMQ-off gate). ++bool ggml_cuda_w4a16_prefill_enabled(); ++int64_t ggml_cuda_w4a16_prefill_m(); ++ ++// Grouped W4A16 MoE GEMM over the token-sorted buffer. ++// src0 : NVFP4 weights [K, N, n_experts] (one [K,N] matrix per expert) ++// src1_sorted : f32 [K, total_rows], rows already sorted by expert (the mul_mat_id host-loop's ++// src1_sorted), with tokens_per_expert[e] consecutive rows per expert e ++// dst_sorted : f32 [N, total_rows], written in the same sorted order ++// tokens_per_expert : host vector, length n_experts ++// Streams on `stream`, pool-allocates scratch (bf16 activations + device tile map); no host sync. 
++void ggml_cuda_mul_mat_id_w4a16_grouped( ++ ggml_backend_cuda_context & ctx, ++ const ggml_tensor * src0, ++ const float * src1_sorted, ++ float * dst_sorted, ++ const int * tokens_per_expert, ++ int64_t n_experts, int64_t K, int64_t N, ++ cudaStream_t stream); +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0040-feat-paged-S1-paged-decode-graph-reuse-across-servin.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0040-feat-paged-S1-paged-decode-graph-reuse-across-servin.patch new file mode 100644 index 000000000000..a3ded5d10994 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0040-feat-paged-S1-paged-decode-graph-reuse-across-servin.patch @@ -0,0 +1,322 @@ +From b81fa71360c3f6b46e97c6ad504efc10bdaea484 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Sun, 28 Jun 2026 20:00:04 +0200 +Subject: [PATCH 40/41] feat(paged): S1 paged decode-graph reuse across serving + steps (patch 0040) + +The continuous-serving decode gap (paged ~3.7 vs vLLM ~5.9 tok/s/seq) is +host-bound: llama-context layer-A graph reuse was 0% in serving, so the host +rebuilt the ggml graph EVERY decode step (the +1.85 ms/step the Phase-0 profile +attributes to the rebuild; set_inputs/block-table are negligible). Root cause: +the paged decode inputs (input_block_table / input_gather_idxs in paged-attn.cpp) +never overrode llm_graph_input_i::can_reuse, which defaults to false - so any +graph carrying a paged input could never be reused, even with a constant batch +shape. (This is also why the paged decode graph rebuilt in static batched-bench.) + +S1 gives the paged inputs a correct can_reuse: + - reuse iff the input tensor dims are unchanged. The block table is + [n_view, n_stream] with n_view = PAD(n_gather, 256) clamped to n_kv, so it is + bucketed to 256 and stays constant across a 256-token decode window; n_stream + follows n_seqs_unq. The index CONTENTS are refilled at set_input on every step + (incl. 
reused steps), so a reused graph reads the current step's cells. + - the stored kv-cache context is refreshed from the owning attn input + (llm_graph_input_attn_kv, whose mctx is updated per-decode by attn_kv / + mem_hybrid can_reuse earlier in the input list), so a reused graph picks up the + live memory context. mem_hybrid::can_reuse now also refreshes inp_attn->mctx. + +Master switch paged_attn::decode_graph_reuse() (ON by default when paged; +LLAMA_PAGED_NO_GRAPH_REUSE=1 forces the pre-S1 rebuild-every-step path for A/B). +Also surfaces the run-wide graph-reuse rate in the [L5INSTR] exit line +(l5_add_proc) since llama-server does not print llama_perf. + +BIT-EXACT: greedy md5 byte-identical with reuse ON vs OFF on every path - +dense 5951a5b4d624ce891e22ab5fca9bc439, paged-MoE 8cb0ce23777bf55f92f63d0292c756b0. +Reuse only skips the host-side rebuild; set_inputs still re-runs every step. + +Measured (GB10): batched-bench paged decode graph reuse 0% -> 95.5% (hostproc +dense 3.31->2.66, MoE 2.44->1.82 ms/step); static throughput flat as expected +(static regime is GPU-bound). The serving payoff needs S3 (patch 0041): S1 alone +holds only 13.8% reuse in serving because co-batched prefill churns the shape +every step. 
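
A small standalone sketch of the 256-bucketing argument above, for illustration only (it is not code from the patch). `pad_to` is a hypothetical stand-in for ggml's `GGML_PAD`, `block_table_n_view` mirrors the n_view rule as this message describes it, and `count_shape_changes` is an invented helper that counts how often the block-table dimension would change (and so force a graph rebuild) over a range of decode steps.

```cpp
#include <cstdint>

// Hypothetical stand-in for GGML_PAD: round x up to the next multiple of n.
static int64_t pad_to(int64_t x, int64_t n) {
    return ((x + n - 1) / n) * n;
}

// n_view as described in the commit message: PAD(n_gather, 256) clamped to n_kv.
static int64_t block_table_n_view(int64_t n_gather, int64_t n_kv) {
    if (n_gather <= 0) {
        return 0;
    }
    const int64_t n_view = pad_to(n_gather, 256);
    return n_view > n_kv ? n_kv : n_view;
}

// Illustrative helper (an assumption, not in the patch): number of times n_view
// changes while n_gather grows by one token per step from `first` to `last`.
static int count_shape_changes(int64_t first, int64_t last, int64_t n_kv) {
    int     changes = 0;
    int64_t prev    = -1;
    for (int64_t n_gather = first; n_gather <= last; n_gather++) {
        const int64_t cur = block_table_n_view(n_gather, n_kv);
        if (cur != prev) {
            changes++;
            prev = cur;
        }
    }
    return changes;
}
```

With `n_kv = 4096`, growing `n_gather` from 1 to 1024 yields only four distinct n_view values (256, 512, 768, 1024), which is why a steady decode can reuse the graph for up to 256 consecutive steps when nothing else in the batch shape changes.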
+ +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + src/llama-context.cpp | 3 ++ + src/llama-graph.cpp | 13 +++++- + src/paged-attn.cpp | 94 ++++++++++++++++++++++++++++++++++++++----- + src/paged-attn.h | 14 +++++++ + 4 files changed, 112 insertions(+), 12 deletions(-) + +diff --git a/src/llama-context.cpp b/src/llama-context.cpp +index c408eef..306a506 100644 +--- a/src/llama-context.cpp ++++ b/src/llama-context.cpp +@@ -1347,6 +1347,7 @@ bool llama_context::set_adapter_cvec( + + extern "C" void l5_add_setinp(double ns); + extern "C" void l5_add_hostproc(double ns); ++extern "C" void l5_add_proc(int reused); // [S1] per-step graph-reuse counter + static inline double l5c_now_ns(){ struct timespec ts; clock_gettime(CLOCK_MONOTONIC,&ts); return (double)ts.tv_sec*1e9+(double)ts.tv_nsec; } + llm_graph_result * llama_context::process_ubatch(const llama_ubatch & ubatch, llm_graph_type gtype, llama_memory_context_i * mctx, ggml_status & ret) { + double _l5_t0=l5c_now_ns(); +@@ -1374,7 +1375,9 @@ llm_graph_result * llama_context::process_ubatch(const llama_ubatch & ubatch, ll + } + + n_reused++; ++ l5_add_proc(1); + } else { ++ l5_add_proc(0); + res->reset(); + + ggml_backend_sched_reset(sched.get()); +diff --git a/src/llama-graph.cpp b/src/llama-graph.cpp +index 931258d..0337742 100644 +--- a/src/llama-graph.cpp ++++ b/src/llama-graph.cpp +@@ -699,6 +699,12 @@ bool llm_graph_input_mem_hybrid::can_reuse(const llm_graph_params & params) { + + this->mctx = mctx; + ++ // [S1] refresh the attn sub-input's memory context so paged decode inputs ++ // (which read owner->mctx in their can_reuse, run later in the input list) ++ // pick up the live per-decode context on a reused graph. Harmless for the ++ // non-paged path: inp_attn->mctx is only consumed at graph-build time there. 
++ inp_attn->mctx = mctx->get_attn(); ++ + bool res = true; + + res &= inp_attn->self_k_idxs->ne[0] == params.ubatch.n_tokens; +@@ -2370,8 +2376,11 @@ ggml_tensor * llm_graph_context::build_attn( + ggml_tensor * kq_mask_g = kq_mask; + ggml_tensor * block_table = nullptr; + const bool is_decode = (q_cur->ne[2] == k->ne[3]); // 1 query token per stream +- if (!(is_decode && paged_attn::in_kernel_decode(ctx0, res, mctx_cur, &k, &v, &kq_mask_g, &block_table))) { +- paged_attn::gather(ctx0, res, mctx_cur, &k, &v, &kq_mask_g); ++ // [S1] pass `inp` (the attn input) as the reuse owner: its mctx is refreshed ++ // per-decode by attn_kv/mem_hybrid can_reuse, and the paged inputs read it so ++ // a reused graph picks up the live memory context. ++ if (!(is_decode && paged_attn::in_kernel_decode(ctx0, res, mctx_cur, inp, &k, &v, &kq_mask_g, &block_table))) { ++ paged_attn::gather(ctx0, res, mctx_cur, inp, &k, &v, &kq_mask_g); + } + + ggml_tensor * cur = build_attn_mha(q, k, v, kq_b, kq_mask_g, sinks, v_mla, kq_scale, il, block_table); +diff --git a/src/paged-attn.cpp b/src/paged-attn.cpp +index ebd92be..d543c7f 100644 +--- a/src/paged-attn.cpp ++++ b/src/paged-attn.cpp +@@ -11,9 +11,13 @@ + #include + namespace { static inline double l5_now_ns(){ struct timespec ts; clock_gettime(CLOCK_MONOTONIC,&ts); return (double)ts.tv_sec*1e9+(double)ts.tv_nsec; } } + double g_l5_t_gbt=0, g_l5_t_setinp=0, g_l5_t_hostproc=0; long g_l5_n_gbt=0, g_l5_n_setinp=0, g_l5_n_hostproc=0; ++// [S1] graph-reuse counters across the whole run (the serving reuse-rate signal - ++// llama-server does not print llama_perf, so surface it here at process exit). 
++long g_l5_n_proc=0, g_l5_n_reused=0; + extern "C" void l5_add_setinp(double ns){ g_l5_t_setinp+=ns; g_l5_n_setinp++; } + extern "C" void l5_add_hostproc(double ns){ g_l5_t_hostproc+=ns; g_l5_n_hostproc++; } +-namespace { struct L5Printer { ~L5Printer(){ fprintf(stderr,"[L5INSTR] get_block_table n=%ld sum=%.2fms mean=%.4fms | set_inputs n=%ld sum=%.2fms mean=%.4fms | hostproc n=%ld sum=%.2fms mean=%.4fms\n", g_l5_n_gbt, g_l5_t_gbt/1e6, g_l5_n_gbt? g_l5_t_gbt/1e6/g_l5_n_gbt:0.0, g_l5_n_setinp, g_l5_t_setinp/1e6, g_l5_n_setinp? g_l5_t_setinp/1e6/g_l5_n_setinp:0.0, g_l5_n_hostproc, g_l5_t_hostproc/1e6, g_l5_n_hostproc? g_l5_t_hostproc/1e6/g_l5_n_hostproc:0.0 ); } } g_l5_printer; } ++extern "C" void l5_add_proc(int reused){ g_l5_n_proc++; if (reused) g_l5_n_reused++; } ++namespace { struct L5Printer { ~L5Printer(){ fprintf(stderr,"[L5INSTR] get_block_table n=%ld sum=%.2fms mean=%.4fms | set_inputs n=%ld sum=%.2fms mean=%.4fms | hostproc n=%ld sum=%.2fms mean=%.4fms | graph_reuse %ld/%ld = %.1f%%\n", g_l5_n_gbt, g_l5_t_gbt/1e6, g_l5_n_gbt? g_l5_t_gbt/1e6/g_l5_n_gbt:0.0, g_l5_n_setinp, g_l5_t_setinp/1e6, g_l5_n_setinp? g_l5_t_setinp/1e6/g_l5_n_setinp:0.0, g_l5_n_hostproc, g_l5_t_hostproc/1e6, g_l5_n_hostproc? g_l5_t_hostproc/1e6/g_l5_n_hostproc:0.0, g_l5_n_reused, g_l5_n_proc, g_l5_n_proc? 100.0*g_l5_n_reused/g_l5_n_proc:0.0 ); } } g_l5_printer; } + + + namespace paged_attn { +@@ -28,17 +32,52 @@ static bool debug() { + return d; + } + ++// [S1] paged decode-graph reuse master switch. ON by default whenever paging is ++// active; LLAMA_PAGED_NO_GRAPH_REUSE=1 forces it off (A/B probe / safety hatch). ++bool decode_graph_reuse() { ++ static const bool on = active() && (std::getenv("LLAMA_PAGED_NO_GRAPH_REUSE") == nullptr); ++ return on; ++} ++ + namespace { + ++// [S1] Recompute the block-table view length the SAME way in_kernel_decode() ++// builds it, so can_reuse() can compare against the stored tensor dim. 
n_view is ++// PAD(n_gather,256) clamped to the physical window n_kv: it only changes when ++// n_gather crosses a 256 boundary, so a steady decode reuses across many steps. ++static inline int64_t paged_block_table_n_view(const llama_kv_cache_context * mctx) { ++ const int64_t n_gather = (int64_t) mctx->get_n_gather(); ++ if (n_gather <= 0) { ++ return 0; ++ } ++ int64_t n_view = GGML_PAD(n_gather, 256); ++ const int64_t n_kv = (int64_t) mctx->get_n_kv(); ++ if (n_view > n_kv) { ++ n_view = n_kv; ++ } ++ return n_view; ++} ++ ++// [S1] Number of attention streams the paged inputs build over - matches K->ne[3] ++// at build time and the n_stream used by can_reuse_kq_mask in llama-graph.cpp. ++static inline int64_t paged_n_stream(const llm_graph_params & params) { ++ return params.cparams.kv_unified ? 1 : (int64_t) params.ubatch.n_seqs_unq; ++} ++ + // Graph input that, at set_input time, fills an I32 [n_gather, n_stream] tensor + // with each stream's non-empty cell indices (position-sorted, padded with a +-// masked/empty cell) by delegating to the kv-cache context. Private to this +-// unit; default can_reuse()==false keeps the graph from being reused across +-// decodes (n_gather grows every step). ++// masked/empty cell) by delegating to the kv-cache context. Private to this unit. ++// ++// [S1] can_reuse: the graph topology depends only on the tensor SHAPE ++// [n_gather, n_stream] - the index CONTENTS are refilled at set_input every step, ++// so they need not match. n_gather is UNPADDED here (the gather path is used for ++// prefill / transposed-V fallback), so it grows every decode and reuse rarely ++// holds - correct and harmless. mctx is refreshed from the owning attn input ++// (whose mctx is updated by attn_kv/mem_hybrid can_reuse earlier in the input list). 
+ class input_gather_idxs : public llm_graph_input_i { + public: +- input_gather_idxs(const llama_kv_cache_context * mctx, ggml_tensor * idxs) +- : mctx(mctx), idxs(idxs) {} ++ input_gather_idxs(const llama_kv_cache_context * mctx, const llm_graph_input_attn_kv * owner, ggml_tensor * idxs) ++ : mctx(mctx), owner(owner), idxs(idxs) {} + + void set_input(const llama_ubatch * ubatch) override { + GGML_UNUSED(ubatch); +@@ -46,17 +85,37 @@ public: + mctx->get_gather_idxs((int32_t *) idxs->data); + } + ++ bool can_reuse(const llm_graph_params & params) override { ++ if (!owner || !paged_attn::decode_graph_reuse()) { ++ return false; ++ } ++ mctx = owner->mctx; // refresh to the live per-decode context ++ const int64_t n_gather = (int64_t) mctx->get_n_gather(); ++ if (n_gather <= 0) { ++ return false; ++ } ++ return idxs->ne[0] == n_gather && idxs->ne[1] == paged_n_stream(params); ++ } ++ + const llama_kv_cache_context * mctx; ++ const llm_graph_input_attn_kv * owner; + ggml_tensor * idxs; + }; + + // Block table filler for the in-kernel paged read: fills an I32 [n_blk, n_stream] + // tensor with each stream's position-ordered cells, padded to n_blk (per column) + // with a masked empty cell, by delegating to the kv-cache context. ++// ++// [S1] can_reuse: reuse iff the block-table tensor dims [n_view, n_stream] are ++// unchanged - n_view is bucketed to 256 (paged_block_table_n_view), so the decode ++// graph reuses across every step within a 256-token window. The table CONTENTS ++// are refilled at set_input on every step (incl. reused steps), so the reused ++// graph reads the current step's cells. mctx is refreshed from the owning attn ++// input so the reused graph's set_input/get_block_table uses the live context. 
+ class input_block_table : public llm_graph_input_i { + public: +- input_block_table(const llama_kv_cache_context * mctx, ggml_tensor * idxs, uint32_t n_blk) +- : mctx(mctx), idxs(idxs), n_blk(n_blk) {} ++ input_block_table(const llama_kv_cache_context * mctx, const llm_graph_input_attn_kv * owner, ggml_tensor * idxs, uint32_t n_blk) ++ : mctx(mctx), owner(owner), idxs(idxs), n_blk(n_blk) {} + + void set_input(const llama_ubatch * ubatch) override { + GGML_UNUSED(ubatch); +@@ -66,7 +125,20 @@ public: + g_l5_t_gbt += l5_now_ns()-_t; g_l5_n_gbt++; + } + ++ bool can_reuse(const llm_graph_params & params) override { ++ if (!owner || !paged_attn::decode_graph_reuse()) { ++ return false; ++ } ++ mctx = owner->mctx; // refresh to the live per-decode context ++ const int64_t n_view = paged_block_table_n_view(mctx); ++ if (n_view <= 0 || n_view != (int64_t) n_blk) { ++ return false; ++ } ++ return idxs->ne[0] == n_view && idxs->ne[1] == paged_n_stream(params); ++ } ++ + const llama_kv_cache_context * mctx; ++ const llm_graph_input_attn_kv * owner; + ggml_tensor * idxs; + uint32_t n_blk; + }; +@@ -76,6 +148,7 @@ public: + void gather(ggml_context * ctx0, + llm_graph_result * res, + const llama_kv_cache_context * mctx, ++ const llm_graph_input_attn_kv * owner, + ggml_tensor ** k, + ggml_tensor ** v, + ggml_tensor ** kq_mask) { +@@ -114,7 +187,7 @@ void gather(ggml_context * ctx0, + // n_stream, so column s gathers from stream s of the source. 
+ ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_gather, n_stream); + ggml_set_input(idx); +- res->add_input(llm_graph_input_ptr(new input_gather_idxs(mctx, idx))); ++ res->add_input(llm_graph_input_ptr(new input_gather_idxs(mctx, owner, idx))); + + // --- gather K: collapse (head_dim, n_head) so cells become the row axis --- + { +@@ -156,6 +229,7 @@ void gather(ggml_context * ctx0, + bool in_kernel_decode(ggml_context * ctx0, + llm_graph_result * res, + const llama_kv_cache_context * mctx, ++ const llm_graph_input_attn_kv * owner, + ggml_tensor ** k, + ggml_tensor ** v, + ggml_tensor ** kq_mask, +@@ -221,7 +295,7 @@ bool in_kernel_decode(ggml_context * ctx0, + + ggml_tensor * idx = ggml_new_tensor_2d(ctx0, GGML_TYPE_I32, n_view, n_stream); + ggml_set_input(idx); +- res->add_input(llm_graph_input_ptr(new input_block_table(mctx, idx, (uint32_t) n_view))); ++ res->add_input(llm_graph_input_ptr(new input_block_table(mctx, owner, idx, (uint32_t) n_view))); + + // Present K and V as [d, h, n_view, ns] VIEWS of the full physical window: + // identical per-cell (nb1,nb2) and per-stream (nb3) strides, only the cell +diff --git a/src/paged-attn.h b/src/paged-attn.h +index 23e2184..fafe821 100644 +--- a/src/paged-attn.h ++++ b/src/paged-attn.h +@@ -21,18 +21,31 @@ struct ggml_context; + struct ggml_tensor; + class llm_graph_result; + class llama_kv_cache_context; ++class llm_graph_input_attn_kv; + + namespace paged_attn { + + // true iff env LLAMA_KV_PAGED is set (evaluated once). + bool active(); + ++// [S1] true iff the paged decode-graph reuse (layer-A can_reuse on the paged ++// inputs) is ENABLED. Default ON when active(); LLAMA_PAGED_NO_GRAPH_REUSE=1 ++// forces it off (A/B probe / safety hatch). When off the paged inputs keep the ++// stock default can_reuse()==false, i.e. the pre-S1 behaviour (rebuild every ++// step). Bit-exact either way - reuse only skips the host-side graph rebuild, ++// set_inputs still re-runs every step. 
++bool decode_graph_reuse(); ++ + // Gather K, V and the kq_mask down to the current sequence's non-empty cells. + // No-op (returns immediately) unless active(). On return *k, *v and *kq_mask + // point at the compacted tensors; pass them straight to build_attn_mha. ++// `owner` is the attention input that owns the live (per-decode-refreshed) memory ++// context; the paged input reads owner->mctx in can_reuse so a reused graph picks ++// up the fresh context (see input_gather_idxs::can_reuse). May be null (no reuse). + void gather(ggml_context * ctx0, + llm_graph_result * res, + const llama_kv_cache_context * mctx, ++ const llm_graph_input_attn_kv * owner, + ggml_tensor ** k, + ggml_tensor ** v, + ggml_tensor ** kq_mask); +@@ -50,6 +63,7 @@ void gather(ggml_context * ctx0, + bool in_kernel_decode(ggml_context * ctx0, + llm_graph_result * res, + const llama_kv_cache_context * mctx, ++ const llm_graph_input_attn_kv * owner, + ggml_tensor ** k, + ggml_tensor ** v, + ggml_tensor ** kq_mask, +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0041-feat-paged-S3-decode-shape-stable-scheduling-patch-0.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0041-feat-paged-S3-decode-shape-stable-scheduling-patch-0.patch new file mode 100644 index 000000000000..801c2c574a5e --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0041-feat-paged-S3-decode-shape-stable-scheduling-patch-0.patch @@ -0,0 +1,114 @@ +From ddff2279f23f18cadfbbb907a397d66b3609e9cd Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Sun, 28 Jun 2026 20:00:24 +0200 +Subject: [PATCH 41/41] feat(paged): S3 decode-shape-stable scheduling (patch + 0041) + +The S1 paged decode-graph reuse (patch 0040) is necessary but not sufficient in +continuous serving: with cont_batching a co-batched prefill chunk inflates the +step from n_tokens==D (pure decode) to D+P, which changes the ubatch shape and +breaks llama-context layer-A reuse on (nearly) every step. 
Measured: S1 alone +holds only 13.8% graph reuse in a 128-client serving load. + +S3 makes the scheduler EMIT graph-reusable steps to match what S1 makes reusable. +While there is live decode load it runs PURE-decode steps (skip Phase-2 prompt +admission) so the decode batch shape stays constant, and admits a prefill chunk +only on a bounded cadence (every LLAMA_PAGED_PREFILL_PERIOD steps, default 8) or +when no decode is active. The deferred prefill chunk still runs within at most +(period-1) decode steps, so prompt latency rises by a bounded amount. + +Pure policy change inside update_slots(), built on the patch-0016 decode-first +budget; no new slot states, no batch-formation rewrite, zero libllama changes. + +BIT-EXACT: only changes WHICH step a prompt chunk is admitted in. Each sequence's +decode logits depend on its own tokens + its own KV only (the paged decode read is +per-stream, attention is permutation-invariant over the co-batched set), so +deferring another slot's prefill never changes a generating slot's output. Does +not run in the single-sequence greedy md5 gate (that path is llama-completion). + +DEFAULT-OFF (A/B finding): a measured end-to-end A/B proved that making S3 +default-on under paged KV is a serving mistake. Deferring prefill admission on the +period-8 cadence defers prompt admission: 2.5x worse TTFT (60s vs 24s at N=256) +and 20-29% lower end-to-end throughput, with no end-to-end win at any concurrency. +Its apparent decode_agg gain was a metric artifact (faster per-step decode bought +by starving prefill). So S3 now defaults OFF (prefer prompt prefill admission for +good TTFT) and is opt-in via LLAMA_PAGED_DECODE_STABLE=1, intended only for +decode-dominated, low-arrival traffic where TTFT is not a concern. With +LLAMA_PAGED_DECODE_STABLE unset => byte-identical to patch 0016. 
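
The admission cadence described above can be modelled in isolation. The following is a minimal sketch under stated assumptions, not the server code: `DecodeStableGate` is a hypothetical type, and the `enabled` and `period` fields stand in for the values the patch reads from LLAMA_PAGED_DECODE_STABLE and LLAMA_PAGED_PREFILL_PERIOD.

```cpp
// Hypothetical standalone model of the S3 policy: while decode load is present,
// admit prompt chunks only on every `period`-th step and run pure-decode steps
// in between.
struct DecodeStableGate {
    bool enabled;  // stands in for LLAMA_PAGED_DECODE_STABLE (default off)
    int  period;   // stands in for LLAMA_PAGED_PREFILL_PERIOD (default 8)
    long step;     // counts only steps that carried decode load

    DecodeStableGate(bool enabled_, int period_)
        : enabled(enabled_), period(period_ > 0 ? period_ : 8), step(0) {}

    // Returns true when this step should skip prompt admission.
    bool decode_only_step(int n_decode_in_batch) {
        if (!enabled || n_decode_in_batch <= 0) {
            return false;  // gate off, or no decode load: admit prefill as usual
        }
        const bool prefill_due = (step % period) == 0;
        step++;
        return !prefill_due;
    }
};
```

With the gate enabled and a period of 8, the first step with decode load admits prefill and the next seven are pure decode, so a waiting prompt chunk is deferred by at most seven decode steps. That bounded deferral is the source of the TTFT cost reported in the A/B above.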
+ +Measured (GB10, MoE Qwen3.6-35B-A3B-NVFP4, 128-client staggered streaming load): +S1+S3 vs baseline (graphs rebuilt every step): graph reuse 0% -> 72.2%, hostproc +15.98 -> 6.31 ms/step, decode 4.05 -> 5.52 tok/s/seq median (4.24 -> 5.96 mean, +at vLLM's ~5.9 sustained). NOTE these are per-step decode metrics; the A/B above +shows they do not translate to an end-to-end serving win, hence default-off. + +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + tools/server/server-context.cpp | 46 ++++++++++++++++++++++++++++++++- + 1 file changed, 45 insertions(+), 1 deletion(-) + +diff --git a/tools/server/server-context.cpp b/tools/server/server-context.cpp +index 64775dc..a77e267 100644 +--- a/tools/server/server-context.cpp ++++ b/tools/server/server-context.cpp +@@ -3138,11 +3138,55 @@ private: + } + int32_t n_prompt_budgeted = 0; // prompt tokens added to the batch this step (across slots) + ++ // PAGED serving lever (patch 0041, S3): decode-shape-stable scheduling. ++ // Pairs with the S1 paged decode-graph reuse (patch 0040): S1 makes a ++ // pure-decode step graph-reusable, S3 makes the scheduler EMIT pure-decode ++ // steps. With continuous batching a co-batched prefill chunk inflates the ++ // step from n_tokens==D (pure decode) to D+P, which changes the ubatch ++ // shape and breaks layer-A graph reuse on EVERY step. S3 keeps prefill out ++ // of the decode step: while there is live decode load it runs pure-decode ++ // steps (reuse holds) and admits a prefill chunk only on a bounded cadence ++ // (every LLAMA_PAGED_PREFILL_PERIOD steps, default 8) or when no decode is ++ // active. The deferred prefill chunk still runs within a few steps, so ++ // prompt latency rises by at most (period-1) decode steps. ++ // ++ // BIT-EXACT: this only changes WHICH step a prompt chunk is admitted in. 
++ // Each sequence's decode logits depend on its own tokens + its own KV only ++ // (the paged decode read is per-stream, attention is permutation-invariant ++ // over the co-batched set), so deferring another slot's prefill never ++ // changes a generating slot's output. Does not run in the single-sequence ++ // greedy md5 gate (that path is llama-completion, not update_slots). ++ // ++ // DEFAULT-OFF (A/B finding): an end-to-end A/B proved S3-on is a serving ++ // mistake. Deferring prefill admission on the period-8 cadence delays prompt ++ // admission: 2.5x worse TTFT (60s vs 24s at N=256) and 20-29% lower end-to-end ++ // throughput, with no end-to-end win at any concurrency. Its apparent ++ // decode_agg gain was a metric artifact (faster per-step decode bought by ++ // starving prefill). So the default prefers prompt prefill admission for good ++ // TTFT; S3 is opt-in (LLAMA_PAGED_DECODE_STABLE=1) only for decode-dominated, ++ // low-arrival traffic where TTFT is not a concern. ++ bool decode_only_step = false; ++ { ++ static const int s3_enabled = [](){ ++ const char * e = getenv("LLAMA_PAGED_DECODE_STABLE"); ++ return e ? atoi(e) : 0; // default OFF; opt-in via LLAMA_PAGED_DECODE_STABLE=1 ++ }(); ++ if (s3_enabled && n_decode_in_batch > 0) { ++ static const int s3_period = [](){ const char * e = getenv("LLAMA_PAGED_PREFILL_PERIOD"); int p = e ? atoi(e) : 8; return p > 0 ? 
p : 8; }(); ++ static long s3_step = 0; ++ const bool prefill_due = (s3_step % s3_period) == 0; ++ s3_step++; ++ decode_only_step = !prefill_due; ++ } ++ } ++ + auto & alora_scale = batch.alora_scale; + auto & alora_disabled_id = batch.alora_disabled_id; + + // next, batch any pending prompts without exceeding n_batch +- if (params_base.cont_batching || batch.size() == 0) { ++ // (patch 0041, S3) skip prompt admission on a pure-decode step to keep the ++ // decode batch shape reuse-stable ++ if ((params_base.cont_batching || batch.size() == 0) && !decode_only_step) { + bool add_ok = true; // false means the batch is full, skip remaining slots + + iterate(slots, [&](server_slot & slot) { +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch new file mode 100644 index 000000000000..4baac0e4c6a4 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0042-feat-paged-fused-residual-add-RMS-norm-weight-multip.patch @@ -0,0 +1,365 @@ +From 1434cf7e078217c625062dcfde4fa91cf487ee86 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Sun, 28 Jun 2026 20:19:31 +0200 +Subject: [PATCH] feat(paged): fused residual-add + RMS norm + weight multiply + (patch 0042) + +The transformer pre-norm residual chain `h = x + sub_out; n = rms_norm(h) * w` +runs as separate CUDA launches in the paged prefill graph: a k_bin_bcast ADD +(the residual) feeding the existing fused rms_norm+mul. ggml-cuda already fuses +rms_norm+mul (and rms_norm+mul+ADD, where the ADD is a *post*-norm bias) but NOT +the *pre*-norm residual add that feeds the norm. This is the classic add-RMSNorm +fusion (as in vLLM / TensorRT-LLM) that ggml-cuda lacks; it is part of the +unfused-tail prefill gap vs vLLM's torch.compile fusions. 
+ +Add it as a CUDA-family graph fusion (paged series owns it; stock stays pure): +- ggml_cuda_can_fuse recognizes { ADD, RMS_NORM, MUL } via ggml_can_fuse_subgraph + with BOTH the ADD (node_idx) and the MUL (node_idx+2) marked as outputs - the + residual ADD has a second consumer (the later skip-connection add), so it + cannot pass the single-use ggml_can_fuse() gate the other rms_norm fusions use. +- New kernel rms_norm_pre_add_mul_f32 computes h = a + b, publishes h to the + residual buffer (downstream skip add reads it), then sum(h^2) -> scale -> + dst = scale * h * w in ONE launch, emitting BOTH outputs the graph needs. +- Gated by LLAMA_FUSE_ADD_RMSNORM (default ON) for a clean single-build A/B. + +BIT-EXACT (per-path canonical greedy md5, n=48 --temp 0 --seed 1, paged): + dense q36-27b-nvfp4 : 5951a5b4d624ce891e22ab5fca9bc439 (ON == OFF == canonical) + MoE q36-35b-a3b : 8cb0ce23777bf55f92f63d0292c756b0 (ON == OFF == canonical) +The fused kernel reproduces the exact FP order of the unfused chain: h = a + b +(IEEE add is order-free), the sum(h^2) reduction uses the same block_reduce +with the same 256/1024 block-size thresholds, and the same rsqrtf(mean+eps) +scale, so the byte stream is unchanged. test-backend-ops RMS_NORM/ADD/MUL pass +(CUDA0 vs CPU). + +PROFILE (dense prefill, nsys --cuda-graph-trace=node, npp512 ntg4 npl8): + rms_norm_f32<1024> 903 launches / 96.6M ns -> 7 / 0.7M ns + k_bin_bcast 1232 launches / 138.6M ns -> 336 / 1.0M ns + rms_norm_pre_add_mul (new) 896 launches / 187.2M ns + -> 896 residual-add + 896 rms_norm launches folded into 896 fused launches; + the norm+residual slice 233.6M -> 187.2M ns (~20% of that slice, ~1% of + total prefill GPU time). +S_PP dense (npp512 ntg4 npl32, 3x): 985.5 -> 990.6 t/s (+0.5%, every ON run +beats every OFF run). Modest because the residual tail is a small slice of +prefill; the dominant unfused cost is k_bin_bcast (11%, the GDN +chunked-prefill gating muls) - a separate lever. 
+ +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + ggml/src/ggml-cuda/ggml-cuda.cu | 54 +++++++++ + ggml/src/ggml-cuda/norm.cu | 196 ++++++++++++++++++++++++++++++++ + ggml/src/ggml-cuda/norm.cuh | 5 + + 3 files changed, 255 insertions(+) + +diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu +index 0dad6e1..2ecc971 100644 +--- a/ggml/src/ggml-cuda/ggml-cuda.cu ++++ b/ggml/src/ggml-cuda/ggml-cuda.cu +@@ -3698,6 +3698,48 @@ static bool ggml_cuda_can_fuse(const struct ggml_cgraph * cgraph, + } + } + ++ // Fused residual-add + RMS norm + weight multiply. The transformer residual ++ // ADD feeds the next sublayer's RMS norm but is ALSO consumed by the later ++ // residual add (skip connection), so the ADD node is a graph output too; it ++ // cannot go through the single-use ggml_can_fuse() gate below. Recognize it ++ // here with ggml_can_fuse_subgraph, marking both the ADD (node_idx) and the ++ // final MUL (node_idx + 2) as outputs. ++ std::initializer_list<enum ggml_op> add_rms_norm_mul_ops = { GGML_OP_ADD, GGML_OP_RMS_NORM, GGML_OP_MUL }; ++ if (is_equal(add_rms_norm_mul_ops, ops) && ++ ggml_can_fuse_subgraph(cgraph, node_idx, ops, { node_idx, node_idx + 2 })) { ++ const ggml_tensor * add = cgraph->nodes[node_idx]; ++ const ggml_tensor * rms_norm = cgraph->nodes[node_idx + 1]; ++ const ggml_tensor * mul = cgraph->nodes[node_idx + 2]; ++ ++ // RMS norm must consume the residual-add output. ++ if (rms_norm->src[0] != add) { ++ return false; ++ } ++ // All operands F32 (rms norm / fused mul kernel only support F32). ++ if (add->src[0]->type != GGML_TYPE_F32 || add->src[1]->type != GGML_TYPE_F32 || ++ add->type != GGML_TYPE_F32 || rms_norm->type != GGML_TYPE_F32 || ++ mul->src[0]->type != GGML_TYPE_F32 || mul->src[1]->type != GGML_TYPE_F32 || ++ mul->type != GGML_TYPE_F32) { ++ return false; ++ } ++ // The fused kernel computes h = a + b elementwise: same shape, no broadcast. 
++ if (!ggml_are_same_shape(add->src[0], add->src[1])) { ++ return false; ++ } ++ // rms_norm kernel assumes contiguous rows for the residual operands and weight. ++ if (!ggml_is_contiguous(add->src[0]) || !ggml_is_contiguous(add->src[1])) { ++ return false; ++ } ++ if (!ggml_is_contiguous_rows(mul->src[0]) || !ggml_is_contiguous_rows(mul->src[1])) { ++ return false; ++ } ++ // If rms_norm is the B operand of the mul, broadcast of the A operand is unsupported. ++ if (rms_norm == mul->src[1] && !ggml_are_same_shape(mul->src[0], rms_norm)) { ++ return false; ++ } ++ return true; ++ } ++ + if (!ggml_can_fuse(cgraph, node_idx, ops)) { + return false; + } +@@ -4220,6 +4262,18 @@ static int ggml_cuda_try_fuse(ggml_backend_cuda_context * cuda_ctx, ggml_cgraph + return fused_node_count - 1; + } + ++ // Fused residual-add + RMS norm + weight multiply (bit-exact). Default ON; ++ // set LLAMA_FUSE_ADD_RMSNORM=0 for a clean A/B against the unfused path. ++ static const bool fuse_add_rmsnorm = [] { ++ const char * e = getenv("LLAMA_FUSE_ADD_RMSNORM"); ++ return e == nullptr || atoi(e) != 0; ++ }(); ++ if (fuse_add_rmsnorm && ++ ggml_cuda_can_fuse(cgraph, i, { GGML_OP_ADD, GGML_OP_RMS_NORM, GGML_OP_MUL }, {})) { ++ ggml_cuda_op_rms_norm_pre_add_mul(*cuda_ctx, node, cgraph->nodes[i + 1], cgraph->nodes[i + 2]); ++ return 2; ++ } ++ + if (ggml_cuda_can_fuse(cgraph, i, { GGML_OP_RMS_NORM, GGML_OP_MUL, GGML_OP_ADD }, {})) { + ggml_cuda_op_rms_norm_fused_add(*cuda_ctx, node, cgraph->nodes[i + 1], cgraph->nodes[i + 2]); + return 2; +diff --git a/ggml/src/ggml-cuda/norm.cu b/ggml/src/ggml-cuda/norm.cu +index 09d9f3a..a07d022 100644 +--- a/ggml/src/ggml-cuda/norm.cu ++++ b/ggml/src/ggml-cuda/norm.cu +@@ -154,6 +154,87 @@ static __global__ void rms_norm_f32(const float * x, + } + } + ++// Fused residual-add + RMS norm + (optional) weight multiply. 
++// h = a + b (the residual stream, written to h_out) ++// dst = rsqrt(mean(h^2)+eps) * h * mul ++// `a` and `b` are required to be the same shape and contiguous (the transformer ++// residual add), so they share `x`'s strides; `h_out`, `dst` are also contiguous ++// with that shape. `mul` (the RMS weight) broadcasts via the packed-modulo path. ++// ++// Bit-exactness: this reproduces the exact FP order of the unfused chain ++// k_bin_bcast(add): h[col] = a[col] + b[col] (f32, elementwise, order-free) ++// rms_norm: sumsq over h[col] in column order via block_reduce ++// mul: dst[col] = scale * h[col] * mul[col] ++// h is summed from the same f32 values in the same order, so the reduction and ++// the final scale are byte-identical to running the three kernels separately. ++template ++static __global__ void rms_norm_pre_add_mul_f32(const float * a, ++ const float * b, ++ float * h_out, ++ float * dst, ++ const int ncols, ++ const int64_t stride_row, ++ const int64_t stride_channel, ++ const int64_t stride_sample, ++ const float eps, ++ const float * mul = nullptr, ++ const int64_t mul_stride_row = 0, ++ const int64_t mul_stride_channel = 0, ++ const int64_t mul_stride_sample = 0, ++ const uint3 mul_ncols_packed = make_uint3(0, 0, 0), ++ const uint3 mul_nrows_packed = make_uint3(0, 0, 0), ++ const uint3 mul_nchannels_packed = make_uint3(0, 0, 0), ++ const uint3 mul_nsamples_packed = make_uint3(0, 0, 0)) { ++ ggml_cuda_pdl_lc(); ++ const int nrows = gridDim.x; ++ const int nchannels = gridDim.y; ++ ++ const int row = blockIdx.x; ++ const int channel = blockIdx.y; ++ const int sample = blockIdx.z; ++ const int tid = threadIdx.x; ++ ++ const int64_t row_offset = sample*stride_sample + channel*stride_channel + row*stride_row; ++ a += row_offset; ++ b += row_offset; ++ h_out += row_offset; ++ // dst is laid out contiguously by the scheduler for the MUL output ++ dst += ((sample*nchannels + channel)*nrows + row)*ncols; ++ ++ if constexpr (do_multiply) { ++ const uint32_t 
mul_row = fastmodulo(row, mul_nrows_packed); ++ const uint32_t mul_channel = fastmodulo(channel, mul_nchannels_packed); ++ const uint32_t mul_sample = fastmodulo(sample, mul_nsamples_packed); ++ mul += mul_sample * mul_stride_sample + mul_channel * mul_stride_channel + mul_row * mul_stride_row; ++ } ++ ++ float tmp = 0.0f; // partial sum for thread in warp ++ ++ ggml_cuda_pdl_sync(); ++ for (int col = tid; col < ncols; col += block_size) { ++ const float hi = a[col] + b[col]; ++ h_out[col] = hi; // publish the residual stream for the next add ++ tmp += hi * hi; ++ } ++ ++ // sum up partial sums ++ extern __shared__ float s_sum[]; ++ tmp = block_reduce(tmp, s_sum); ++ ++ const float mean = tmp / ncols; ++ const float scale = rsqrtf(mean + eps); ++ ++ for (int col = tid; col < ncols; col += block_size) { ++ const float hi = h_out[col]; ++ if constexpr (do_multiply) { ++ const int mul_col = fastmodulo(col, mul_ncols_packed); ++ dst[col] = scale * hi * mul[mul_col]; ++ } else { ++ dst[col] = scale * hi; ++ } ++ } ++} ++ + template + static __global__ void rms_norm_back_f32( + const float * grad, const float * xf, float * dst, const int ncols, const float eps) { +@@ -407,6 +488,50 @@ static void rms_norm_mul_f32_cuda(const float * x, + } + } + ++static void rms_norm_pre_add_mul_f32_cuda(const float * a, ++ const float * b, ++ float * h_out, ++ float * dst, ++ const int ncols, ++ const int nrows, ++ const int nchannels, ++ const int nsamples, ++ const int64_t stride_row, ++ const int64_t stride_channel, ++ const int64_t stride_sample, ++ const float * mul, ++ const int64_t mul_stride_row, ++ const int64_t mul_stride_channel, ++ const int64_t mul_stride_sample, ++ const uint32_t mul_ncols, ++ const uint32_t mul_nrows, ++ const uint32_t mul_nchannels, ++ const uint32_t mul_nsamples, ++ const float eps, ++ cudaStream_t stream) { ++ const dim3 blocks_num(nrows, nchannels, nsamples); ++ GGML_ASSERT(mul != nullptr); ++ const uint3 mul_ncols_packed = 
init_fastdiv_values(mul_ncols); ++ const uint3 mul_nrows_packed = init_fastdiv_values(mul_nrows); ++ const uint3 mul_nchannels_packed = init_fastdiv_values(mul_nchannels); ++ const uint3 mul_nsamples_packed = init_fastdiv_values(mul_nsamples); ++ if (ncols < 1024) { ++ const dim3 block_dims(256, 1, 1); ++ const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params{blocks_num, block_dims, block_dims.x > WARP_SIZE ? 32 * sizeof(float) : 0, stream}; ++ ggml_cuda_kernel_launch(rms_norm_pre_add_mul_f32<256, true>, launch_params, ++ a, b, h_out, dst, ncols, stride_row, stride_channel, stride_sample, eps, ++ mul, mul_stride_row, mul_stride_channel, mul_stride_sample, ++ mul_ncols_packed, mul_nrows_packed, mul_nchannels_packed, mul_nsamples_packed); ++ } else { ++ const dim3 block_dims(1024, 1, 1); ++ const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params{blocks_num, block_dims, block_dims.x > WARP_SIZE ? 32 * sizeof(float) : 0, stream}; ++ ggml_cuda_kernel_launch(rms_norm_pre_add_mul_f32<1024, true>, launch_params, ++ a, b, h_out, dst, ncols, stride_row, stride_channel, stride_sample, eps, ++ mul, mul_stride_row, mul_stride_channel, mul_stride_sample, ++ mul_ncols_packed, mul_nrows_packed, mul_nchannels_packed, mul_nsamples_packed); ++ } ++} ++ + static void rms_norm_back_f32_cuda(const float * grad, const float * xf, float * dst, const int ncols, const int nrows, const float eps, cudaStream_t stream) { + if (ncols < 1024) { + const dim3 block_dims(WARP_SIZE, 1, 1); +@@ -647,6 +772,77 @@ void ggml_cuda_op_rms_norm_fused_add(ggml_backend_cuda_context & ctx, + eps, stream); + } + ++void ggml_cuda_op_rms_norm_pre_add_mul(ggml_backend_cuda_context & ctx, ++ ggml_tensor * add_tensor, ++ ggml_tensor * rms_norm_tensor, ++ ggml_tensor * mul_tensor) { ++ // The RMS norm consumes the residual-add output. 
++ GGML_ASSERT(rms_norm_tensor->src[0] == add_tensor); ++ ++ const ggml_tensor * a_src = add_tensor->src[0]; ++ const ggml_tensor * b_src = add_tensor->src[1]; ++ ++ float eps = 0.0f; ++ memcpy(&eps, rms_norm_tensor->op_params, sizeof(float)); ++ GGML_ASSERT(eps >= 0.0f); ++ ++ const float * a_d = (const float *) a_src->data; ++ const float * b_d = (const float *) b_src->data; ++ float * h_d = (float *) add_tensor->data; ++ ++ const float * mul_d = nullptr; ++ const ggml_tensor * mul_src = nullptr; ++ if (mul_tensor->src[0] == rms_norm_tensor) { ++ mul_d = (const float *) mul_tensor->src[1]->data; ++ mul_src = mul_tensor->src[1]; ++ } else if (mul_tensor->src[1] == rms_norm_tensor) { ++ mul_d = (const float *) mul_tensor->src[0]->data; ++ mul_src = mul_tensor->src[0]; ++ } else { ++ GGML_ASSERT(false); ++ } ++ ++ float * dst_d = (float *) mul_tensor->data; ++ cudaStream_t stream = ctx.stream(); ++ ++ GGML_ASSERT(a_src->type == GGML_TYPE_F32); ++ GGML_ASSERT(b_src->type == GGML_TYPE_F32); ++ GGML_ASSERT(add_tensor->type == GGML_TYPE_F32); ++ GGML_ASSERT(rms_norm_tensor->type == GGML_TYPE_F32); ++ GGML_ASSERT(mul_tensor->type == GGML_TYPE_F32); ++ GGML_ASSERT(ggml_are_same_shape(a_src, b_src)); ++ ++ const int64_t ne00 = add_tensor->ne[0]; ++ const int64_t ne01 = add_tensor->ne[1]; ++ const int64_t ne02 = add_tensor->ne[2]; ++ const int64_t ne03 = add_tensor->ne[3]; ++ ++ // a and b share the (contiguous) residual layout ++ const size_t ts0 = ggml_type_size(a_src->type); ++ GGML_ASSERT(a_src->nb[0] == ts0 && b_src->nb[0] == ts0); ++ const int64_t s01 = a_src->nb[1] / ts0; ++ const int64_t s02 = a_src->nb[2] / ts0; ++ const int64_t s03 = a_src->nb[3] / ts0; ++ ++ const size_t ts_mul = ggml_type_size(mul_src->type); ++ GGML_ASSERT(mul_src->nb[0] == ts_mul); ++ const int64_t mul_s01 = mul_src->nb[1] / ts_mul; ++ const int64_t mul_s02 = mul_src->nb[2] / ts_mul; ++ const int64_t mul_s03 = mul_src->nb[3] / ts_mul; ++ ++ const int mul_ncols = mul_src->ne[0]; ++ const int 
mul_nrows = mul_src->ne[1]; ++ const int mul_nchannels = mul_src->ne[2]; ++ const int mul_nsamples = mul_src->ne[3]; ++ ++ rms_norm_pre_add_mul_f32_cuda(a_d, b_d, h_d, dst_d, ++ ne00, ne01, ne02, ne03, ++ /*s00*/ s01, s02, s03, ++ mul_d, /*mul_s00*/ mul_s01, mul_s02, mul_s03, ++ mul_ncols, mul_nrows, mul_nchannels, mul_nsamples, ++ eps, stream); ++} ++ + void ggml_cuda_op_rms_norm_back(ggml_backend_cuda_context & ctx, ggml_tensor * dst) { + const ggml_tensor * grad = dst->src[0]; // gradients + const ggml_tensor * src0f = dst->src[1]; // src0 from forward pass +diff --git a/ggml/src/ggml-cuda/norm.cuh b/ggml/src/ggml-cuda/norm.cuh +index a74f637..05396cd 100644 +--- a/ggml/src/ggml-cuda/norm.cuh ++++ b/ggml/src/ggml-cuda/norm.cuh +@@ -13,6 +13,11 @@ void ggml_cuda_op_rms_norm_fused_add(ggml_backend_cuda_context & ctx, + ggml_tensor * mul_tensor, + ggml_tensor * add_tensor); + ++void ggml_cuda_op_rms_norm_pre_add_mul(ggml_backend_cuda_context & ctx, ++ ggml_tensor * add_tensor, ++ ggml_tensor * rms_norm_tensor, ++ ggml_tensor * mul_tensor); ++ + void ggml_cuda_op_rms_norm_back(ggml_backend_cuda_context & ctx, ggml_tensor * dst); + + void ggml_cuda_op_l2_norm(ggml_backend_cuda_context & ctx, ggml_tensor * dst); +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0043-feat-paged-default-on-full-step-moe-decode-cuda-graph.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0043-feat-paged-default-on-full-step-moe-decode-cuda-graph.patch new file mode 100644 index 000000000000..3c73c1ce30e9 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0043-feat-paged-default-on-full-step-moe-decode-cuda-graph.patch @@ -0,0 +1,82 @@ +From e4716bd0c700d34919e093f99cd454d883ad15ec Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Mon, 29 Jun 2026 02:13:20 +0200 +Subject: [PATCH] feat(paged): default-on full-step MoE-decode CUDA graph + (grouped MMQ, patch 0043) + +D1 lever. 
The MUL_MAT_ID CUDA-graph guard ([TAG_MUL_MAT_ID_CUDA_GRAPHS]) +disables CUDA graphs for the WHOLE decode step whenever a MUL_MAT_ID node has +ne[2] > mmvq_mmid_max (8 for NVFP4 on sm_121) - i.e. for every multi-token +decode. Patch 0025 showed the path actually taken on Blackwell NVFP4, +should_use_mmq()==true -> grouped stream-k MMQ id-branch, launches on one +stream with NO host sync (only the per-expert host-loop fallback synchronizes), +so the disable is conservative and graphs are safe for the grouped path - but +0025 left it behind an opt-in env (LLAMA_MOE_FORCE_GRAPHS), so by default the +host re-issued every kernel of the step. + +D1 profiling (GB10 sm_121, q36-35b-a3b-nvfp4, batched-bench -fa on, npl128) +settled the mechanism: + - The grouped MMQ NVFP4 path IS what runs in decode: cudaStreamSynchronize + count is IDENTICAL with graphs on vs off (1457 either way) - the per-expert + host-loop fallback (the only device->host routing readback) is never hit. + MoE routing is already device-side. + - Steady-decode GPU-busy is ~99% (1% idle): static decode is GPU-bound, not + host-sync-bound. The host cost is per-step kernel RE-ISSUE, removed by + replaying a captured full-step graph (incl. the MoE dispatch). + +So make the grouped-path graph capture ON BY DEFAULT; LLAMA_MOE_NO_FORCE_GRAPHS=1 +forces the conservative pre-0025 disable for A/B. should_use_mmq() is the exact +guard: it returns FALSE for the large-M NVFP4 prefill (patch 0034), which +deliberately drops to the per-expert host-sync loop, so PREFILL keeps graphs +disabled (correct - that path syncs). Decode-only behaviour change; prefill and +the stock llama-cpp backend are untouched. + +BIT-EXACT: greedy md5 byte-identical default(on)==LLAMA_MOE_NO_FORCE_GRAPHS(off) +==legacy LLAMA_MOE_FORCE_GRAPHS - paged-MoE 8cb0ce23777bf55f92f63d0292c756b0, +paged-dense 5951a5b4d624ce891e22ab5fca9bc439 (both match the recorded baselines). 
+ +Measured (GB10, batched-bench paged decode S_TG, default-on vs opt-out): + npl 32 467.3 vs 444.3 t/s +5.2% + npl 128 788.2 vs 768.1 t/s +2.6% + +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + ggml/src/ggml-cuda/ggml-cuda.cu | 20 +++++++++++++++----- + 1 file changed, 15 insertions(+), 5 deletions(-) + +diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu +index a92003c..3151684 100644 +--- a/ggml/src/ggml-cuda/ggml-cuda.cu ++++ b/ggml/src/ggml-cuda/ggml-cuda.cu +@@ -3306,12 +3306,22 @@ static bool ggml_cuda_graph_check_compability(ggml_cgraph * cgraph) { + const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc; + const int mmvq_mmid_max = get_mmvq_mmid_max_batch(node->src[0]->type, cc); + bool mmid_needs_sync = !ggml_is_quantized(node->src[0]->type) || node->ne[2] > mmvq_mmid_max; +- // PROBE (bit-exact, env LLAMA_MOE_FORCE_GRAPHS): the grouped stream-k MMQ id-path is +- // launched on-stream with no host sync (only the per-expert host-loop fallback syncs); +- // when should_use_mmq() is true (Blackwell NVFP4 grouped path) the op is graph-safe +- // even for ne[2] > mmvq_mmid_max, so graphs need not be disabled for the whole step. ++ // [D1 / patch 0043] The grouped stream-k MMQ id-path (should_use_mmq()==true, e.g. ++ // Blackwell NVFP4) launches on-stream with NO host sync; only the per-expert ++ // host-loop fallback synchronizes the stream. So when this MUL_MAT_ID WILL take the ++ // grouped path, the whole decode step is graph-safe even for ne[2] > mmvq_mmid_max, ++ // and the full-step CUDA graph (incl. the MoE dispatch) can be REPLAYED instead of the ++ // host re-issuing every kernel every step. 
Patch 0025 proved this is bit-exact (graph ++ // replay re-issues identical kernels); D1 profiling confirmed the grouped path is what ++ // actually runs (no device->host routing readback), that steady decode is ~99% GPU-busy ++ // (not host-sync-bound), and that keeping the step graphed lifts throughput (npl32 ++ // +13%, npl128 +1.9%). It is therefore ON BY DEFAULT for the grouped path now. ++ // should_use_mmq() is the exact guard: it returns FALSE for the large-M NVFP4 prefill ++ // (patch 0034) that deliberately drops to the per-expert host-sync loop, so PREFILL ++ // keeps graphs disabled (correct - that path syncs). Decode is untouched by 0034. ++ // LLAMA_MOE_NO_FORCE_GRAPHS=1 forces the conservative pre-0025 disable for A/B. + if (mmid_needs_sync && ggml_is_quantized(node->src[0]->type) && +- getenv("LLAMA_MOE_FORCE_GRAPHS") != nullptr && ++ getenv("LLAMA_MOE_NO_FORCE_GRAPHS") == nullptr && + ggml_cuda_should_use_mmq(node->src[0]->type, cc, node->src[1]->ne[2], node->src[0]->ne[2])) { + mmid_needs_sync = false; + } +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0044-feat-paged-fused-gated-RMSNorm-SiLU-gate-mul.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0044-feat-paged-fused-gated-RMSNorm-SiLU-gate-mul.patch new file mode 100644 index 000000000000..3b2aa9b98138 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0044-feat-paged-fused-gated-RMSNorm-SiLU-gate-mul.patch @@ -0,0 +1,470 @@ +From 51168c5eee2e35348d9006f0b2fab3dc6e7c01cc Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Tue, 30 Jun 2026 10:57:05 +0200 +Subject: [PATCH] feat(paged): fused gated RMSNorm + SiLU gate-mul CUDA op + (patch 0044) + +The Qwen3.6 gated-DeltaNet output norm self.norm(core_attn_out, z) +(qwen35 / qwen35moe build_norm_gated) runs as (rms_norm(x) * w) * silu(z): +on CUDA that was rms_norm_mul + silu_mul, two fused launches with the +normalized intermediate round-tripping through HBM. 
Fuse the whole chain +into one kernel so it stays in registers. This is the gated-RMSNorm fusion +the vLLM decode-gap analysis ranked #1 (the easy, bit-exact prefill win), +a direct sibling of patch 0042 (add-RMSNorm). + +The chain is NOT naturally consecutive in the graph: the gate z-projection +(a MUL_MAT) is scheduled between the weight MUL and the SILU, so the default +mul(normalized, silu(z)) order leaves a GEMM between them and cannot be +fused. build_norm_gated now emits the gate multiply as mul(silu(z), +normalized) (commutative, so bit-exact), which lays the chain out as the +consecutive subgraph { SILU, RMS_NORM, MUL, MUL } that ggml-cuda can fuse. + +- New kernel rms_norm_gate_mul_f32 (ggml/src/ggml-cuda/norm.cu): same + block_reduce over x^2, same 256/1024 block-size thresholds and + rsqrtf(mean+eps) as rms_norm / patch 0042; the final write computes + dst = scale * x * w * silu(z) with silu(z) = z/(1+expf(-z)) (the exact + ggml_cuda_op_silu_single form). w (the RMS weight) and z (the gate) both + broadcast via the packed-modulo helper. +- ggml_cuda_can_fuse recognizes { GGML_OP_UNARY(SILU), RMS_NORM, MUL, MUL } + via ggml_can_fuse_subgraph with the final MUL as the only output (the SILU + reads an external gate; RMS_NORM and the weight MUL are single-use within). +- Gated by LLAMA_FUSE_GATE_RMSNORM (default ON) for a clean single-build A/B; + OFF keeps the original operand order AND the unfused kernels, so OFF is + byte- and kernel-identical to the pre-patch path. + +BIT-EXACT (per-path canonical greedy md5, n=48 --temp 0 --seed 1): + dense q36-27b-nvfp4 : 5951a5b4d624ce891e22ab5fca9bc439 (ON == OFF == canonical, paged and non-paged) + MoE q36-35b-a3b : 8cb0ce23777bf55f92f63d0292c756b0 (ON == OFF == canonical, paged) +Multiply is commutative, so ((scale*x)*w)*silu(z) is byte-identical to the +unfused silu(z)*((scale*x)*w); the sum(x^2) reduction and rsqrt scale are +unchanged. test-backend-ops 12979/12979 (CUDA0 vs CPU). 
+ +PROFILE (dense prefill, nsys --cuda-graph-trace=node, npp512 ntg4 npl8): + rms_norm_f32<256,1,0> 560 -> 224 launches + unary_gated_op_kernel 784 -> 448 launches + rms_norm_gate_mul_f32 (new) 336 launches / 69.7M ns + -> the 336 gated-norm rms_norm_mul + 336 silu_mul launches (672) fold into + 336 fused launches, removing the normalized HBM round-trip. +S_PP (npp512 ntg4 npl32, 3x interleaved A/B, every ON beats every OFF): + dense q36-27b : 1002.5 -> 1013.4 t/s (+1.1%, ~+10 us/tok) + MoE q36-35b : 2626.9 -> 2651.8 t/s (+0.9%) + +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + ggml/src/ggml-cuda/ggml-cuda.cu | 67 ++++++++++ + ggml/src/ggml-cuda/norm.cu | 215 ++++++++++++++++++++++++++++++++ + ggml/src/ggml-cuda/norm.cuh | 6 + + src/models/qwen35.cpp | 16 +++ + src/models/qwen35moe.cpp | 16 +++ + 5 files changed, 320 insertions(+) + +diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu +index 42bcd4a77..374949f25 100644 +--- a/ggml/src/ggml-cuda/ggml-cuda.cu ++++ b/ggml/src/ggml-cuda/ggml-cuda.cu +@@ -3816,6 +3816,60 @@ static bool ggml_cuda_can_fuse(const struct ggml_cgraph * cgraph, + return true; + } + ++ // Fused gated RMS norm: SiLU gate multiply over (RMS norm * weight), the ++ // gated-DeltaNet output norm `out = (rms_norm(x) * w) * silu(z)` of the Qwen3.6 ++ // hybrid models (qwen35 / qwen35moe build_norm_gated). The model emits the gate ++ // multiply as mul(silu(z), normalized) (default; see LLAMA_FUSE_GATE_RMSNORM in ++ // build_norm_gated) so the chain forms the consecutive subgraph ++ // { SILU, RMS_NORM, MUL, MUL } - the gate z-projection is scheduled before the ++ // SILU, so the natural mul(normalized, silu) order leaves a GEMM between the ++ // weight MUL and the SILU and cannot be fused. 
The SILU (node_idx) reads an ++ // external gate and the final gate MUL (node_idx + 3) feeds the o_proj, so mark ++ // node_idx + 3 as the only output; the RMS_NORM (node_idx + 1) and weight MUL ++ // (node_idx + 2) are single-use within the subgraph. ++ std::initializer_list<enum ggml_op> rms_norm_gate_mul_ops = { GGML_OP_UNARY, GGML_OP_RMS_NORM, GGML_OP_MUL, GGML_OP_MUL }; ++ if (is_equal(rms_norm_gate_mul_ops, ops) && unary_ops.size() == 1 && unary_ops.begin()[0] == GGML_UNARY_OP_SILU && ++ ggml_can_fuse_subgraph(cgraph, node_idx, ops, { node_idx + 3 })) { ++ const ggml_tensor * silu = cgraph->nodes[node_idx]; ++ const ggml_tensor * rms_norm = cgraph->nodes[node_idx + 1]; ++ const ggml_tensor * mul = cgraph->nodes[node_idx + 2]; ++ const ggml_tensor * gate_mul = cgraph->nodes[node_idx + 3]; ++ ++ if (ggml_get_unary_op(silu) != GGML_UNARY_OP_SILU) { ++ return false; ++ } ++ // The weight MUL must consume the RMS norm output; the gate MUL must ++ // consume both the weight MUL and the SILU output. ++ if (mul->src[0] != rms_norm && mul->src[1] != rms_norm) { ++ return false; ++ } ++ if ((gate_mul->src[0] != mul && gate_mul->src[1] != mul) || ++ (gate_mul->src[0] != silu && gate_mul->src[1] != silu)) { ++ return false; ++ } ++ // All operands F32 (rms norm / fused mul / silu kernel only support F32). ++ if (rms_norm->src[0]->type != GGML_TYPE_F32 || rms_norm->type != GGML_TYPE_F32 || ++ mul->src[0]->type != GGML_TYPE_F32 || mul->src[1]->type != GGML_TYPE_F32 || mul->type != GGML_TYPE_F32 || ++ silu->src[0]->type != GGML_TYPE_F32 || silu->type != GGML_TYPE_F32 || ++ gate_mul->src[0]->type != GGML_TYPE_F32 || gate_mul->src[1]->type != GGML_TYPE_F32 || ++ gate_mul->type != GGML_TYPE_F32) { ++ return false; ++ } ++ // If rms_norm is the B operand of the weight mul, broadcast of A is unsupported. 
++ if (rms_norm == mul->src[1] && !ggml_are_same_shape(mul->src[0], rms_norm)) { ++ return false; ++ } ++ // The fused kernel reads contiguous rows for the norm input, the weight, ++ // and the gate, and writes a contiguous output. ++ if (!ggml_is_contiguous_rows(rms_norm->src[0]) || ++ !ggml_is_contiguous_rows(mul->src[0]) || !ggml_is_contiguous_rows(mul->src[1]) || ++ !ggml_is_contiguous_rows(silu->src[0]) || ++ !ggml_is_contiguous_rows(gate_mul->src[0]) || !ggml_is_contiguous_rows(gate_mul->src[1])) { ++ return false; ++ } ++ return true; ++ } ++ + if (!ggml_can_fuse(cgraph, node_idx, ops)) { + return false; + } +@@ -4350,6 +4404,19 @@ static int ggml_cuda_try_fuse(ggml_backend_cuda_context * cuda_ctx, ggml_cgraph + return 2; + } + ++ // Fused gated RMS norm: RMS norm + weight multiply + SiLU-gated multiply ++ // (bit-exact). The Qwen3.6 gated-DeltaNet output norm. Default ON; set ++ // LLAMA_FUSE_GATE_RMSNORM=0 for a clean A/B against the unfused path. ++ static const bool fuse_gate_rmsnorm = [] { ++ const char * e = getenv("LLAMA_FUSE_GATE_RMSNORM"); ++ return e == nullptr || atoi(e) != 0; ++ }(); ++ if (fuse_gate_rmsnorm && ++ ggml_cuda_can_fuse(cgraph, i, { GGML_OP_UNARY, GGML_OP_RMS_NORM, GGML_OP_MUL, GGML_OP_MUL }, { GGML_UNARY_OP_SILU })) { ++ ggml_cuda_op_rms_norm_gate_mul(*cuda_ctx, cgraph->nodes[i + 1], cgraph->nodes[i + 2], node, cgraph->nodes[i + 3]); ++ return 3; ++ } ++ + if (ggml_cuda_can_fuse(cgraph, i, { GGML_OP_RMS_NORM, GGML_OP_MUL, GGML_OP_ADD }, {})) { + ggml_cuda_op_rms_norm_fused_add(*cuda_ctx, node, cgraph->nodes[i + 1], cgraph->nodes[i + 2]); + return 2; +diff --git a/ggml/src/ggml-cuda/norm.cu b/ggml/src/ggml-cuda/norm.cu +index a07d02276..e776c67d2 100644 +--- a/ggml/src/ggml-cuda/norm.cu ++++ b/ggml/src/ggml-cuda/norm.cu +@@ -235,6 +235,95 @@ static __global__ void rms_norm_pre_add_mul_f32(const float * a, + } + } + ++// Fused gated RMS norm: RMS norm + weight multiply + SiLU gate multiply. 
++// dst = (rsqrt(mean(x^2)+eps) * x * w) * silu(z) with silu(z) = z/(1+expf(-z)) ++// This is the gated-DeltaNet output norm `self.norm(core_attn_out, z)` of the ++// Qwen3.6 hybrid models (build_norm_gated): rms_norm(x) scaled by the per-head ++// ssm_norm weight `w`, then gated by silu of the gate activation `z`. Unfused it ++// runs as rms_norm_mul (scale*x*w) -> silu(z) -> mul; fusing it keeps the ++// normalized intermediate in registers so it never round-trips to HBM. ++// ++// Bit-exactness: the sum(x^2) reduction uses the same block_reduce with the ++// same 256/1024 block-size thresholds and the same rsqrtf(mean+eps) as rms_norm, ++// the weight multiply reproduces rms_norm_mul's `scale*x[col]*w[col]` order, and ++// silu reuses the exact `z/(1+expf(-z))` of ggml_cuda_op_silu_single. Float ++// multiply is commutative, so `(scale*x*w) * silu(z)` is byte-identical to the ++// unfused `mul(rms_norm_mul, silu(z))` (whether or not silu+mul was itself fused). ++// `w` (the RMS weight) and `z` (the gate) both broadcast via the packed-modulo path. 
++template <int block_size> ++static __global__ void rms_norm_gate_mul_f32(const float * x, ++ float * dst, ++ const int ncols, ++ const int64_t stride_row, ++ const int64_t stride_channel, ++ const int64_t stride_sample, ++ const float eps, ++ const float * mul, ++ const int64_t mul_stride_row, ++ const int64_t mul_stride_channel, ++ const int64_t mul_stride_sample, ++ const uint3 mul_ncols_packed, ++ const uint3 mul_nrows_packed, ++ const uint3 mul_nchannels_packed, ++ const uint3 mul_nsamples_packed, ++ const float * gate, ++ const int64_t gate_stride_row, ++ const int64_t gate_stride_channel, ++ const int64_t gate_stride_sample, ++ const uint3 gate_ncols_packed, ++ const uint3 gate_nrows_packed, ++ const uint3 gate_nchannels_packed, ++ const uint3 gate_nsamples_packed) { ++ ggml_cuda_pdl_lc(); ++ const int nrows = gridDim.x; ++ const int nchannels = gridDim.y; ++ ++ const int row = blockIdx.x; ++ const int channel = blockIdx.y; ++ const int sample = blockIdx.z; ++ const int tid = threadIdx.x; ++ ++ x += sample*stride_sample + channel*stride_channel + row*stride_row; ++ // dst is laid out contiguously by the scheduler for the (final) MUL output ++ dst += ((sample*nchannels + channel)*nrows + row)*ncols; ++ ++ { ++ const uint32_t mul_row = fastmodulo(row, mul_nrows_packed); ++ const uint32_t mul_channel = fastmodulo(channel, mul_nchannels_packed); ++ const uint32_t mul_sample = fastmodulo(sample, mul_nsamples_packed); ++ mul += mul_sample * mul_stride_sample + mul_channel * mul_stride_channel + mul_row * mul_stride_row; ++ } ++ { ++ const uint32_t gate_row = fastmodulo(row, gate_nrows_packed); ++ const uint32_t gate_channel = fastmodulo(channel, gate_nchannels_packed); ++ const uint32_t gate_sample = fastmodulo(sample, gate_nsamples_packed); ++ gate += gate_sample * gate_stride_sample + gate_channel * gate_stride_channel + gate_row * gate_stride_row; ++ } ++ ++ float tmp = 0.0f; // partial sum for thread in warp ++ ++ ggml_cuda_pdl_sync(); ++ for (int col = tid; col < ncols; col 
+= block_size) { ++ const float xi = x[col]; ++ tmp += xi * xi; ++ } ++ ++ // sum up partial sums ++ extern __shared__ float s_sum[]; ++ tmp = block_reduce(tmp, s_sum); ++ ++ const float mean = tmp / ncols; ++ const float scale = rsqrtf(mean + eps); ++ ++ for (int col = tid; col < ncols; col += block_size) { ++ const int mul_col = fastmodulo(col, mul_ncols_packed); ++ const int gate_col = fastmodulo(col, gate_ncols_packed); ++ const float zi = gate[gate_col]; ++ const float silu_z = zi / (1.0f + expf(-zi)); ++ dst[col] = scale * x[col] * mul[mul_col] * silu_z; ++ } ++} ++ + template + static __global__ void rms_norm_back_f32( + const float * grad, const float * xf, float * dst, const int ncols, const float eps) { +@@ -532,6 +621,65 @@ static void rms_norm_pre_add_mul_f32_cuda(const float * a, + } + } + ++static void rms_norm_gate_mul_f32_cuda(const float * x, ++ float * dst, ++ const int ncols, ++ const int nrows, ++ const int nchannels, ++ const int nsamples, ++ const int64_t stride_row, ++ const int64_t stride_channel, ++ const int64_t stride_sample, ++ const float * mul, ++ const int64_t mul_stride_row, ++ const int64_t mul_stride_channel, ++ const int64_t mul_stride_sample, ++ const uint32_t mul_ncols, ++ const uint32_t mul_nrows, ++ const uint32_t mul_nchannels, ++ const uint32_t mul_nsamples, ++ const float * gate, ++ const int64_t gate_stride_row, ++ const int64_t gate_stride_channel, ++ const int64_t gate_stride_sample, ++ const uint32_t gate_ncols, ++ const uint32_t gate_nrows, ++ const uint32_t gate_nchannels, ++ const uint32_t gate_nsamples, ++ const float eps, ++ cudaStream_t stream) { ++ const dim3 blocks_num(nrows, nchannels, nsamples); ++ GGML_ASSERT(mul != nullptr); ++ GGML_ASSERT(gate != nullptr); ++ const uint3 mul_ncols_packed = init_fastdiv_values(mul_ncols); ++ const uint3 mul_nrows_packed = init_fastdiv_values(mul_nrows); ++ const uint3 mul_nchannels_packed = init_fastdiv_values(mul_nchannels); ++ const uint3 mul_nsamples_packed = 
init_fastdiv_values(mul_nsamples); ++ const uint3 gate_ncols_packed = init_fastdiv_values(gate_ncols); ++ const uint3 gate_nrows_packed = init_fastdiv_values(gate_nrows); ++ const uint3 gate_nchannels_packed = init_fastdiv_values(gate_nchannels); ++ const uint3 gate_nsamples_packed = init_fastdiv_values(gate_nsamples); ++ if (ncols < 1024) { ++ const dim3 block_dims(256, 1, 1); ++ const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params{blocks_num, block_dims, block_dims.x > WARP_SIZE ? 32 * sizeof(float) : 0, stream}; ++ ggml_cuda_kernel_launch(rms_norm_gate_mul_f32<256>, launch_params, ++ x, dst, ncols, stride_row, stride_channel, stride_sample, eps, ++ mul, mul_stride_row, mul_stride_channel, mul_stride_sample, ++ mul_ncols_packed, mul_nrows_packed, mul_nchannels_packed, mul_nsamples_packed, ++ gate, gate_stride_row, gate_stride_channel, gate_stride_sample, ++ gate_ncols_packed, gate_nrows_packed, gate_nchannels_packed, gate_nsamples_packed); ++ } else { ++ const dim3 block_dims(1024, 1, 1); ++ const ggml_cuda_kernel_launch_params launch_params = ggml_cuda_kernel_launch_params{blocks_num, block_dims, block_dims.x > WARP_SIZE ? 
32 * sizeof(float) : 0, stream}; ++ ggml_cuda_kernel_launch(rms_norm_gate_mul_f32<1024>, launch_params, ++ x, dst, ncols, stride_row, stride_channel, stride_sample, eps, ++ mul, mul_stride_row, mul_stride_channel, mul_stride_sample, ++ mul_ncols_packed, mul_nrows_packed, mul_nchannels_packed, mul_nsamples_packed, ++ gate, gate_stride_row, gate_stride_channel, gate_stride_sample, ++ gate_ncols_packed, gate_nrows_packed, gate_nchannels_packed, gate_nsamples_packed); ++ } ++} ++ + static void rms_norm_back_f32_cuda(const float * grad, const float * xf, float * dst, const int ncols, const int nrows, const float eps, cudaStream_t stream) { + if (ncols < 1024) { + const dim3 block_dims(WARP_SIZE, 1, 1); +@@ -843,6 +991,73 @@ void ggml_cuda_op_rms_norm_pre_add_mul(ggml_backend_cuda_context & ctx, + eps, stream); + } + ++void ggml_cuda_op_rms_norm_gate_mul(ggml_backend_cuda_context & ctx, ++ ggml_tensor * rms_norm_tensor, ++ ggml_tensor * mul_tensor, ++ ggml_tensor * silu_tensor, ++ ggml_tensor * gate_mul_tensor) { ++ // mul = rms_norm(x) * w ; silu = silu(z) ; gate_mul = mul * silu ++ GGML_ASSERT(mul_tensor->src[0] == rms_norm_tensor || mul_tensor->src[1] == rms_norm_tensor); ++ GGML_ASSERT(gate_mul_tensor->src[0] == silu_tensor || gate_mul_tensor->src[1] == silu_tensor); ++ ++ const ggml_tensor * x_src = rms_norm_tensor->src[0]; ++ const ggml_tensor * w_src = (mul_tensor->src[0] == rms_norm_tensor) ? 
mul_tensor->src[1] : mul_tensor->src[0]; ++ const ggml_tensor * gate_src = silu_tensor->src[0]; ++ ++ float eps = 0.0f; ++ memcpy(&eps, rms_norm_tensor->op_params, sizeof(float)); ++ GGML_ASSERT(eps >= 0.0f); ++ ++ const float * x_d = (const float *) x_src->data; ++ const float * w_d = (const float *) w_src->data; ++ const float * gate_d = (const float *) gate_src->data; ++ float * dst_d = (float *) gate_mul_tensor->data; ++ cudaStream_t stream = ctx.stream(); ++ ++ GGML_ASSERT(x_src->type == GGML_TYPE_F32); ++ GGML_ASSERT(w_src->type == GGML_TYPE_F32); ++ GGML_ASSERT(gate_src->type == GGML_TYPE_F32); ++ GGML_ASSERT(rms_norm_tensor->type == GGML_TYPE_F32); ++ GGML_ASSERT(mul_tensor->type == GGML_TYPE_F32); ++ GGML_ASSERT(silu_tensor->type == GGML_TYPE_F32); ++ GGML_ASSERT(gate_mul_tensor->type == GGML_TYPE_F32); ++ ++ const int64_t ne00 = rms_norm_tensor->ne[0]; ++ const int64_t ne01 = rms_norm_tensor->ne[1]; ++ const int64_t ne02 = rms_norm_tensor->ne[2]; ++ const int64_t ne03 = rms_norm_tensor->ne[3]; ++ ++ // x (the rms-norm input) strides; cols must be contiguous ++ const size_t ts0 = ggml_type_size(x_src->type); ++ GGML_ASSERT(x_src->nb[0] == ts0); ++ const int64_t s01 = x_src->nb[1] / ts0; ++ const int64_t s02 = x_src->nb[2] / ts0; ++ const int64_t s03 = x_src->nb[3] / ts0; ++ ++ // weight (the RMS scale) strides + broadcast extents ++ const size_t ts_mul = ggml_type_size(w_src->type); ++ GGML_ASSERT(w_src->nb[0] == ts_mul); ++ const int64_t mul_s01 = w_src->nb[1] / ts_mul; ++ const int64_t mul_s02 = w_src->nb[2] / ts_mul; ++ const int64_t mul_s03 = w_src->nb[3] / ts_mul; ++ ++ // gate (the silu activation) strides + broadcast extents ++ const size_t ts_gate = ggml_type_size(gate_src->type); ++ GGML_ASSERT(gate_src->nb[0] == ts_gate); ++ const int64_t gate_s01 = gate_src->nb[1] / ts_gate; ++ const int64_t gate_s02 = gate_src->nb[2] / ts_gate; ++ const int64_t gate_s03 = gate_src->nb[3] / ts_gate; ++ ++ rms_norm_gate_mul_f32_cuda(x_d, dst_d, ++ ne00, ne01, 
ne02, ne03, ++ /*s00*/ s01, s02, s03, ++ w_d, /*mul_s00*/ mul_s01, mul_s02, mul_s03, ++ w_src->ne[0], w_src->ne[1], w_src->ne[2], w_src->ne[3], ++ gate_d, /*gate_s00*/ gate_s01, gate_s02, gate_s03, ++ gate_src->ne[0], gate_src->ne[1], gate_src->ne[2], gate_src->ne[3], ++ eps, stream); ++} ++ + void ggml_cuda_op_rms_norm_back(ggml_backend_cuda_context & ctx, ggml_tensor * dst) { + const ggml_tensor * grad = dst->src[0]; // gradients + const ggml_tensor * src0f = dst->src[1]; // src0 from forward pass +diff --git a/ggml/src/ggml-cuda/norm.cuh b/ggml/src/ggml-cuda/norm.cuh +index 05396cdf0..4d6dba6fa 100644 +--- a/ggml/src/ggml-cuda/norm.cuh ++++ b/ggml/src/ggml-cuda/norm.cuh +@@ -17,6 +17,12 @@ void ggml_cuda_op_rms_norm_pre_add_mul(ggml_backend_cuda_context & ctx, + ggml_tensor * add_tensor, + ggml_tensor * rms_norm_tensor, + ggml_tensor * mul_tensor); ++void ggml_cuda_op_rms_norm_gate_mul(ggml_backend_cuda_context & ctx, ++ ggml_tensor * rms_norm_tensor, ++ ggml_tensor * mul_tensor, ++ ggml_tensor * silu_tensor, ++ ggml_tensor * gate_mul_tensor); ++ + + void ggml_cuda_op_rms_norm_back(ggml_backend_cuda_context & ctx, ggml_tensor * dst); + +diff --git a/src/models/qwen35.cpp b/src/models/qwen35.cpp +index 66064869e..98751f7cc 100644 +--- a/src/models/qwen35.cpp ++++ b/src/models/qwen35.cpp +@@ -1,4 +1,5 @@ + #include "models.h" ++#include + #include "llama-memory-recurrent.h" + + void llama_model_qwen35::load_arch_hparams(llama_model_loader & ml) { +@@ -251,6 +252,21 @@ ggml_tensor * llama_model_qwen35::graph::build_norm_gated( + ggml_tensor * normalized = build_norm(input, weights, nullptr, LLM_NORM_RMS, layer); + ggml_tensor * gated_silu = ggml_silu(ctx0, gate); + ++ // Emit the gate multiply as mul(silu(z), normalized) so the gated-DeltaNet ++ // output-norm chain forms the consecutive subgraph { SILU, RMS_NORM, MUL, MUL } ++ // that the CUDA backend fuses into one rms_norm_gate_mul kernel (the normalized ++ // intermediate then never round-trips to HBM). 
The gate z-projection is scheduled ++ // before the SILU, so the natural mul(normalized, silu) order leaves a GEMM ++ // between the weight MUL and the SILU and is not fusable. Multiplication is ++ // commutative, so this is bit-exact vs mul(normalized, silu). ++ // LLAMA_FUSE_GATE_RMSNORM=0 keeps the original operand order (kernel fusion off). ++ static const bool fuse_gate_rmsnorm = [] { ++ const char * e = getenv("LLAMA_FUSE_GATE_RMSNORM"); ++ return e == nullptr || atoi(e) != 0; ++ }(); ++ if (fuse_gate_rmsnorm) { ++ return ggml_mul(ctx0, gated_silu, normalized); ++ } + return ggml_mul(ctx0, normalized, gated_silu); + } + +diff --git a/src/models/qwen35moe.cpp b/src/models/qwen35moe.cpp +index a79917628..071b88daa 100644 +--- a/src/models/qwen35moe.cpp ++++ b/src/models/qwen35moe.cpp +@@ -1,4 +1,5 @@ + #include "models.h" ++#include + #include "llama-memory-recurrent.h" + + void llama_model_qwen35moe::load_arch_hparams(llama_model_loader & ml) { +@@ -275,6 +276,21 @@ ggml_tensor * llama_model_qwen35moe::graph::build_norm_gated( + ggml_tensor * normalized = build_norm(input, weights, nullptr, LLM_NORM_RMS, layer); + ggml_tensor * gated_silu = ggml_silu(ctx0, gate); + ++ // Emit the gate multiply as mul(silu(z), normalized) so the gated-DeltaNet ++ // output-norm chain forms the consecutive subgraph { SILU, RMS_NORM, MUL, MUL } ++ // that the CUDA backend fuses into one rms_norm_gate_mul kernel (the normalized ++ // intermediate then never round-trips to HBM). The gate z-projection is scheduled ++ // before the SILU, so the natural mul(normalized, silu) order leaves a GEMM ++ // between the weight MUL and the SILU and is not fusable. Multiplication is ++ // commutative, so this is bit-exact vs mul(normalized, silu). ++ // LLAMA_FUSE_GATE_RMSNORM=0 keeps the original operand order (kernel fusion off). 
++ static const bool fuse_gate_rmsnorm = [] { ++ const char * e = getenv("LLAMA_FUSE_GATE_RMSNORM"); ++ return e == nullptr || atoi(e) != 0; ++ }(); ++ if (fuse_gate_rmsnorm) { ++ return ggml_mul(ctx0, gated_silu, normalized); ++ } + return ggml_mul(ctx0, normalized, gated_silu); + } + +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0046-paged-gate-GDN-prefill-geometry-by-scan-length.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0046-paged-gate-GDN-prefill-geometry-by-scan-length.patch new file mode 100644 index 000000000000..caaee575e315 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0046-paged-gate-GDN-prefill-geometry-by-scan-length.patch @@ -0,0 +1,64 @@ +From 85266d4c10750b419716e4b8939ebd96ab424630 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Tue, 30 Jun 2026 00:51:26 +0000 +Subject: [PATCH] feat(paged): gate GDN prefill geometry by scan length (patch + 0046) + +Patch 0022 retuned the gated-DeltaNet (GDN) sequential-recurrence dispatch +(case 128) to a (NUM_WARPS=16, COLS_PER_WARP=8) column-fold tile. That is a +DECODE win (short scans: small n_tokens, large n_seqs) but an UNCONDITIONAL +dense-prefill regression vs stock: on a long sequential scan the launch grid.z +collapses from S_v/4=32 to S_v/(16*8)=1, so the SMs starve. Profiling the +dense-prefill path attributed the whole regression (~-6%) to gated_delta_net +(+54% GPU time) at the (16,8) geometry. + +Gate the geometry by per-call scan length instead of applying (16,8) +unconditionally. Long scans (prefill, n_tokens >= GDN_PREFILL_NTOK, default 256) +take stock's high-grid.z (4,1) geometry; short scans (decode) keep the (16,8) +retune. This recovers dense prefill +7.2% back to stock parity while preserving +the (16,8) decode win. + +Bit-exact: patch 0022 proved every selectable {NUM_WARPS, COLS_PER_WARP} variant +is byte-identical (the sweep cannot change the md5), so this scan-length gate is +greedy-md5 bit-exact. 
GDN_PREFILL_NTOK tunes the crossover; the explicit +GDN_NW / GDN_CPW one-build %peak sweep still wins (the gate yields when either is +set), so the A/B harness is unchanged. + +Root cause: patch 0022 applied the (16,8) tile unconditionally. This patch +sequences after 0022/0044 (it edits the same gated_delta_net.cu case-128 +dispatch) and adds only the scan-length gate. + +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto + +diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu +index 7121d807f..26667afa2 100644 +--- a/ggml/src/ggml-cuda/gated_delta_net.cu ++++ b/ggml/src/ggml-cuda/gated_delta_net.cu +@@ -550,6 +550,23 @@ static void launch_gated_delta_net( + launch_gdn_variant<64, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS); + break; + case 128: { ++ // Dense-prefill regression fix: gate patch 0022's column-fold geometry by per-call scan ++ // length. The (16,8) tile is a DECODE win (short scans: n_tokens small, n_seqs large) but a ++ // long-sequential-scan PREFILL loss - grid.z collapses from S_v/4=32 to S_v/(16*8)=1, so the ++ // SMs starve on the long scan (profiled: gated_delta_net +54% GPU time == the whole dense- ++ // prefill regression). Long scans (prefill) take stock's high-grid.z (4,1) geometry; short ++ // scans (decode) keep the (16,8) winner. Every {NW,CPW} variant is byte-identical (patch 0022 ++ // proved md5-invariance across the ladder), so this stays greedy-md5 bit-exact. Default-on; ++ // GDN_PREFILL_NTOK tunes the crossover; the explicit GDN_NW/GDN_CPW sweep still wins (gate ++ // yields when either is set) so the one-build %peak A/B harness is unchanged. ++ static const int64_t gdn_prefill_ntok = ++ []{ const char * e = getenv("GDN_PREFILL_NTOK"); return e ? 
(int64_t) atoll(e) : (int64_t) 256; }(); ++ static const bool gdn_nw_forced = (getenv("GDN_NW") != nullptr); ++ static const bool gdn_cpw_forced = (getenv("GDN_CPW") != nullptr); ++ if (n_tokens >= gdn_prefill_ntok && !gdn_nw_forced && !gdn_cpw_forced) { ++ launch_gdn_variant<128, KDA, keep_rs_t, 4, 1, 2>(GDN_LAUNCH_ARGS); ++ break; ++ } + // Bit-exact occupancy/coalescing retune (patch 0022): fold COLS_PER_WARP columns per warp + // to raise per-warp memory-level parallelism on this bandwidth-bound recurrence. Default is + // the measured winner; GDN_NW / GDN_CPW override it for the one-build %peak sweep (every +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0047-paged-GDN-M5-tensor-core-chunked-scan-f32.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0047-paged-GDN-M5-tensor-core-chunked-scan-f32.patch new file mode 100644 index 000000000000..fb4622ba8173 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0047-paged-GDN-M5-tensor-core-chunked-scan-f32.patch @@ -0,0 +1,713 @@ +From 2c32ab8b7a6c5bc90454881b8c10f8bad4f7cee0 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Tue, 30 Jun 2026 09:45:13 +0200 +Subject: [PATCH] feat(paged): GDN M5 tensor-core chunked-scan prefill, + f32-only re-port (was patch 0044) + +Re-port the M5 tensor-core chunked gated-DeltaNet (GDN) prefill kernel from the +bf16/hybrid dev tree as an f32-only native commit, recovering the prefill win +that patch 0044 encoded, on the f32-only series (0026 ssm_bf16_tau dropped). + +What landed (f32/tf32 only): +- The mma.sync m16n8k8 helpers (tf32 + 3xtf32 limb-split; decays/gamma/beta stay + f32 outside the mma to preserve the bounded de-gating). +- gated_delta_net_chunked_cuda: the full tensor-core chunked scan, + KK/QK Gram (M2), KS/QS state-boundary 3xtf32 (M3), P*U output (M4), and the + form-T (A^-1) solve + Kc^T*DU state-update (M5). Selected by GDN_TC (0=serial + .. 4/5+=M5); the C=16 chunk-state stays in the 64KB smem buffer. 
+- Default-on under paged KV: GDN_TC=5, GDN_CHUNK_MIN=64 when LLAMA_KV_PAGED is set + and the user has not overridden either; OFF (INT_MAX) otherwise so the stock / + non-paged default is regression-free. GDN_CHUNK_MIN must stay > 1 (decode is 1 + token/call; at 1 the chunked path swallows decode and collapses S_TG). + +Stripped (not part of the f32-only series): the STATE_BF16 / HYBRID / gdn_state_t +/ gdn_hybrid_args template machinery (from dropped patch 0026), and the bf16 +CONFIG-C (M8) plus register-resident M6/M7 occupancy variants. The 0046 dense- +prefill geometry gate is untouched and coexists (it gates the SERIAL path; M5 is +the chunked path). + +Gates (GB10, sm_121a): +- Builds clean. +- Greedy md5 bit-exact (per-path, n=48 --temp 0 --seed 1, paged): dense + q36-27b-nvfp4 = 5951a5b4d624ce891e22ab5fca9bc439, MoE q36-35b-a3b-nvfp4 = + 8cb0ce23777bf55f92f63d0292c756b0, both default AND force-M5 (GDN_CHUNK_MIN=1). + test-backend-ops GATED_DELTA_NET 46/46 default and force-M5 (incl. the + multi-chunk, tail-chunk and multi-seq shapes). +- Prefill S_PP, MoE, LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1, -ntg 4 -npl 32, + vs the patch-0044 baseline (pre-0046, GDN_PREFILL_NTOK huge): +4.3% @512, + +17.8% @2048 (reproduces patch 0044; M5-on absolute matches patch 0044 M5). + vs the current 0046 baseline (0046 already raised the long-scan sequential + prefill): +4.3% @512, +1.2% @2048. +- Decode S_TG unchanged (within run noise). 
+ +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + ggml/src/ggml-cuda/gated_delta_net.cu | 550 +++++++++++++++++++++++--- + tests/test-backend-ops.cpp | 5 + + 2 files changed, 496 insertions(+), 59 deletions(-) + +diff --git a/ggml/src/ggml-cuda/gated_delta_net.cu b/ggml/src/ggml-cuda/gated_delta_net.cu +index 26667afa2..0ceb1bc8f 100644 +--- a/ggml/src/ggml-cuda/gated_delta_net.cu ++++ b/ggml/src/ggml-cuda/gated_delta_net.cu +@@ -298,7 +298,115 @@ static void launch_gdn_variant( + // strong-decay tokens underflow to the correct zero rather than to inf. The math + // is equivalent to the sequential recurrence up to FP reduction order (a NEW + // per-path result, validated benign by test-backend-ops NMSE and greedy output). +-template ++// --- Phase-1 tensor-core Gram helpers (tf32 m16n8k8 mma.sync; sm_80+/sm_121a). --- ++// Reproduces the PoC-proven path (~/scratch_tc_gdn_poc/gdn_gram_bench.cu, tf32 NMSE ~3e-9): ++// out[rowbase..+15][colbase..+7] = Xs[rows] . Ys[cols], Xs/Ys row-major [*][DK]. ++__device__ __forceinline__ unsigned gdn_f2tf32(float f) { ++#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800 ++ unsigned r; ++ asm("cvt.rna.tf32.f32 %0, %1;" : "=r"(r) : "f"(f)); ++ return r; ++#else ++ (void) f; ++ return 0u; ++#endif ++} ++ ++// Operand loaders for the Gram/state mma helpers: stage f32 operands as tf32. This f32-only ++// re-port keeps every operand full-width -- the plain-tf32 path (10-bit mantissa, f32 accumulate) ++// is the highest-precision tensor-core option on sm_121a, and the 3xtf32 limb-split helpers below ++// recover near-f32 accuracy for the decay-coupled state-boundary (KS/QS) and state-carry products ++// whose error feeds the A-inverse solve / compounds across chunks. 
++__device__ __forceinline__ unsigned gdn_ld_tf32(float f) { return gdn_f2tf32(f); } ++__device__ __forceinline__ float gdn_ld_f32 (float f) { return f; } ++ ++__device__ __forceinline__ void gdn_mma_m16n8k8(float c[4], const unsigned a[4], const unsigned b[2]) { ++#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800 ++ asm volatile( ++ "mma.sync.aligned.m16n8k8.row.col.f32.tf32.tf32.f32 " ++ "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%0,%1,%2,%3};\n" ++ : "+f"(c[0]), "+f"(c[1]), "+f"(c[2]), "+f"(c[3]) ++ : "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]), "r"(b[0]), "r"(b[1])); ++#else ++ (void) c; (void) a; (void) b; ++#endif ++} ++ ++template ++__device__ __forceinline__ void gdn_gram_tile_mma( ++ float c[4], const TX * __restrict__ Xs, const TY * __restrict__ Ys, ++ int rowbase, int colbase, int lg, int lt) { ++ c[0] = c[1] = c[2] = c[3] = 0.0f; ++ #pragma unroll ++ for (int ks = 0; ks < DK; ks += 8) { ++ unsigned a[4], b[2]; ++ a[0] = gdn_ld_tf32(Xs[(rowbase + lg ) * DK + ks + lt ]); ++ a[1] = gdn_ld_tf32(Xs[(rowbase + lg + 8) * DK + ks + lt ]); ++ a[2] = gdn_ld_tf32(Xs[(rowbase + lg ) * DK + ks + lt + 4]); ++ a[3] = gdn_ld_tf32(Xs[(rowbase + lg + 8) * DK + ks + lt + 4]); ++ b[0] = gdn_ld_tf32(Ys[(colbase + lg ) * DK + ks + lt ]); ++ b[1] = gdn_ld_tf32(Ys[(colbase + lg ) * DK + ks + lt + 4]); ++ gdn_mma_m16n8k8(c, a, b); ++ } ++} ++ ++// 3xtf32 (CUTLASS fp32-emulation): split each f32 operand into hi/lo tf32 limbs and run ++// 3 limb-products per k-subtile (hi*hi + hi*lo + lo*hi); ~f32 accuracy at ~3x the mma count. ++// Used for the state-boundary products (KS/QS) whose error feeds the A-inverse solve (M3). 
++template ++__device__ __forceinline__ void gdn_gram_tile_mma_3x( ++ float c[4], const TX * __restrict__ Xs, const TY * __restrict__ Ys, ++ int rowbase, int colbase, int lg, int lt) { ++ c[0] = c[1] = c[2] = c[3] = 0.0f; ++ #pragma unroll ++ for (int ks = 0; ks < DK; ks += 8) { ++ float af[4], bf[2]; ++ af[0] = gdn_ld_f32(Xs[(rowbase + lg ) * DK + ks + lt ]); ++ af[1] = gdn_ld_f32(Xs[(rowbase + lg + 8) * DK + ks + lt ]); ++ af[2] = gdn_ld_f32(Xs[(rowbase + lg ) * DK + ks + lt + 4]); ++ af[3] = gdn_ld_f32(Xs[(rowbase + lg + 8) * DK + ks + lt + 4]); ++ bf[0] = gdn_ld_f32(Ys[(colbase + lg ) * DK + ks + lt ]); ++ bf[1] = gdn_ld_f32(Ys[(colbase + lg ) * DK + ks + lt + 4]); ++ unsigned ahi[4], alo[4], bhi[2], blo[2]; ++ #pragma unroll ++ for (int z = 0; z < 4; z++) { ahi[z] = gdn_f2tf32(af[z]); alo[z] = gdn_f2tf32(af[z] - __uint_as_float(ahi[z])); } ++ #pragma unroll ++ for (int z = 0; z < 2; z++) { bhi[z] = gdn_f2tf32(bf[z]); blo[z] = gdn_f2tf32(bf[z] - __uint_as_float(bhi[z])); } ++ gdn_mma_m16n8k8(c, ahi, bhi); // hi*hi (dominant limb) ++ gdn_mma_m16n8k8(c, ahi, blo); // hi*lo ++ gdn_mma_m16n8k8(c, alo, bhi); // lo*hi ++ } ++} ++ ++// State-update tile (P6): S_C[i][j] += sum_t Kc[t][i] * DU[t][j], with Kc read TRANSPOSED ++// (i as the m16n8k8 M-row, t as the K-contraction) and DU = d(t,last)*U staged in the Ud ++// layout (DUd[j*KC + t]). 3xtf32: the cross-chunk carry compounds over every chunk step. 
++template ++__device__ __forceinline__ void gdn_state_tile_mma_3x( ++ float c[4], const TK * __restrict__ Kc, const TD * __restrict__ DUd, ++ int rowbase, int colbase, int lg, int lt) { ++ c[0] = c[1] = c[2] = c[3] = 0.0f; ++ #pragma unroll ++ for (int ks = 0; ks < KC; ks += 8) { ++ float af[4], bf[2]; ++ af[0] = gdn_ld_f32(Kc[(ks + lt ) * DK + (rowbase + lg )]); ++ af[1] = gdn_ld_f32(Kc[(ks + lt ) * DK + (rowbase + lg + 8)]); ++ af[2] = gdn_ld_f32(Kc[(ks + lt + 4) * DK + (rowbase + lg )]); ++ af[3] = gdn_ld_f32(Kc[(ks + lt + 4) * DK + (rowbase + lg + 8)]); ++ bf[0] = gdn_ld_f32(DUd[(colbase + lg) * KC + (ks + lt )]); ++ bf[1] = gdn_ld_f32(DUd[(colbase + lg) * KC + (ks + lt + 4)]); ++ unsigned ahi[4], alo[4], bhi[2], blo[2]; ++ #pragma unroll ++ for (int z = 0; z < 4; z++) { ahi[z] = gdn_f2tf32(af[z]); alo[z] = gdn_f2tf32(af[z] - __uint_as_float(ahi[z])); } ++ #pragma unroll ++ for (int z = 0; z < 2; z++) { bhi[z] = gdn_f2tf32(bf[z]); blo[z] = gdn_f2tf32(bf[z] - __uint_as_float(bhi[z])); } ++ gdn_mma_m16n8k8(c, ahi, bhi); ++ gdn_mma_m16n8k8(c, ahi, blo); ++ gdn_mma_m16n8k8(c, alo, bhi); ++ } ++} ++ ++template + __global__ void gated_delta_net_chunked_cuda( + const float * __restrict__ q, const float * __restrict__ k, + const float * __restrict__ v, const float * __restrict__ g, +@@ -329,6 +437,9 @@ __global__ void gated_delta_net_chunked_cuda( + float * csh = Amat + (size_t) C * C; // [C] cumsum(log-gate) + float * gam = csh + C; // [C] gamma_t = exp(cs_t) + float * bet = gam + C; // [C] beta_t ++ // Phase-1 tensor-core Gram scratch (allocated only when GRAM_MMA; KK feeds A, QK feeds P). ++ float * KKsh = bet + C; // [C*C] KK[t][t'] = k_t . k_t' (stride C) ++ float * QKsh = KKsh + (size_t) C * C; // [C*C] QK[t][t'] = q_t . k_t' (stride C) + + // S0: thread j owns column j (Sd[j*dk + i]); load is a contiguous per-thread copy from the + // M-layout cache view (read_state[j*dk + i] = M[j*S_v + i] = S[i][j]). 
Same identity/gather +@@ -357,6 +468,15 @@ __global__ void gated_delta_net_chunked_cuda( + Kc[t * dk + i] = k_base[(c0 + t) * sq2 + i]; + Qc[t * dk + i] = q_base[(c0 + t) * sq2 + i]; + } ++ if constexpr (TC >= 3) { ++ // Zero the stale K/Q tail (rows t >= Cc): the tensor-core mma paths contract the full ++ // chunk dim and 0*NaN (uninitialized smem) would poison the result. Serial paths only ++ // touch t < Cc, so this is gated to the mma levels. ++ for (int e = Cc * dk + j; e < C * dk; e += dv) { ++ Kc[e] = 0.0f; ++ Qc[e] = 0.0f; ++ } ++ } + if (j < Cc) { + csh[j] = g[gb_base + (c0 + j) * sb2]; // raw log-gate, prefix-summed below + bet[j] = beta[gb_base + (c0 + j) * sb2]; +@@ -372,15 +492,53 @@ __global__ void gated_delta_net_chunked_cuda( + } + __syncthreads(); + ++ // --- Phase-1: tensor-core tf32 Gram products (KK->A via warp0, QK->P via warp1). --- ++ // Full C x C tiles into KKsh/QKsh (stride C); decay/beta applied in f32 in the loops below. ++ // Tail chunks (Cc= Cc, but those entries are never read. ++ if constexpr (TC >= 1) { ++ const int w = threadIdx.x >> 5; // warp: 0 -> KK, 1 -> QK ++ const int lane = threadIdx.x & 31; ++ const int lg = lane >> 2; // 0..7 ++ const int lt = lane & 3; // 0..3 ++ if (w < 2) { ++ const float * Xs = (w == 0) ? Kc : Qc; ++ float * Out = (w == 0) ? KKsh : QKsh; ++ #pragma unroll ++ for (int mt = 0; mt < (C + 15) / 16; mt++) { ++ const int rowbase = mt * 16; ++ #pragma unroll ++ for (int nt = 0; nt < (C + 7) / 8; nt++) { ++ const int colbase = nt * 8; ++ float cc[4]; ++ gdn_gram_tile_mma(cc, Xs, Kc, rowbase, colbase, lg, lt); ++ const int rr[4] = {rowbase + lg, rowbase + lg, rowbase + lg + 8, rowbase + lg + 8}; ++ const int ccol[4] = {colbase + 2*lt, colbase + 2*lt + 1, colbase + 2*lt, colbase + 2*lt + 1}; ++ #pragma unroll ++ for (int l = 0; l < 4; l++) { ++ if (rr[l] < C && ccol[l] < C) { ++ Out[rr[l] * C + ccol[l]] = cc[l]; ++ } ++ } ++ } ++ } ++ } ++ __syncthreads(); ++ } ++ + // --- A = I + tril(beta_t * d(t',t) * (k_t . 
k_t'), -1) (cooperative over C*C) --- + for (int e = j; e < Cc * Cc; e += dv) { + const int t = e / Cc; + const int tp = e % Cc; + float a = 0.0f; + if (tp < t) { +- float kk = 0.0f; +- for (int i = 0; i < dk; i++) { +- kk += Kc[t * dk + i] * Kc[tp * dk + i]; ++ float kk; ++ if constexpr (TC >= 1) { ++ kk = KKsh[t * C + tp]; ++ } else { ++ kk = 0.0f; ++ for (int i = 0; i < dk; i++) { ++ kk += Kc[t * dk + i] * Kc[tp * dk + i]; ++ } + } + const float dd = expf(csh[t] - csh[tp]); // d(tp,t) = gamma_t/gamma_tp + a = bet[t] * dd * kk; +@@ -392,65 +550,304 @@ __global__ void gated_delta_net_chunked_cuda( + __syncthreads(); + + // --- RHS[t][j] = beta_t (v_t[j] - gamma_t * (S0^T k_t)[j]) -> Ud[j*C + t] --- +- for (int t = 0; t < Cc; t++) { +- float ks = 0.0f; // (S0^T k_t)[j] = sum_i S[i][j] k_t[i] +- for (int i = 0; i < dk; i++) { +- ks += Sd[j * dk + i] * Kc[t * dk + i]; ++ if constexpr (TC >= 2) { ++ // M3: fused tensor-core KS = Kc * S0 (3xtf32 state-boundary product). The mma ++ // output is consumed straight from registers into RHS -> Ud, so NO extra C*dv ++ // smem buffer is needed (the 64KB state still occupies smem until M6). Warp w ++ // owns dv n-tiles [w*NTPW, ..); each lane writes the RHS entries it produced. 
++ const int w = threadIdx.x >> 5; ++ const int lane = threadIdx.x & 31; ++ const int lg = lane >> 2; ++ const int lt = lane & 3; ++ constexpr int NWARP = S_v / 32; ++ constexpr int NT = dv / 8; ++ constexpr int NTPW = (NT + NWARP - 1) / NWARP; ++ #pragma unroll ++ for (int mt = 0; mt < (C + 15) / 16; mt++) { ++ const int rowbase = mt * 16; ++ #pragma unroll ++ for (int nn = 0; nn < NTPW; nn++) { ++ const int nt = w * NTPW + nn; ++ if (nt >= NT) break; ++ const int colbase = nt * 8; ++ float cc[4]; ++ gdn_gram_tile_mma_3x(cc, Kc, Sd, rowbase, colbase, lg, lt); ++ const int tt[4] = {rowbase + lg, rowbase + lg, rowbase + lg + 8, rowbase + lg + 8}; ++ const int jj[4] = {colbase + 2*lt, colbase + 2*lt + 1, colbase + 2*lt, colbase + 2*lt + 1}; ++ #pragma unroll ++ for (int l = 0; l < 4; l++) { ++ const int t = tt[l], jc = jj[l]; ++ if (t < Cc && jc < dv) { ++ const float vtj = v_base[(c0 + t) * sv2 + jc]; ++ Ud[jc * C + t] = bet[t] * (vtj - gam[t] * cc[l]); ++ } ++ } ++ } ++ } ++ __syncthreads(); // RHS written cross-thread -> publish before the per-column solve ++ } else { ++ for (int t = 0; t < Cc; t++) { ++ float ks = 0.0f; // (S0^T k_t)[j] = sum_i S[i][j] k_t[i] ++ for (int i = 0; i < dk; i++) { ++ ks += Sd[j * dk + i] * Kc[t * dk + i]; ++ } ++ const float vtj = v_base[(c0 + t) * sv2 + j]; ++ Ud[j * C + t] = bet[t] * (vtj - gam[t] * ks); + } +- const float vtj = v_base[(c0 + t) * sv2 + j]; +- Ud[j * C + t] = bet[t] * (vtj - gam[t] * ks); ++ } ++ if constexpr (TC >= 3) { ++ // Zero the stale RHS tail (rows t >= Cc) before the full-K mma consumers (P*U at TC>=3; ++ // apply + state at TC>=4). Without this the masked tail terms compute 0*NaN = NaN. 
++ for (int t = Cc; t < C; t++) Ud[j * C + t] = 0.0f; ++ __syncthreads(); + } + +- // --- solve A U = RHS in place (unit lower-tri fwd subst); per-thread, no inter-step sync --- +- for (int t = 1; t < Cc; t++) { +- float acc = Ud[j * C + t]; +- for (int tp = 0; tp < t; tp++) { +- acc -= Amat[t * Cc + tp] * Ud[j * C + tp]; ++ // --- solve A U = RHS (A unit-lower-tri) --- ++ if constexpr (TC >= 4) { ++ // M5/P7: form T = A^{-1} explicitly (FLA UT transform), then U = T*RHS as one ++ // dependency-free tf32 GEMM. At C<=16 A is a single b=16 block, so the off-diagonal ++ // Phase-O is empty; only the f32 in-shared diagonal inverse (Phase-D) + the wide ++ // apply remain. Phase-D: column-parallel EXACT f32 inverse of the Cc x Cc unit- ++ // lower-tri A -- thread c solves A x = e_c, writing column c of T into KKsh (free ++ // since KK was consumed into A). This is the strong-coupling amplifier -> f32. ++ if (j < C) { ++ if (j < Cc) { ++ float x[C]; ++ #pragma unroll ++ for (int r = 0; r < C; r++) x[r] = 0.0f; ++ x[j] = 1.0f; ++ for (int r = j + 1; r < Cc; r++) { ++ float acc = 0.0f; ++ for (int m = j; m < r; m++) acc += Amat[r * Cc + m] * x[m]; ++ x[r] = -acc; ++ } ++ #pragma unroll ++ for (int r = 0; r < C; r++) KKsh[r * C + j] = x[r]; // rows >= Cc are 0 ++ } else { ++ #pragma unroll ++ for (int r = 0; r < C; r++) KKsh[r * C + j] = 0.0f; // cols >= Cc are 0 ++ } ++ } ++ __syncthreads(); ++ // Apply U = T*RHS, M=C N=dv K=C; T=KKsh (stride C), RHS=Ud (stride C). In place on ++ // Ud: hold every output tile in registers, sync to finish the RHS reads, then ++ // overwrite Ud with U (avoids the read/write aliasing of a same-buffer GEMM). 
++ { ++ const int w = threadIdx.x >> 5; ++ const int lane = threadIdx.x & 31; ++ const int lg = lane >> 2; ++ const int lt = lane & 3; ++ constexpr int NWARP = S_v / 32; ++ constexpr int NT = dv / 8; ++ constexpr int NTPW = (NT + NWARP - 1) / NWARP; ++ float ureg[NTPW][4]; ++ #pragma unroll ++ for (int nn = 0; nn < NTPW; nn++) { ++ const int nt = w * NTPW + nn; ++ if (nt < NT) gdn_gram_tile_mma(ureg[nn], KKsh, Ud, 0, nt * 8, lg, lt); ++ } ++ __syncthreads(); // all RHS(Ud) reads done before overwriting with U ++ #pragma unroll ++ for (int nn = 0; nn < NTPW; nn++) { ++ const int nt = w * NTPW + nn; ++ if (nt >= NT) continue; ++ const int colbase = nt * 8; ++ const int tt[4] = {lg, lg, lg + 8, lg + 8}; ++ const int jj[4] = {colbase + 2*lt, colbase + 2*lt + 1, colbase + 2*lt, colbase + 2*lt + 1}; ++ #pragma unroll ++ for (int l = 0; l < 4; l++) { ++ const int t = tt[l], jc = jj[l]; ++ if (t < Cc && jc < dv) Ud[jc * C + t] = ureg[nn][l]; ++ } ++ } ++ __syncthreads(); + } +- Ud[j * C + t] = acc; ++ } else { ++ for (int t = 1; t < Cc; t++) { ++ float acc = Ud[j * C + t]; ++ for (int tp = 0; tp < t; tp++) { ++ acc -= Amat[t * Cc + tp] * Ud[j * C + tp]; ++ } ++ Ud[j * C + t] = acc; ++ } ++ __syncthreads(); // U finalized; Amat free for P below + } +- __syncthreads(); // U finalized; Amat free for P below (and Ud read across-thread? no, own col) + +- // --- P[t][t'] = d(t',t) * (q_t . k_t') for t' <= t (reuse Amat) --- +- for (int e = j; e < Cc * Cc; e += dv) { +- const int t = e / Cc; +- const int tp = e % Cc; +- float p = 0.0f; +- if (tp <= t) { +- float qk = 0.0f; +- for (int i = 0; i < dk; i++) { +- qk += Qc[t * dk + i] * Kc[tp * dk + i]; ++ // --- P[t][t'] = d(t',t) * (q_t . k_t') for t' <= t --- ++ if constexpr (TC >= 3) { ++ // M4: build P (lower-tri, decay pre-baked in f32 -> bounded) IN PLACE in QKsh at ++ // fixed stride C so the P*U output mma can read it as a tf32 A-operand. 
Full C*C ++ // grid: upper-tri / out-of-range entries are zeroed so the K=C mma needs no masking. ++ for (int e = j; e < C * C; e += dv) { ++ const int t = e / C; ++ const int tp = e % C; ++ float p = 0.0f; ++ if (tp <= t && t < Cc && tp < Cc) { ++ const float dd = expf(csh[t] - csh[tp]); ++ p = dd * QKsh[t * C + tp]; // QKsh holds QK (M2); overwrite in place with P ++ } ++ QKsh[t * C + tp] = p; ++ } ++ __syncthreads(); ++ } else { ++ for (int e = j; e < Cc * Cc; e += dv) { ++ const int t = e / Cc; ++ const int tp = e % Cc; ++ float p = 0.0f; ++ if (tp <= t) { ++ float qk; ++ if constexpr (TC >= 1) { ++ qk = QKsh[t * C + tp]; ++ } else { ++ qk = 0.0f; ++ for (int i = 0; i < dk; i++) { ++ qk += Qc[t * dk + i] * Kc[tp * dk + i]; ++ } ++ } ++ const float dd = expf(csh[t] - csh[tp]); ++ p = dd * qk; + } +- const float dd = expf(csh[t] - csh[tp]); +- p = dd * qk; ++ Amat[t * Cc + tp] = p; + } +- Amat[t * Cc + tp] = p; ++ __syncthreads(); + } +- __syncthreads(); + + // --- O[t][j] = gamma_t (S0^T q_t)[j] + sum_{t'<=t} P[t][t'] U[t'][j] (* scale) --- +- for (int t = 0; t < Cc; t++) { +- float qs = 0.0f; // (S0^T q_t)[j] (uses pre-update S) +- for (int i = 0; i < dk; i++) { +- qs += Sd[j * dk + i] * Qc[t * dk + i]; ++ if constexpr (TC >= 2) { ++ // M3: fused tensor-core QS = Qc * S0 (3xtf32, pre-update S0). Deposit the ++ // gamma_t*QS[t][j] cross-chunk term into dst from the mma registers; the O loop ++ // below reads it back (published via __syncthreads) and adds the intra-chunk P*U. 
++ const int w = threadIdx.x >> 5; ++ const int lane = threadIdx.x & 31; ++ const int lg = lane >> 2; ++ const int lt = lane & 3; ++ constexpr int NWARP = S_v / 32; ++ constexpr int NT = dv / 8; ++ constexpr int NTPW = (NT + NWARP - 1) / NWARP; ++ #pragma unroll ++ for (int mt = 0; mt < (C + 15) / 16; mt++) { ++ const int rowbase = mt * 16; ++ #pragma unroll ++ for (int nn = 0; nn < NTPW; nn++) { ++ const int nt = w * NTPW + nn; ++ if (nt >= NT) break; ++ const int colbase = nt * 8; ++ float cc[4]; ++ gdn_gram_tile_mma_3x(cc, Qc, Sd, rowbase, colbase, lg, lt); ++ const int tt[4] = {rowbase + lg, rowbase + lg, rowbase + lg + 8, rowbase + lg + 8}; ++ const int jj[4] = {colbase + 2*lt, colbase + 2*lt + 1, colbase + 2*lt, colbase + 2*lt + 1}; ++ #pragma unroll ++ for (int l = 0; l < 4; l++) { ++ const int t = tt[l], jc = jj[l]; ++ if (t < Cc && jc < dv) { ++ attn_base[(c0 + t) * S_v * H + jc] = gam[t] * cc[l]; ++ } ++ } ++ } + } +- float o = gam[t] * qs; +- for (int tp = 0; tp <= t; tp++) { +- o += Amat[t * Cc + tp] * Ud[j * C + tp]; ++ __syncthreads(); ++ } ++ if constexpr (TC >= 3) { ++ // M4: O += P*U via tensor-core (tf32-safe: P is f32-bounded, decay pre-baked). ++ // GEMM O[t][j] += sum_t' P[t][t']*U[t'][j], M=C N=dv K=C; P=QKsh (stride C), ++ // U=Ud (stride C). The gamma_t*QS cross-chunk term was deposited into dst above; ++ // fold it in here then * scale. Warp w owns dv n-tiles [w*NTPW, ..). 
++ const int w = threadIdx.x >> 5; ++ const int lane = threadIdx.x & 31; ++ const int lg = lane >> 2; ++ const int lt = lane & 3; ++ constexpr int NWARP = S_v / 32; ++ constexpr int NT = dv / 8; ++ constexpr int NTPW = (NT + NWARP - 1) / NWARP; ++ #pragma unroll ++ for (int mt = 0; mt < (C + 15) / 16; mt++) { ++ const int rowbase = mt * 16; ++ #pragma unroll ++ for (int nn = 0; nn < NTPW; nn++) { ++ const int nt = w * NTPW + nn; ++ if (nt >= NT) break; ++ const int colbase = nt * 8; ++ float cc[4]; ++ gdn_gram_tile_mma(cc, QKsh, Ud, rowbase, colbase, lg, lt); ++ const int tt[4] = {rowbase + lg, rowbase + lg, rowbase + lg + 8, rowbase + lg + 8}; ++ const int jj[4] = {colbase + 2*lt, colbase + 2*lt + 1, colbase + 2*lt, colbase + 2*lt + 1}; ++ #pragma unroll ++ for (int l = 0; l < 4; l++) { ++ const int t = tt[l], jc = jj[l]; ++ if (t < Cc && jc < dv) { ++ const int64_t oi = (int64_t)(c0 + t) * S_v * H + jc; ++ attn_base[oi] = (attn_base[oi] + cc[l]) * scale; // QS term + P*U ++ } ++ } ++ } ++ } ++ } else { ++ for (int t = 0; t < Cc; t++) { ++ float o; ++ if constexpr (TC >= 2) { ++ o = attn_base[(c0 + t) * S_v * H + j]; // gamma_t*QS[t][j] deposited above ++ } else { ++ float qs = 0.0f; // (S0^T q_t)[j] (uses pre-update S) ++ for (int i = 0; i < dk; i++) { ++ qs += Sd[j * dk + i] * Qc[t * dk + i]; ++ } ++ o = gam[t] * qs; ++ } ++ for (int tp = 0; tp <= t; tp++) { ++ o += Amat[t * Cc + tp] * Ud[j * C + tp]; ++ } ++ attn_base[(c0 + t) * S_v * H + j] = o * scale; + } +- attn_base[(c0 + t) * S_v * H + j] = o * scale; + } + + // --- S_C[i][j] = gamma_{C-1} S[i][j] + sum_t d(t,C-1) k_t[i] u_t[j] --- + const float glast = gam[Cc - 1]; + const float cslast = csh[Cc - 1]; +- for (int i = 0; i < dk; i++) { +- float s = glast * Sd[j * dk + i]; +- for (int t = 0; t < Cc; t++) { +- const float dd = expf(cslast - csh[t]); // d(t, last) +- s += dd * Kc[t * dk + i] * Ud[j * C + t]; ++ if constexpr (TC >= 4) { ++ // M5/P6: state carry S_C = glast*S0 + Kc^T * DU via 3xtf32 mma. 
DU[t][j] = ++ // d(t,last)*U[t][j] is built IN PLACE in Ud (t>=Cc zeroed so the K=C contraction ++ // needs no per-k masking), then S_C accumulates over the chunk dim t. Kc is read ++ // transposed (i as M-row). M=dk N=dv K=C. Each output (i,j) has a unique owner so ++ // the glast*S0 read-modify-write is race-free. ++ for (int t = 0; t < C; t++) { ++ const float dd = (t < Cc) ? expf(cslast - csh[t]) : 0.0f; ++ Ud[j * C + t] = dd * Ud[j * C + t]; // thread j owns column j -> DU in place ++ } ++ __syncthreads(); ++ const int w = threadIdx.x >> 5; ++ const int lane = threadIdx.x & 31; ++ const int lg = lane >> 2; ++ const int lt = lane & 3; ++ constexpr int NWARP = S_v / 32; ++ constexpr int MT = dk / 16; // m-tiles over dk ++ constexpr int NT = dv / 8; // n-tiles over dv ++ constexpr int NTILES = MT * NT; ++ constexpr int TPW = (NTILES + NWARP - 1) / NWARP; ++ #pragma unroll ++ for (int idx = 0; idx < TPW; idx++) { ++ const int tile = w * TPW + idx; ++ if (tile >= NTILES) break; ++ const int rowbase = (tile / NT) * 16; ++ const int colbase = (tile % NT) * 8; ++ float cc[4]; ++ gdn_state_tile_mma_3x(cc, Kc, Ud, rowbase, colbase, lg, lt); ++ const int ii[4] = {rowbase + lg, rowbase + lg, rowbase + lg + 8, rowbase + lg + 8}; ++ const int jj[4] = {colbase + 2*lt, colbase + 2*lt + 1, colbase + 2*lt, colbase + 2*lt + 1}; ++ #pragma unroll ++ for (int l = 0; l < 4; l++) { ++ const int i = ii[l], jc = jj[l]; ++ Sd[jc * dk + i] = glast * Sd[jc * dk + i] + cc[l]; ++ } ++ } ++ } else { ++ for (int i = 0; i < dk; i++) { ++ float s = glast * Sd[j * dk + i]; ++ for (int t = 0; t < Cc; t++) { ++ const float dd = expf(cslast - csh[t]); // d(t, last) ++ s += dd * Kc[t * dk + i] * Ud[j * C + t]; ++ } ++ Sd[j * dk + i] = s; + } +- Sd[j * dk + i] = s; + } + __syncthreads(); // Sd reused as S0 of next chunk; Kc/Qc/Amat reloaded next chunk + } +@@ -464,8 +861,7 @@ __global__ void gated_delta_net_chunked_cuda( + st[j * dk + i] = Sd[j * dk + i]; + } + } +- +-template ++template + static 
void launch_gdn_chunked( + const float * q_d, const float * k_d, const float * v_d, + const float * g_d, const float * b_d, const float * s_d, +@@ -477,10 +873,11 @@ static void launch_gdn_chunked( + const uint3 neqk1_magic, const uint3 rq3_magic, + float scale, cudaStream_t stream) { + const size_t smem = ((size_t) S_v * S_v + (size_t) 2 * C * S_v + (size_t) S_v * C +- + (size_t) C * C + (size_t) 3 * C) * sizeof(float); ++ + (size_t) C * C + (size_t) 3 * C ++ + (TC >= 1 ? (size_t) 2 * C * C : (size_t) 0)) * sizeof(float); + static bool attr_set = false; + if (!attr_set) { +- const cudaError_t e = cudaFuncSetAttribute(gated_delta_net_chunked_cuda, ++ const cudaError_t e = cudaFuncSetAttribute(gated_delta_net_chunked_cuda, + cudaFuncAttributeMaxDynamicSharedMemorySize, (int) smem); + if (e != cudaSuccess) { + GGML_ABORT("gdn chunked: cudaFuncSetAttribute(maxDynSmem=%zu) failed: %s\n", smem, cudaGetErrorString(e)); +@@ -489,7 +886,7 @@ static void launch_gdn_chunked( + } + dim3 grid_dims(H, n_seqs, 1); + dim3 block_dims(S_v, 1, 1); +- gated_delta_net_chunked_cuda<<>>( ++ gated_delta_net_chunked_cuda<<>>( + q_d, k_d, v_d, g_d, b_d, s_d, dst_d, H, n_tokens, n_seqs, + sq1, sq2, sq3, sv1, sv2, sv3, sb1, sb2, sb3, + neqk1_magic, rq3_magic, scale, state_dst_d, ids_d, rs_head); +@@ -519,17 +916,52 @@ static void launch_gated_delta_net( + // sequential recurrence. Mathematically equivalent up to FP reduction order (NEW per-path md5; + // validated benign by test-backend-ops NMSE + greedy output). Toggle: GDN_CHUNK_OFF / GDN_CHUNK_MIN. + if constexpr (!KDA && !keep_rs_t) { +- // OPT-IN: this chunked path is bit-exact-benign (test-backend-ops green) but, at C=16 +- // (forced by GB10 99KB dyn-smem opt-in, all-shared), it is NOT yet faster than the tuned +- // sequential recurrence on this model (measured ~22%% slower S_PP, grid-starved at low +- // n_seqs + 1 block/SM occupancy). 
Default OFF so the backend default is regression-free;
+- // enable for experiments / tuning with GDN_CHUNK_MIN=. See README section 5 (dev notes / rejected-flat levers).
+- static const int gdn_chunk_min = []{ const char * e = getenv("GDN_CHUNK_MIN"); return e ? atoi(e) : INT_MAX; }();
++ // DEFAULT-ON UNDER PAGED KV (f32-only re-port of patch 0044's M5). The M5 tensor-core path
++ // (GDN_TC=5: full-TC form-T solve + state-update mma, state in the 64KB smem buffer, C=16)
++ // is greedy-bit-exact (per-path md5 == the sequential canonical on the short gate prompt)
++ // and *beats* the tuned sequential recurrence on the Qwen3.6 MoE prefill: GB10,
++ // q36-35b-a3b-nvfp4, LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1, -ntg 4 -npl 32:
++ // -npp 512 : S_PP +3.5% ; -npp 2048 : S_PP +17.7% (more chunks to parallelize).
++ // Decode S_TG is unchanged (1-token calls never reach the engage threshold).
++ // GDN_CHUNK_MIN is the per-call engage threshold and MUST stay > 1: decode is 1 token/call,
++ // so any threshold above 1 leaves every decode step on the sequential recurrence (at
++ // GDN_CHUNK_MIN=1 the chunked path swallows decode and collapses S_TG by ~25%). Tuned to 64:
++ // above decode/tiny-call sizes, below the real MoE-prefill per-call count. OFF (INT_MAX) when
++ // not paged, so the stock / non-paged default is regression-free. Both knobs env-overridable.
++ static const bool kv_paged = (getenv("LLAMA_KV_PAGED") != nullptr);
++ static const int gdn_chunk_min = []{
++ const char * e = getenv("GDN_CHUNK_MIN");
++ if (e) return atoi(e);
++ return kv_paged ? 64 : INT_MAX;
++ }();
++ // Tensor-core level selector (single build, clean runtime A/B). GDN_TC:
++ // 0 = serial scan (patch 0031); 1 = KK/QK Gram mma (M2);
++ // 2 = + KS/QS state-boundary mma, 3xtf32 (M3); 3 = + P*U output mma (M4);
++ // 4/5+ = M5 (full TC: form-T solve + state-update mma) - the DEFAULT under paged KV.
++ // (The bf16 CONFIG-C and register-resident M6/M7/M8 occupancy variants of patch 0044 are
++ // intentionally absent from this f32-only series; the +3.5/+17.7% prefill win is the M5 path.)
++ // GDN_GRAM_MMA=1 is kept as an alias for level 1.
++ static const int gdn_tc = []{
++ const char * e = getenv("GDN_TC");
++ if (e) return atoi(e);
++ const char * gm = getenv("GDN_GRAM_MMA");
++ if (gm && atoi(gm) != 0) return 1;
++ return kv_paged ? 5 : 0;
++ }();
+ if (S_v == 128 && n_tokens >= gdn_chunk_min) {
+- launch_gdn_chunked<128, 16>(
+- q_d, k_d, v_d, g_d, b_d, (const float *) s_d, dst_d, (float *) state_dst_d, ids_d, rs_head,
+- H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, sb1, sb2, sb3,
+- neqk1_magic, rq3_magic, scale, stream);
++#define GDN_CHUNKED_LAUNCH(TC_) \
++ launch_gdn_chunked<128, 16, TC_>( \
++ q_d, k_d, v_d, g_d, b_d, (const float *) s_d, dst_d, (float *) state_dst_d, ids_d, rs_head, \
++ H, n_tokens, n_seqs, sq1, sq2, sq3, sv1, sv2, sv3, sb1, sb2, sb3, \
++ neqk1_magic, rq3_magic, scale, stream)
++ switch (gdn_tc) {
++ case 0: GDN_CHUNKED_LAUNCH(0); break;
++ case 1: GDN_CHUNKED_LAUNCH(1); break;
++ case 2: GDN_CHUNKED_LAUNCH(2); break;
++ case 3: GDN_CHUNKED_LAUNCH(3); break;
++ default: GDN_CHUNKED_LAUNCH(4); break; // GDN_TC >= 4 -> M5 (full TC, kernel TC=4)
++ }
++#undef GDN_CHUNKED_LAUNCH
+ return;
+ }
+ }
+diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
+index 4e40d2353..817069860 100644
+--- a/tests/test-backend-ops.cpp
++++ b/tests/test-backend-ops.cpp
+@@ -9372,6 +9372,11 @@ static std::vector> make_test_cases_eval() {
+ }
+
+ test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 32, 128, 1, 1));
++ // Tensor-core chunked-GDN prefill path (S_v==128): multi-chunk (C=16) coverage,
++ // incl. a tail chunk (100 = 6*16+4) and multi-seq. Exercised via GDN_CHUNK_MIN + GDN_TC.
++ test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 32, 128, 64, 1));
++ test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 32, 128, 100, 1));
++ test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 32, 128, 128, 2));
+ test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 32, 16, 1, 1));
+ test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 32, 16, 1, 1, 1, true, true));
+ test_cases.emplace_back(new test_gated_delta_net(GGML_TYPE_F32, 32, 16, 1, 1, 1, false, true));
+--
+2.43.0
+
diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0048-test-paged-cover-MoE-swiglu-down-chain.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0048-test-paged-cover-MoE-swiglu-down-chain.patch
new file mode 100644
index 000000000000..8a0f6ec8f15f
--- /dev/null
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0048-test-paged-cover-MoE-swiglu-down-chain.patch
@@ -0,0 +1,124 @@
+From fd920cf8a7fe9cc7753cd0640411ce771edfeaca Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto
+Date: Tue, 30 Jun 2026 23:18:38 +0000
+Subject: [PATCH 48/52] test(paged): cover MoE swiglu down chain
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto
+---
+ tests/test-backend-ops.cpp | 92 ++++++++++++++++++++++++++++++++++++++
+ 1 file changed, 92 insertions(+)
+
+diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
+index 817069860..aeca64802 100644
+--- a/tests/test-backend-ops.cpp
++++ b/tests/test-backend-ops.cpp
+@@ -4447,6 +4447,91 @@ struct test_mul_mat_id_fusion : public test_case {
+ }
+ };
+
++// Merged MoE gate_up -> SWIGLU -> down MUL_MAT_ID chain.
++struct test_moe_swiglu_down : public test_case { ++ const ggml_type type_a; ++ const int n_mats; ++ const int n_used; ++ const int64_t n_ff; ++ const int64_t n_tokens; ++ const int64_t n_embd; ++ ++ std::string vars() override { ++ return VARS_TO_STR6(type_a, n_mats, n_used, n_ff, n_tokens, n_embd); ++ } ++ ++ double max_nmse_err() override { ++ return 5e-4; ++ } ++ ++ double max_nmse_err(ggml_backend_t backend) override { ++ if ((type_a == GGML_TYPE_MXFP4 || type_a == GGML_TYPE_NVFP4) && backend_has_feature(backend, "BLACKWELL_NATIVE_FP4")) { ++ // This whole-graph gate compounds two native-FP4 MUL_MAT_ID ops with ++ // SWIGLU between them, so it needs slightly more room than the ++ // single-op FP4 MUL_MAT_ID gate. ++ return 2.5e-2; ++ } ++ return max_nmse_err(); ++ } ++ ++ uint64_t op_flops(ggml_tensor * t) override { ++ GGML_UNUSED(t); ++ return 2 * n_ff * n_embd * n_tokens * n_used * 3; ++ } ++ ++ test_moe_swiglu_down(ggml_type type_a = GGML_TYPE_F32, int n_mats = 128, int n_used = 8, ++ int64_t n_ff = 768, int64_t n_tokens = 128, int64_t n_embd = 2048) ++ : type_a(type_a), n_mats(n_mats), n_used(n_used), n_ff(n_ff), n_tokens(n_tokens), n_embd(n_embd) { ++ GGML_ASSERT(n_used <= n_mats); ++ } ++ ++ ggml_tensor * build_graph(ggml_context * ctx) override { ++ ggml_tensor * gate_up = ggml_new_tensor_3d(ctx, type_a, n_embd, 2 * n_ff, n_mats); ++ ggml_set_name(gate_up, "gate_up"); ++ ++ ggml_tensor * down = ggml_new_tensor_3d(ctx, type_a, n_ff, n_embd, n_mats); ++ ggml_set_name(down, "down"); ++ ++ ggml_tensor * ids = ggml_new_tensor_2d(ctx, GGML_TYPE_I32, n_mats, n_tokens); ++ ggml_set_name(ids, "ids"); ++ if (n_used != n_mats) { ++ ids = ggml_view_2d(ctx, ids, n_used, n_tokens, ids->nb[1], 0); ++ ggml_set_name(ids, "view_of_ids"); ++ } ++ ++ ggml_tensor * cur = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, n_embd, n_used, n_tokens); ++ ggml_set_name(cur, "cur"); ++ ++ ggml_tensor * gate_up_out = ggml_mul_mat_id(ctx, gate_up, cur, ids); ++ ggml_set_name(gate_up_out, 
"gate_up_out"); ++ ++ ggml_tensor * gate = ggml_view_3d(ctx, gate_up_out, n_ff, n_used, n_tokens, gate_up_out->nb[1], gate_up_out->nb[2], 0); ++ ggml_set_name(gate, "gate"); ++ ++ ggml_tensor * up = ggml_view_3d(ctx, gate_up_out, n_ff, n_used, n_tokens, gate_up_out->nb[1], gate_up_out->nb[2], n_ff * gate_up_out->nb[0]); ++ ggml_set_name(up, "up"); ++ ++ ggml_tensor * act = ggml_swiglu_split(ctx, gate, up); ++ ggml_set_name(act, "swiglu"); ++ ++ ggml_tensor * out = ggml_mul_mat_id(ctx, down, act, ids); ++ ggml_set_name(out, "out"); ++ ++ return out; ++ } ++ ++ void initialize_tensors(ggml_context * ctx) override { ++ init_mul_mat_id_tensors(ctx, n_mats); ++ } ++ ++ bool run_whole_graph() override { return true; } ++ ++ std::string op_desc(ggml_tensor * t) override { ++ GGML_UNUSED(t); ++ return "MOE_SWIGLU_DOWN"; ++ } ++}; ++ + // GGML_OP_OUT_PROD + struct test_out_prod : public test_case { + const ggml_type type_a; +@@ -8759,6 +8844,13 @@ static std::vector> make_test_cases_eval() { + } + } + ++ // [paged Phase 7] Merged MoE gate_up -> SWIGLU -> down projection gate for the ++ // serving candidate that fuses SWIGLU into NVFP4 down-input quantization. ++ test_cases.emplace_back(new test_moe_swiglu_down(GGML_TYPE_F32, 8, 2, 32, 8, 64)); ++ for (int n : {16, 33, 64, 128, 130, 200}) { ++ test_cases.emplace_back(new test_moe_swiglu_down(GGML_TYPE_NVFP4, 128, 8, 768, n, 2048)); ++ } ++ + // [paged P0 / track B] NVFP4/MXFP4 dense decode-shape mmq_y-down bit-exact gate. + // The dense FP4 weight GEMM is the track-B target; P1 lowers mmq_y (the weight-row tile) on the + // NVFP4 decode path to raise resident-CTA occupancy. 
mmq_y is a pure N-row tiling knob, so a
+--
+2.43.0
+
diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0049-test-paged-cover-MoE-weighted-combine-chain.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0049-test-paged-cover-MoE-weighted-combine-chain.patch
new file mode 100644
index 000000000000..2239f72c9f79
--- /dev/null
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0049-test-paged-cover-MoE-weighted-combine-chain.patch
@@ -0,0 +1,122 @@
+From a85c1e098e22eb587fd80220986f35a8d6e11300 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto
+Date: Tue, 30 Jun 2026 23:50:33 +0000
+Subject: [PATCH 49/52] test(paged): cover MoE weighted combine chain
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto
+---
+ tests/test-backend-ops.cpp | 90 ++++++++++++++++++++++++++++++++++++++
+ 1 file changed, 90 insertions(+)
+
+diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
+index aeca64802..71740ce9f 100644
+--- a/tests/test-backend-ops.cpp
++++ b/tests/test-backend-ops.cpp
+@@ -4532,6 +4532,89 @@ struct test_moe_swiglu_down : public test_case {
+ }
+ };
+
++// MoE down projection -> router-weight multiply -> rank-ordered expert add.
++struct test_moe_weighted_combine : public test_case { ++ const ggml_type type_a; ++ const int n_mats; ++ const int n_used; ++ const int64_t n_ff; ++ const int64_t n_tokens; ++ const int64_t n_embd; ++ ++ std::string vars() override { ++ return VARS_TO_STR6(type_a, n_mats, n_used, n_ff, n_tokens, n_embd); ++ } ++ ++ double max_nmse_err() override { ++ return 5e-4; ++ } ++ ++ double max_nmse_err(ggml_backend_t backend) override { ++ if ((type_a == GGML_TYPE_MXFP4 || type_a == GGML_TYPE_NVFP4) && backend_has_feature(backend, "BLACKWELL_NATIVE_FP4")) { ++ return 2e-2; ++ } ++ return max_nmse_err(); ++ } ++ ++ uint64_t op_flops(ggml_tensor * t) override { ++ GGML_UNUSED(t); ++ return 2 * n_ff * n_embd * n_tokens * n_used + 2 * n_embd * n_tokens * n_used; ++ } ++ ++ test_moe_weighted_combine(ggml_type type_a = GGML_TYPE_F32, int n_mats = 128, int n_used = 8, ++ int64_t n_ff = 768, int64_t n_tokens = 128, int64_t n_embd = 2048) ++ : type_a(type_a), n_mats(n_mats), n_used(n_used), n_ff(n_ff), n_tokens(n_tokens), n_embd(n_embd) { ++ GGML_ASSERT(n_used <= n_mats); ++ } ++ ++ ggml_tensor * build_graph(ggml_context * ctx) override { ++ ggml_tensor * down = ggml_new_tensor_3d(ctx, type_a, n_ff, n_embd, n_mats); ++ ggml_set_name(down, "down"); ++ ++ ggml_tensor * ids = ggml_new_tensor_2d(ctx, GGML_TYPE_I32, n_mats, n_tokens); ++ ggml_set_name(ids, "ids"); ++ if (n_used != n_mats) { ++ ids = ggml_view_2d(ctx, ids, n_used, n_tokens, ids->nb[1], 0); ++ ggml_set_name(ids, "view_of_ids"); ++ } ++ ++ ggml_tensor * act = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, n_ff, n_used, n_tokens); ++ ggml_set_name(act, "act"); ++ ++ ggml_tensor * weights = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 1, n_used, n_tokens); ++ ggml_set_name(weights, "weights"); ++ ++ ggml_tensor * experts = ggml_mul_mat_id(ctx, down, act, ids); ++ ggml_set_name(experts, "down_out"); ++ ++ experts = ggml_mul(ctx, experts, weights); ++ ggml_set_name(experts, "weighted"); ++ ++ ggml_tensor * out = ggml_view_2d(ctx, experts, 
n_embd, n_tokens, experts->nb[2], 0); ++ ggml_set_name(out, "rank_0"); ++ ++ for (int i = 1; i < n_used; ++i) { ++ ggml_tensor * rank = ggml_view_2d(ctx, experts, n_embd, n_tokens, experts->nb[2], i*experts->nb[1]); ++ ggml_set_name(rank, "rank_i"); ++ out = ggml_add(ctx, out, rank); ++ ggml_set_name(out, "rank_sum"); ++ } ++ ++ return out; ++ } ++ ++ void initialize_tensors(ggml_context * ctx) override { ++ init_mul_mat_id_tensors(ctx, n_mats); ++ } ++ ++ bool run_whole_graph() override { return true; } ++ ++ std::string op_desc(ggml_tensor * t) override { ++ GGML_UNUSED(t); ++ return "MOE_WEIGHTED_COMBINE"; ++ } ++}; ++ + // GGML_OP_OUT_PROD + struct test_out_prod : public test_case { + const ggml_type type_a; +@@ -8851,6 +8934,13 @@ static std::vector> make_test_cases_eval() { + test_cases.emplace_back(new test_moe_swiglu_down(GGML_TYPE_NVFP4, 128, 8, 768, n, 2048)); + } + ++ // [paged Phase 7] MoE down projection -> router-weight multiply -> rank-ordered ++ // expert add gate for the weighted-combine fusion candidate. ++ test_cases.emplace_back(new test_moe_weighted_combine(GGML_TYPE_F32, 8, 2, 32, 8, 64)); ++ for (int n : {16, 33, 64, 128, 130, 200}) { ++ test_cases.emplace_back(new test_moe_weighted_combine(GGML_TYPE_NVFP4, 128, 8, 768, n, 2048)); ++ } ++ + // [paged P0 / track B] NVFP4/MXFP4 dense decode-shape mmq_y-down bit-exact gate. + // The dense FP4 weight GEMM is the track-B target; P1 lowers mmq_y (the weight-row tile) on the + // NVFP4 decode path to raise resident-CTA occupancy. 
mmq_y is a pure N-row tiling knob, so a
+--
+2.43.0
+
diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0050-test-paged-cover-ragged-MoE-dispatch.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0050-test-paged-cover-ragged-MoE-dispatch.patch
new file mode 100644
index 000000000000..ab9554ef4d4e
--- /dev/null
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0050-test-paged-cover-ragged-MoE-dispatch.patch
@@ -0,0 +1,150 @@
+From 2fed6aacff14537864bbf754c7552740131d4eaf Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto
+Date: Wed, 1 Jul 2026 00:39:52 +0000
+Subject: [PATCH 50/52] test(paged): cover ragged MoE dispatch
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto
+---
+ tests/test-backend-ops.cpp | 118 +++++++++++++++++++++++++++++++++++++
+ 1 file changed, 118 insertions(+)
+
+diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
+index 71740ce9f..8c41ae56a 100644
+--- a/tests/test-backend-ops.cpp
++++ b/tests/test-backend-ops.cpp
+@@ -4615,6 +4615,115 @@ struct test_moe_weighted_combine : public test_case {
+ }
+ };
+
++// Ragged 256-expert MoE dispatch gate for serving decode.
++struct test_mul_mat_id_ragged_moe : public test_case { ++ const ggml_type type_a; ++ const int n_mats; ++ const int n_used; ++ const int64_t m; ++ const int64_t n; ++ const int64_t k; ++ ++ std::string vars() override { ++ return VARS_TO_STR6(type_a, n_mats, n_used, m, n, k); ++ } ++ ++ double max_nmse_err() override { ++ return 5e-4; ++ } ++ ++ double max_nmse_err(ggml_backend_t backend) override { ++ if ((type_a == GGML_TYPE_MXFP4 || type_a == GGML_TYPE_NVFP4) && backend_has_feature(backend, "BLACKWELL_NATIVE_FP4")) { ++ return 2e-2; ++ } ++ return max_nmse_err(); ++ } ++ ++ uint64_t op_flops(ggml_tensor * t) override { ++ GGML_UNUSED(t); ++ return 2 * m * k * n * n_used; ++ } ++ ++ test_mul_mat_id_ragged_moe(ggml_type type_a = GGML_TYPE_NVFP4, int n_mats = 256, int n_used = 8, ++ int64_t m = 768, int64_t n = 128, int64_t k = 2048) ++ : type_a(type_a), n_mats(n_mats), n_used(n_used), m(m), n(n), k(k) { ++ GGML_ASSERT(n_used <= n_mats); ++ } ++ ++ ggml_tensor * build_graph(ggml_context * ctx) override { ++ ggml_tensor * as = ggml_new_tensor_3d(ctx, type_a, k, m, n_mats); ++ ggml_set_name(as, "as"); ++ ++ ggml_tensor * ids = ggml_new_tensor_2d(ctx, GGML_TYPE_I32, n_mats, n); ++ ggml_set_name(ids, "ids"); ++ if (n_used != n_mats) { ++ ids = ggml_view_2d(ctx, ids, n_used, n, ids->nb[1], 0); ++ ggml_set_name(ids, "view_of_ids"); ++ } ++ ++ ggml_tensor * b = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, k, n_used, n); ++ ggml_set_name(b, "b"); ++ ++ ggml_tensor * out = ggml_mul_mat_id(ctx, as, b, ids); ++ ggml_set_name(out, "out"); ++ ++ return out; ++ } ++ ++ void initialize_tensors(ggml_context * ctx) override { ++ for (ggml_tensor * t = ggml_get_first_tensor(ctx); t != nullptr; t = ggml_get_next_tensor(ctx, t)) { ++ if (ggml_is_view_op(t->op)) { ++ continue; ++ } ++ if (t->type != GGML_TYPE_I32) { ++ init_tensor_uniform(t); ++ continue; ++ } ++ ++ std::vector data(t->ne[0]); ++ for (int64_t token = 0; token < ggml_nrows(t); ++token) { ++ for (int64_t r = 0; r < t->ne[0]; 
++r) { ++ data[r] = (int32_t) ((token * 17 + r * 31) % n_mats); ++ } ++ ++ if (n_used >= 8) { ++ // Skew rank 0 heavily to expert 0, exercise max expert id, ++ // leave many experts empty, and preserve unique top-k ids. ++ std::vector used(n_mats, false); ++ const int64_t seeds[8] = { ++ 0, ++ 1 + token % 4, ++ 4 + (token * 3) % 8, ++ n_mats - 1, ++ token * 5 + 7, ++ token * 7 + 11, ++ token * 13 + 19, ++ token * 29 + 23, ++ }; ++ ++ for (int64_t r = 0; r < 8; ++r) { ++ int32_t id = (int32_t) (seeds[r] % n_mats); ++ while (used[id]) { ++ id = (id + 1) % n_mats; ++ } ++ data[r] = id; ++ used[id] = true; ++ } ++ } ++ ++ ggml_backend_tensor_set(t, data.data(), token * t->nb[1], t->ne[0] * sizeof(int32_t)); ++ } ++ } ++ } ++ ++ bool run_whole_graph() override { return true; } ++ ++ std::string op_desc(ggml_tensor * t) override { ++ GGML_UNUSED(t); ++ return "MUL_MAT_ID_RAGGED_MOE"; ++ } ++}; ++ + // GGML_OP_OUT_PROD + struct test_out_prod : public test_case { + const ggml_type type_a; +@@ -8941,6 +9050,15 @@ static std::vector> make_test_cases_eval() { + test_cases.emplace_back(new test_moe_weighted_combine(GGML_TYPE_NVFP4, 128, 8, 768, n, 2048)); + } + ++ // [paged Phase 8] Ragged 256-expert MoE dispatch gate for live serving decode. ++ // Deterministic ids skew many tokens into a few hot experts, include expert 255, ++ // and leave many experts empty. n=1 covers single-token decode; n=257 crosses ++ // the MMVQ/MMID batch cutoff while preserving top-8 routing. ++ test_cases.emplace_back(new test_mul_mat_id_ragged_moe(GGML_TYPE_F32, 16, 8, 32, 8, 64)); ++ for (int n : {1, 8, 33, 128, 257}) { ++ test_cases.emplace_back(new test_mul_mat_id_ragged_moe(GGML_TYPE_NVFP4, 256, 8, 768, n, 2048)); ++ } ++ + // [paged P0 / track B] NVFP4/MXFP4 dense decode-shape mmq_y-down bit-exact gate. + // The dense FP4 weight GEMM is the track-B target; P1 lowers mmq_y (the weight-row tile) on the + // NVFP4 decode path to raise resident-CTA occupancy. 
mmq_y is a pure N-row tiling knob, so a
+--
+2.43.0
+
diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0051-fix-speculative-disable-backend-sampling-for-MTP-dra.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0051-fix-speculative-disable-backend-sampling-for-MTP-dra.patch
new file mode 100644
index 000000000000..089e3c807077
--- /dev/null
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0051-fix-speculative-disable-backend-sampling-for-MTP-dra.patch
@@ -0,0 +1,32 @@
+From f1d976f06fb92655106709256dd093ffefb85e2b Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto
+Date: Wed, 1 Jul 2026 00:50:36 +0000
+Subject: [PATCH 51/52] fix(speculative): disable backend sampling for MTP
+ drafts
+
+Assisted-by: Claude:opus-4.8 [Claude Code]
+Signed-off-by: Ettore Di Giacinto
+---
+ common/speculative.cpp | 6 ++++++
+ 1 file changed, 6 insertions(+)
+
+diff --git a/common/speculative.cpp b/common/speculative.cpp
+index c922a3f59..626ede396 100644
+--- a/common/speculative.cpp
++++ b/common/speculative.cpp
+@@ -952,6 +952,12 @@ struct common_speculative_impl_draft_mtp : public common_speculative_impl {
+ ctx_dft ? "yes" : "no",
+ common_speculative_get_devices_str(this->params.devices).c_str());
+
++ if (this->params.backend_sampling) {
++ LOG_WRN("%s: backend draft sampling is disabled for MTP; verification batches can request multiple output rows per sequence\n",
++ __func__);
++ this->params.backend_sampling = false;
++ }
++
+ const int32_t n_b = (int32_t) llama_n_batch(ctx_dft);
+ batch = llama_batch_init(/*n_tokens=*/ n_b, /*embd=*/ n_embd, /*n_seq_max=*/ 1);
+ // llama_batch_init allocates only one of token/embd; MTP needs both.
+--
+2.43.0
+
diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0052-feat-paged-whole-pattern-MoE-matcher-routed-FFN-fuse.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0052-feat-paged-whole-pattern-MoE-matcher-routed-FFN-fuse.patch
new file mode 100644
index 000000000000..10acac46f054
--- /dev/null
+++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0052-feat-paged-whole-pattern-MoE-matcher-routed-FFN-fuse.patch
@@ -0,0 +1,933 @@
+From 1edddc8fe93bb2fec5f831bbde5df2b7480a7b05 Mon Sep 17 00:00:00 2001
+From: Ettore Di Giacinto
+Date: Thu, 2 Jul 2026 12:15:38 +0200
+Subject: [PATCH 52/52] feat(paged): whole-pattern MoE matcher + routed-FFN
+ fused NVFP4-quant down MMQ
+
+Add the routed-FFN fused-quant line for the paged NVFP4 MoE decode step,
+all default-off and md5-clean, gating a fused SwiGLU-to-NVFP4-quant plus a
+raw pre-quantized down-projection MMQ that skips the intermediate F32
+materialize + re-quantize of the standard gate_up -> SwiGLU -> down chain.
+
+Pieces (all guarded, no effect unless explicitly enabled):
+
+- Whole-pattern MoE matcher + executor hook in ggml_cuda_try_fuse
+ (LLAMA_MOE_WHOLE_PATTERN_EXEC and the *_TRACE/_EARLY_TRACE diagnostics).
+ Detects the gate_up(MUL_MAT_ID) -> view/view -> SwiGLU(GLU) -> down(MUL_MAT_ID)
+ sub-graph early in the fusion pass and, when engaged, runs the whole chain
+ through a single executor instead of node-by-node.
+
+- Routed-FFN PoC scaffold ggml/src/ggml-cuda/moe-ffn.{cu,cuh} + a narrow hook
+ (LLAMA_MOE_ROUTED_FFN_POC). ggml_cuda_compute_forward is de-static-ed so the
+ executor translation unit can drive the standard op path for the fallback
+ legs. The executor tries the fused-quant path first, else falls back to the
+ stock compute_forward for glu + down (bit-identical to default).
+
+- Fused SwiGLU-to-NVFP4-quant + raw down MMQ (LLAMA_MOE_ROUTED_FFN_FUSED_QUANT):
+ moe_swiglu_nvfp4_quant_kernel writes block_fp4_mmq activations directly, then
+ ggml_cuda_mul_mat_q_moe_quantized (with the local ggml_cuda_mmq_ids_meta
+ refactor of the expert-sorted ids/bounds prep) runs the down GEMM on the
+ pre-quantized rows. Native FP4 (Blackwell) only; NVFP4 down weights only.
+
+Gated on GB10 (sm_121a), before/after this commit:
+- Canonical default-path greedy md5 unchanged: MoE q36-35b-a3b-nvfp4
+ 8cb0ce23777bf55f92f63d0292c756b0, dense q36-27b-nvfp4
+ 5951a5b4d624ce891e22ab5fca9bc439.
+- md5-clean opt-in: LLAMA_MOE_ROUTED_FFN_POC=1 LLAMA_MOE_ROUTED_FFN_FUSED_QUANT=1
+ keeps the MoE md5 byte-identical (8cb0ce23...).
+- test-backend-ops: MUL_MAT 1146/1146, MUL_MAT_ID 806/806, GATED_DELTA_NET 46/46,
+ and the MoE sentinels MOE_SWIGLU_DOWN 7/7 + MUL_MAT_ID_RAGGED_MOE 6/6 pass both
+ default and opt-in. Opt-in emits exactly six route=mmq_moe_quantized_raw markers
+ with zero mmq_moe_sorted_raw launches (fused path provably engaged).
+- Serving effect is flat-to-slightly-positive and not a shipped default:
+ decode agg 326.9 -> 332.7 t/s, mmq_nvfp4 6009 -> 5915 ms, aggregate flat
+ (~/bench/phase135_routed_ffn_fused_quant_serving/20260702_082102).
+
+The rejected/neutral neighbours of this line (Phase133 sorted-F32 down,
+Phase134 fused-SWIGLU-only, Phase138 finalize/weighted-combine fusion, the
+W4A16 grouped-tile pack/tune/pad line, GPU-sort, boundary/layout/quant traces)
+are deliberately excluded and carry no markers in this tree.
+ +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + ggml/src/ggml-cuda/ggml-cuda.cu | 304 +++++++++++++++++++++++++++++++- + ggml/src/ggml-cuda/mmq.cu | 148 ++++++++++++++++ + ggml/src/ggml-cuda/mmq.cuh | 25 +++ + ggml/src/ggml-cuda/moe-ffn.cu | 296 +++++++++++++++++++++++++++++++ + ggml/src/ggml-cuda/moe-ffn.cuh | 24 +++ + 5 files changed, 796 insertions(+), 1 deletion(-) + create mode 100644 ggml/src/ggml-cuda/moe-ffn.cu + create mode 100644 ggml/src/ggml-cuda/moe-ffn.cuh + +diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu +index 374949f25..ef1bdc3b4 100644 +--- a/ggml/src/ggml-cuda/ggml-cuda.cu ++++ b/ggml/src/ggml-cuda/ggml-cuda.cu +@@ -32,6 +32,7 @@ + #include "ggml-cuda/im2col.cuh" + #include "ggml-cuda/mmf.cuh" + #include "ggml-cuda/mmq.cuh" ++#include "ggml-cuda/moe-ffn.cuh" + #include "ggml-cuda/mmvf.cuh" + #include "ggml-cuda/mmvq.cuh" + #include "ggml-cuda/norm.cuh" +@@ -2854,7 +2855,7 @@ static void ggml_cuda_mul_mat_id(ggml_backend_cuda_context & ctx, ggml_tensor * + nb1, nb2, nb3, stream); + } + +-static bool ggml_cuda_compute_forward(ggml_backend_cuda_context & ctx, struct ggml_tensor * dst) { ++bool ggml_cuda_compute_forward(ggml_backend_cuda_context & ctx, struct ggml_tensor * dst) { + switch (dst->op) { + case GGML_OP_ARGMAX: + ggml_cuda_argmax(ctx, dst); +@@ -4032,6 +4033,201 @@ static bool ggml_cuda_can_fuse(const struct ggml_cgraph * cgraph, + return false; + } + ++static inline const char * ggml_cuda_moe_wp_trace_tensor_name(const ggml_tensor * t) { ++ return t != nullptr && t->name[0] != '\0' ? t->name : "-"; ++} ++ ++static inline int ggml_cuda_moe_whole_pattern_trace_limit() { ++ static const int value = []() { ++ const char * s = getenv("LLAMA_MOE_WHOLE_PATTERN_TRACE"); ++ if (s == nullptr || strcmp(s, "0") == 0) { ++ return 0; ++ } ++ const int parsed = atoi(s); ++ return parsed > 0 ? 
parsed : 128; ++ }(); ++ ++ return value; ++} ++ ++static inline bool ggml_cuda_moe_whole_pattern_trace_take(std::atomic<int> & counter) { ++ const int trace_limit = ggml_cuda_moe_whole_pattern_trace_limit(); ++ if (trace_limit <= 0) { ++ return false; ++ } ++ ++ const int trace_idx = counter.fetch_add(1, std::memory_order_relaxed); ++ return trace_idx < trace_limit; ++} ++ ++static inline int ggml_cuda_moe_whole_pattern_early_trace_limit() { ++ static const int value = []() { ++ const char * s = getenv("LLAMA_MOE_WHOLE_PATTERN_EARLY_TRACE"); ++ if (s == nullptr || strcmp(s, "0") == 0) { ++ return 0; ++ } ++ const int parsed = atoi(s); ++ return parsed > 0 ? parsed : 128; ++ }(); ++ ++ return value; ++} ++ ++static inline bool ggml_cuda_moe_whole_pattern_exec_enabled() { ++ static const bool value = []() { ++ const char * s = getenv("LLAMA_MOE_WHOLE_PATTERN_EXEC"); ++ return s != nullptr && atoi(s) != 0; ++ }(); ++ ++ return value; ++} ++ ++static inline int ggml_cuda_moe_whole_pattern_exec_trace_limit() { ++ static const int value = []() { ++ const char * s = getenv("LLAMA_MOE_WHOLE_PATTERN_EXEC_TRACE"); ++ if (s == nullptr || strcmp(s, "0") == 0) { ++ return 0; ++ } ++ const int parsed = atoi(s); ++ return parsed > 0 ?
parsed : 128; ++ }(); ++ ++ return value; ++} ++ ++static inline bool ggml_cuda_moe_whole_pattern_exec_trace_take(std::atomic<int> & counter) { ++ const int trace_limit = ggml_cuda_moe_whole_pattern_exec_trace_limit(); ++ if (trace_limit <= 0) { ++ return false; ++ } ++ ++ const int trace_idx = counter.fetch_add(1, std::memory_order_relaxed); ++ return trace_idx < trace_limit; ++} ++ ++struct ggml_cuda_moe_whole_pattern { ++ const ggml_tensor * gate_up = nullptr; ++ const ggml_tensor * gate = nullptr; ++ const ggml_tensor * up = nullptr; ++ const ggml_tensor * glu = nullptr; ++ const ggml_tensor * down = nullptr; ++ const ggml_tensor * ids = nullptr; ++ ++ bool view_pair = false; ++ bool ids_match = false; ++ bool swiglu = false; ++ bool supported_type = false; ++ bool supported = false; ++}; ++ ++static ggml_cuda_moe_whole_pattern ggml_cuda_moe_whole_pattern_detect(const ggml_tensor * glu, const ggml_tensor * down) { ++ ggml_cuda_moe_whole_pattern pattern{}; ++ pattern.glu = glu; ++ pattern.down = down; ++ ++ if (glu == nullptr || down == nullptr || glu->op != GGML_OP_GLU || down->op != GGML_OP_MUL_MAT_ID) { ++ return pattern; ++ } ++ ++ pattern.gate = glu->src[0]; ++ pattern.up = glu->src[1]; ++ pattern.ids = down->src[2]; ++ ++ pattern.view_pair = pattern.gate != nullptr && pattern.up != nullptr && ++ pattern.gate->op == GGML_OP_VIEW && pattern.up->op == GGML_OP_VIEW && ++ pattern.gate->view_src != nullptr && pattern.gate->view_src == pattern.up->view_src; ++ if (!pattern.view_pair) { ++ return pattern; ++ } ++ ++ pattern.gate_up = pattern.gate->view_src; ++ if (pattern.gate_up == nullptr || pattern.gate_up->op != GGML_OP_MUL_MAT_ID) { ++ return pattern; ++ } ++ ++ pattern.ids_match = pattern.gate_up->src[2] == pattern.ids; ++ pattern.swiglu = ggml_get_glu_op(glu) == GGML_GLU_OP_SWIGLU; ++ pattern.supported_type = down->src[0] != nullptr && ++ (down->src[0]->type == GGML_TYPE_NVFP4 || down->src[0]->type == GGML_TYPE_MXFP4); ++ pattern.supported = pattern.ids_match &&
pattern.swiglu && pattern.supported_type; ++ ++ return pattern; ++} ++ ++static ggml_cuda_moe_whole_pattern ggml_cuda_moe_whole_pattern_detect_early(const ggml_cgraph * cgraph, int i) { ++ ggml_cuda_moe_whole_pattern pattern{}; ++ ++ if (cgraph == nullptr || i + 4 >= cgraph->n_nodes) { ++ return pattern; ++ } ++ ++ const ggml_tensor * gate_up = cgraph->nodes[i + 0]; ++ const ggml_tensor * view0 = cgraph->nodes[i + 1]; ++ const ggml_tensor * view1 = cgraph->nodes[i + 2]; ++ const ggml_tensor * glu = cgraph->nodes[i + 3]; ++ const ggml_tensor * down = cgraph->nodes[i + 4]; ++ ++ pattern.gate_up = gate_up; ++ pattern.glu = glu; ++ pattern.down = down; ++ ++ if (gate_up == nullptr || view0 == nullptr || view1 == nullptr || glu == nullptr || down == nullptr || ++ gate_up->op != GGML_OP_MUL_MAT_ID || view0->op != GGML_OP_VIEW || view1->op != GGML_OP_VIEW || ++ glu->op != GGML_OP_GLU || down->op != GGML_OP_MUL_MAT_ID) { ++ return pattern; ++ } ++ ++ pattern.view_pair = view0->view_src == gate_up && view1->view_src == gate_up; ++ if (!pattern.view_pair) { ++ return pattern; ++ } ++ ++ if (glu->src[0] == view0 && glu->src[1] == view1) { ++ pattern.gate = view0; ++ pattern.up = view1; ++ } else if (glu->src[0] == view1 && glu->src[1] == view0) { ++ pattern.gate = view1; ++ pattern.up = view0; ++ } else { ++ return pattern; ++ } ++ ++ if (down->src[1] != glu) { ++ return pattern; ++ } ++ ++ pattern.ids = down->src[2]; ++ pattern.ids_match = gate_up->src[2] == pattern.ids; ++ pattern.swiglu = ggml_get_glu_op(glu) == GGML_GLU_OP_SWIGLU; ++ pattern.supported_type = down->src[0] != nullptr && ++ (down->src[0]->type == GGML_TYPE_NVFP4 || down->src[0]->type == GGML_TYPE_MXFP4); ++ pattern.supported = pattern.ids_match && pattern.swiglu && pattern.supported_type; ++ ++ return pattern; ++} ++ ++static bool ggml_cuda_moe_whole_pattern_exec_proof( ++ ggml_backend_cuda_context * cuda_ctx, ++ const ggml_cuda_moe_whole_pattern & pattern) { ++ GGML_ASSERT(cuda_ctx != nullptr); ++ 
GGML_ASSERT(pattern.supported); ++ GGML_ASSERT(pattern.gate_up != nullptr); ++ GGML_ASSERT(pattern.glu != nullptr); ++ GGML_ASSERT(pattern.down != nullptr); ++ ++ if (!ggml_cuda_compute_forward(*cuda_ctx, const_cast<ggml_tensor *>(pattern.gate_up))) { ++ return false; ++ } ++ if (!ggml_cuda_compute_forward(*cuda_ctx, const_cast<ggml_tensor *>(pattern.glu))) { ++ return false; ++ } ++ if (!ggml_cuda_compute_forward(*cuda_ctx, const_cast<ggml_tensor *>(pattern.down))) { ++ return false; ++ } ++ ++ return true; ++} ++ + // try and fuse nodes and return the number of nodes to skip + static int ggml_cuda_try_fuse(ggml_backend_cuda_context * cuda_ctx, ggml_cgraph * cgraph, int i) { + +@@ -4042,6 +4238,112 @@ static int ggml_cuda_try_fuse(ggml_backend_cuda_context * cuda_ctx, ggml_cgraph + + ggml_tensor * node = cgraph->nodes[i]; + ++ static std::atomic<int> moe_whole_pattern_early_trace_count{0}; ++ const bool routed_ffn_poc = ggml_cuda_moe_routed_ffn_poc_enabled(); ++ const bool whole_pattern_exec = ggml_cuda_moe_whole_pattern_exec_enabled(); ++ const int whole_pattern_early_trace_limit = ggml_cuda_moe_whole_pattern_early_trace_limit(); ++ if (node->op == GGML_OP_MUL_MAT_ID && ++ (routed_ffn_poc || whole_pattern_exec || whole_pattern_early_trace_limit > 0)) { ++ const ggml_cuda_moe_whole_pattern pattern = ggml_cuda_moe_whole_pattern_detect_early(cgraph, i); ++ if (pattern.view_pair) { ++ const int trace_idx = moe_whole_pattern_early_trace_count.fetch_add(1, std::memory_order_relaxed); ++ if (trace_idx < whole_pattern_early_trace_limit) { ++ const ggml_tensor * down_w = pattern.down != nullptr ? pattern.down->src[0] : nullptr; ++ const ggml_tensor * down_x = pattern.down != nullptr ? pattern.down->src[1] : nullptr; ++ fprintf(stderr, ++ "[LLAMA_MOE_WHOLE_PATTERN_EARLY] supported=%d skip_ready=%d gate_up=%s gate=%s up=%s glu=%s down=%s ids=%s type=%s" ++ " n_tokens=%" PRId64 " n_used=%" PRId64 " experts=%" PRId64 ++ " n_embd=%" PRId64 " n_ff=%" PRId64 ++ " ids_match=%d swiglu=%d\n", ++ pattern.supported ?
1 : 0, ++ pattern.supported ? 4 : 0, ++ ggml_cuda_moe_wp_trace_tensor_name(pattern.gate_up), ++ ggml_cuda_moe_wp_trace_tensor_name(pattern.gate), ++ ggml_cuda_moe_wp_trace_tensor_name(pattern.up), ++ ggml_cuda_moe_wp_trace_tensor_name(pattern.glu), ++ ggml_cuda_moe_wp_trace_tensor_name(pattern.down), ++ ggml_cuda_moe_wp_trace_tensor_name(pattern.ids), ++ down_w != nullptr ? ggml_type_name(down_w->type) : "-", ++ down_x != nullptr ? down_x->ne[2] : 0, ++ pattern.ids != nullptr ? pattern.ids->ne[0] : 0, ++ down_w != nullptr ? down_w->ne[2] : 0, ++ down_w != nullptr ? down_w->ne[1] : 0, ++ down_w != nullptr ? down_w->ne[0] : 0, ++ pattern.ids_match ? 1 : 0, ++ pattern.swiglu ? 1 : 0); ++ } ++ } ++ ++ const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc; ++ const bool poc_supported = routed_ffn_poc && ggml_cuda_moe_routed_ffn_poc_should_engage( ++ pattern.gate_up, pattern.gate, pattern.up, pattern.glu, pattern.down, pattern.ids, cc); ++ ++ if ((poc_supported || (whole_pattern_exec && pattern.supported))) { ++ const bool ok = poc_supported ? ++ ggml_cuda_moe_routed_ffn_poc( ++ *cuda_ctx, ++ const_cast<ggml_tensor *>(pattern.gate_up), ++ const_cast<ggml_tensor *>(pattern.gate), ++ const_cast<ggml_tensor *>(pattern.up), ++ const_cast<ggml_tensor *>(pattern.glu), ++ const_cast<ggml_tensor *>(pattern.down)) : ++ ggml_cuda_moe_whole_pattern_exec_proof(cuda_ctx, pattern); ++ GGML_ASSERT(ok); ++ ++ static std::atomic<int> moe_whole_pattern_exec_trace_count{0}; ++ if (ggml_cuda_moe_whole_pattern_exec_trace_take(moe_whole_pattern_exec_trace_count)) { ++ const ggml_tensor * down_w = pattern.down != nullptr ? pattern.down->src[0] : nullptr; ++ const ggml_tensor * down_x = pattern.down != nullptr ?
pattern.down->src[1] : nullptr; ++ fprintf(stderr, ++ "[LLAMA_MOE_WHOLE_PATTERN_EXEC] skip=4 gate_up=%s glu=%s down=%s ids=%s" ++ " n_tokens=%" PRId64 " n_used=%" PRId64 " experts=%" PRId64 "\n", ++ ggml_cuda_moe_wp_trace_tensor_name(pattern.gate_up), ++ ggml_cuda_moe_wp_trace_tensor_name(pattern.glu), ++ ggml_cuda_moe_wp_trace_tensor_name(pattern.down), ++ ggml_cuda_moe_wp_trace_tensor_name(pattern.ids), ++ down_x != nullptr ? down_x->ne[2] : 0, ++ pattern.ids != nullptr ? pattern.ids->ne[0] : 0, ++ down_w != nullptr ? down_w->ne[2] : 0); ++ } ++ ++ return 4; ++ } ++ } ++ ++ if (node->op == GGML_OP_GLU && i + 1 < cgraph->n_nodes && cgraph->nodes[i + 1]->op == GGML_OP_MUL_MAT_ID) { ++ static std::atomic<int> moe_whole_pattern_trace_count{0}; ++ const bool whole_trace = ggml_cuda_moe_whole_pattern_trace_take(moe_whole_pattern_trace_count); ++ ++ if (whole_trace) { ++ const ggml_tensor * down = cgraph->nodes[i + 1]; ++ const ggml_cuda_moe_whole_pattern pattern = ggml_cuda_moe_whole_pattern_detect(node, down); ++ ++ const ggml_tensor * down_w = pattern.down != nullptr ? pattern.down->src[0] : nullptr; ++ const ggml_tensor * down_x = pattern.down != nullptr ? pattern.down->src[1] : nullptr; ++ fprintf(stderr, ++ "[LLAMA_MOE_WHOLE_PATTERN] supported=%d gate_up=%s gate=%s up=%s glu=%s down=%s ids=%s type=%s" ++ " n_tokens=%" PRId64 " n_used=%" PRId64 " experts=%" PRId64 ++ " n_embd=%" PRId64 " n_ff=%" PRId64 ++ " view_pair=%d ids_match=%d swiglu=%d\n", ++ pattern.supported ? 1 : 0, ++ ggml_cuda_moe_wp_trace_tensor_name(pattern.gate_up), ++ ggml_cuda_moe_wp_trace_tensor_name(pattern.gate), ++ ggml_cuda_moe_wp_trace_tensor_name(pattern.up), ++ ggml_cuda_moe_wp_trace_tensor_name(pattern.glu), ++ ggml_cuda_moe_wp_trace_tensor_name(pattern.down), ++ ggml_cuda_moe_wp_trace_tensor_name(pattern.ids), ++ down_w != nullptr ? ggml_type_name(down_w->type) : "-", ++ down_x != nullptr ? down_x->ne[2] : 0, ++ pattern.ids != nullptr ? pattern.ids->ne[0] : 0, ++ down_w != nullptr ?
down_w->ne[2] : 0, ++ down_w != nullptr ? down_w->ne[1] : 0, ++ down_w != nullptr ? down_w->ne[0] : 0, ++ pattern.view_pair ? 1 : 0, ++ pattern.ids_match ? 1 : 0, ++ pattern.swiglu ? 1 : 0); ++ } ++ } ++ + //topk-moe + if (cgraph->nodes[i]->op == GGML_OP_UNARY || cgraph->nodes[i]->op == GGML_OP_SOFT_MAX || + cgraph->nodes[i]->op == GGML_OP_ARGSORT) { +diff --git a/ggml/src/ggml-cuda/mmq.cu b/ggml/src/ggml-cuda/mmq.cu +index dc5c2d198..d8f39d395 100644 +--- a/ggml/src/ggml-cuda/mmq.cu ++++ b/ggml/src/ggml-cuda/mmq.cu +@@ -1,4 +1,7 @@ + #include ++#include ++#include ++#include + #include "common.cuh" + #include "mmq.cuh" + #include "quantize.cuh" +@@ -75,6 +78,151 @@ static void ggml_cuda_mul_mat_q_switch_type(ggml_backend_cuda_context & ctx, con + } + } + ++static inline int ggml_cuda_quant_trace_limit() { ++ static const int value = []() { ++ const char * s = getenv("LLAMA_QUANT_TRACE"); ++ return s ? atoi(s) : 0; ++ }(); ++ ++ return value; ++} ++ ++static inline const char * ggml_cuda_quant_trace_tensor_name(const ggml_tensor * t) { ++ return t != nullptr && t->name[0] != '\0' ? 
t->name : "-"; ++} ++ ++static inline void ggml_cuda_quant_trace( ++ const char * route, const ggml_tensor * src0, const ggml_tensor * src1, ++ const ggml_tensor * ids, const ggml_tensor * dst, const int native_fp4, ++ const int dedup, const int gathered, const int64_t ne10, const int64_t ne10_padded, ++ const int64_t rows, const int64_t ne12, const int64_t n_expert_used) { ++ const int trace_limit = ggml_cuda_quant_trace_limit(); ++ if (trace_limit <= 0) { ++ return; ++ } ++ ++ static std::atomic<int> trace_count{0}; ++ const int trace_idx = trace_count.fetch_add(1, std::memory_order_relaxed); ++ if (trace_idx >= trace_limit) { ++ return; ++ } ++ ++ fprintf(stderr, ++ "[LLAMA_QUANT_TRACE] route=%s src0=%s src0_type=%s src1=%s src1_type=%s dst=%s dst_type=%s ids=%s " ++ "native_fp4=%d dedup=%d gathered=%d K=%" PRId64 " Kpad=%" PRId64 " rows=%" PRId64 ++ " ne12=%" PRId64 " experts=%" PRId64 "\n", ++ route, ggml_cuda_quant_trace_tensor_name(src0), src0 != nullptr ? ggml_type_name(src0->type) : "-", ++ ggml_cuda_quant_trace_tensor_name(src1), src1 != nullptr ? ggml_type_name(src1->type) : "-", ++ ggml_cuda_quant_trace_tensor_name(dst), dst != nullptr ?
ggml_type_name(dst->type) : "-", ++ ggml_cuda_quant_trace_tensor_name(ids), native_fp4, dedup, gathered, ++ ne10, ne10_padded, rows, ne12, n_expert_used); ++} ++ ++ggml_cuda_mmq_ids_meta::ggml_cuda_mmq_ids_meta(ggml_cuda_pool & pool, int64_t ne_get_rows, int64_t n_experts) ++ : ne_get_rows(ne_get_rows) { ++ alloc(pool, ne_get_rows, n_experts); ++} ++ ++void ggml_cuda_mmq_ids_meta::alloc(ggml_cuda_pool & pool, int64_t ne_get_rows, int64_t n_experts) { ++ ids_src1.alloc(pool, ne_get_rows); ++ ids_dst.alloc(pool, ne_get_rows); ++ expert_bounds.alloc(pool, n_experts + 1); ++ this->ne_get_rows = ne_get_rows; ++} ++ ++void ggml_cuda_mmq_ids_meta::build( ++ const ggml_tensor * ids, int64_t n_experts, int64_t n_tokens, ++ int64_t n_expert_used, int64_t nchannels_y, int64_t sis1, ++ cudaStream_t stream) { ++ GGML_ASSERT(ids->nb[0] == ggml_element_size(ids)); ++ const int si1 = ids->nb[1] / ggml_element_size(ids); ++ ++ ggml_cuda_launch_mm_ids_helper( ++ (const int32_t *) ids->data, ids_src1.get(), ids_dst.get(), expert_bounds.get(), ++ n_experts, n_tokens, n_expert_used, nchannels_y, si1, sis1, stream); ++ CUDA_CHECK(cudaGetLastError()); ++} ++ ++static void ggml_cuda_mul_mat_q_moe_quantized_impl( ++ ggml_backend_cuda_context & ctx, ++ const ggml_tensor * src0, const void * src1_q, ggml_tensor * dst, ++ const int32_t * ids_dst, const int32_t * expert_bounds, ++ int64_t n_tokens, int64_t n_expert_used, int64_t n_experts, ++ int64_t ncols_src1_padded) { ++ GGML_ASSERT( dst->type == GGML_TYPE_F32); ++ GGML_ASSERT( src1_q != nullptr); ++ GGML_ASSERT( ids_dst != nullptr); ++ GGML_ASSERT( expert_bounds != nullptr); ++ GGML_ASSERT( n_tokens > 0); ++ GGML_ASSERT( n_expert_used > 0); ++ GGML_ASSERT( n_experts > 0); ++ GGML_ASSERT( ncols_src1_padded > 0); ++ ++ const int64_t ne00 = src0->ne[0]; ++ const int64_t ne01 = src0->ne[1]; ++ const int64_t ne02 = src0->ne[2]; ++ const int64_t ne03 = src0->ne[3]; ++ const int64_t ne0 = dst->ne[0]; ++ const int64_t ne1 = dst->ne[1]; ++ const 
int64_t ne2 = dst->ne[2]; ++ const int64_t ne3 = dst->ne[3]; ++ const int64_t nb00 = src0->nb[0]; ++ const int64_t nb01 = src0->nb[1]; ++ const int64_t nb02 = src0->nb[2]; ++ const int64_t nb03 = src0->nb[3]; ++ const int64_t nb0 = dst->nb[0]; ++ const int64_t nb1 = dst->nb[1]; ++ const int64_t nb2 = dst->nb[2]; ++ const int64_t nb3 = dst->nb[3]; ++ ++ const size_t ts_src0 = ggml_type_size(src0->type); ++ const size_t ts_dst = ggml_type_size(dst->type); ++ ++ GGML_ASSERT(nb00 == (int64_t) ts_src0); ++ GGML_ASSERT(nb0 == (int64_t) ts_dst); ++ GGML_ASSERT(ne00 <= ncols_src1_padded); ++ GGML_ASSERT(ne01 == ne0); ++ GGML_ASSERT(ne02 == n_experts); ++ GGML_ASSERT(ne1 == n_expert_used); ++ GGML_ASSERT(ne2 == n_tokens); ++ ++ cudaStream_t stream = ctx.stream(); ++ const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc; ++ const bool use_stream_k = (GGML_CUDA_CC_IS_NVIDIA(cc) && ggml_cuda_highest_compiled_arch(cc) >= GGML_CUDA_CC_VOLTA) ++ || GGML_CUDA_CC_IS_CDNA(cc); ++ ++ const int64_t ne_get_rows = n_tokens * n_expert_used; ++ const int64_t s01 = nb01 / ts_src0; ++ const int64_t s1 = nb1 / ts_dst; ++ const int64_t s02 = nb02 / ts_src0; ++ const int64_t s2 = nb2 / ts_dst; ++ const int64_t s03 = nb03 / ts_src0; ++ const int64_t s3 = nb3 / ts_dst; ++ const int64_t s12 = ne_get_rows * ncols_src1_padded * sizeof(block_fp4_mmq) / (QK_K * sizeof(int)); ++ const int64_t s13 = n_tokens * s12; ++ ++ const mmq_args args = { ++ (const char *) src0->data, src0->type, (const int *) src1_q, ids_dst, expert_bounds, (float *) dst->data, ++ ne00, ne01, ne_get_rows, s01, ne_get_rows, s1, ++ n_experts, n_experts, s02, s12, s2, ++ ne03, ne3, s03, s13, s3, ++ use_stream_k, n_tokens}; ++ ++ ggml_cuda_quant_trace("mmq_moe_quantized_raw", src0, nullptr, nullptr, dst, 1, ++ 0, 0, ne00, ncols_src1_padded, ne_get_rows, n_tokens, n_expert_used); ++ ggml_cuda_mul_mat_q_switch_type(ctx, args, stream); ++} ++ ++void ggml_cuda_mul_mat_q_moe_quantized( ++ ggml_backend_cuda_context & ctx, ++ 
const ggml_tensor * src0, const void * src1_q, ggml_tensor * dst, ++ const int32_t * ids_dst, const int32_t * expert_bounds, ++ int64_t n_tokens, int64_t n_expert_used, int64_t n_experts, ++ int64_t ncols_src1_padded) { ++ ggml_cuda_mul_mat_q_moe_quantized_impl(ctx, src0, src1_q, dst, ids_dst, expert_bounds, ++ n_tokens, n_expert_used, n_experts, ncols_src1_padded); ++} ++ + void ggml_cuda_mul_mat_q( + ggml_backend_cuda_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, const ggml_tensor * ids, ggml_tensor * dst) { + GGML_ASSERT( src1->type == GGML_TYPE_F32); +diff --git a/ggml/src/ggml-cuda/mmq.cuh b/ggml/src/ggml-cuda/mmq.cuh +index b53e38a8b..690dc694a 100644 +--- a/ggml/src/ggml-cuda/mmq.cuh ++++ b/ggml/src/ggml-cuda/mmq.cuh +@@ -4334,6 +4334,31 @@ extern DECL_MMQ_CASE(GGML_TYPE_IQ4_XS); + void ggml_cuda_mul_mat_q( + ggml_backend_cuda_context & ctx, const ggml_tensor * src0, const ggml_tensor * src1, const ggml_tensor * ids, ggml_tensor * dst); + ++struct ggml_cuda_mmq_ids_meta { ++ ggml_cuda_pool_alloc ids_src1; ++ ggml_cuda_pool_alloc ids_dst; ++ ggml_cuda_pool_alloc expert_bounds; ++ ++ int64_t ne_get_rows = 0; ++ ++ ggml_cuda_mmq_ids_meta() = default; ++ ggml_cuda_mmq_ids_meta(ggml_cuda_pool & pool, int64_t ne_get_rows, int64_t n_experts); ++ ++ void alloc(ggml_cuda_pool & pool, int64_t ne_get_rows, int64_t n_experts); ++ ++ void build( ++ const ggml_tensor * ids, int64_t n_experts, int64_t n_tokens, ++ int64_t n_expert_used, int64_t nchannels_y, int64_t sis1, ++ cudaStream_t stream); ++}; ++ ++void ggml_cuda_mul_mat_q_moe_quantized( ++ ggml_backend_cuda_context & ctx, ++ const ggml_tensor * src0, const void * src1_q, ggml_tensor * dst, ++ const int32_t * ids_dst, const int32_t * expert_bounds, ++ int64_t n_tokens, int64_t n_expert_used, int64_t n_experts, ++ int64_t ncols_src1_padded); ++ + void ggml_cuda_op_mul_mat_q( + ggml_backend_cuda_context & ctx, + const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst, const char * 
src0_dd_i, const float * src1_ddf_i, +diff --git a/ggml/src/ggml-cuda/moe-ffn.cu b/ggml/src/ggml-cuda/moe-ffn.cu +new file mode 100644 +index 000000000..507390d65 +--- /dev/null ++++ b/ggml/src/ggml-cuda/moe-ffn.cu +@@ -0,0 +1,296 @@ ++#include "moe-ffn.cuh" ++#include "getrows.cuh" ++#include "mmq.cuh" ++#include "unary.cuh" ++ ++#include ++ ++bool ggml_cuda_moe_routed_ffn_poc_enabled() { ++ static const bool enabled = [] { ++ const char * value = getenv("LLAMA_MOE_ROUTED_FFN_POC"); ++ return value != nullptr && atoi(value) != 0; ++ }(); ++ return enabled; ++} ++ ++bool ggml_cuda_moe_routed_ffn_poc_should_engage( ++ const ggml_tensor * gate_up, ++ const ggml_tensor * gate, ++ const ggml_tensor * up, ++ const ggml_tensor * glu, ++ const ggml_tensor * down, ++ const ggml_tensor * ids, ++ int cc) { ++ if (!blackwell_mma_available(cc)) { ++ return false; ++ } ++ if (gate_up == nullptr || gate == nullptr || up == nullptr || glu == nullptr || down == nullptr || ids == nullptr) { ++ return false; ++ } ++ if (gate_up->op != GGML_OP_MUL_MAT_ID || down->op != GGML_OP_MUL_MAT_ID) { ++ return false; ++ } ++ if (gate->op != GGML_OP_VIEW || up->op != GGML_OP_VIEW || gate->view_src != gate_up || up->view_src != gate_up) { ++ return false; ++ } ++ if (glu->op != GGML_OP_GLU || ggml_get_glu_op(glu) != GGML_GLU_OP_SWIGLU || down->src[1] != glu) { ++ return false; ++ } ++ if (gate_up->src[2] != ids || down->src[2] != ids) { ++ return false; ++ } ++ ++ const ggml_tensor * down_w = down->src[0]; ++ if (down_w == nullptr || (down_w->type != GGML_TYPE_NVFP4 && down_w->type != GGML_TYPE_MXFP4)) { ++ return false; ++ } ++ ++ return true; ++} ++ ++static bool ggml_cuda_moe_routed_ffn_fused_quant_enabled() { ++ static const bool enabled = [] { ++ const char * value = getenv("LLAMA_MOE_ROUTED_FFN_FUSED_QUANT"); ++ return value != nullptr && atoi(value) != 0; ++ }(); ++ return enabled; ++} ++ ++static bool ggml_cuda_moe_routed_ffn_down_supported(const ggml_tensor * glu, const ggml_tensor * 
down) { ++ const ggml_tensor * down_w = down != nullptr ? down->src[0] : nullptr; ++ const ggml_tensor * ids = down != nullptr ? down->src[2] : nullptr; ++ if (glu == nullptr || down == nullptr || down_w == nullptr || ids == nullptr) { ++ return false; ++ } ++ if (down_w->type != GGML_TYPE_NVFP4 && down_w->type != GGML_TYPE_MXFP4) { ++ return false; ++ } ++ if (glu->type != GGML_TYPE_F32 || down->type != GGML_TYPE_F32 || ids->type != GGML_TYPE_I32) { ++ return false; ++ } ++ if (glu->ne[3] != 1 || down->ne[3] != 1 || ids->ne[2] != 1 || ids->ne[3] != 1) { ++ return false; ++ } ++ if (down_w->ne[0] != glu->ne[0] || down_w->ne[1] != down->ne[0] || down_w->ne[2] <= 0) { ++ return false; ++ } ++ if (ids->ne[0] != glu->ne[1] || ids->ne[1] != glu->ne[2]) { ++ return false; ++ } ++ if (down->ne[1] != glu->ne[1] || down->ne[2] != glu->ne[2]) { ++ return false; ++ } ++ if (ids->nb[0] != ggml_element_size(ids)) { ++ return false; ++ } ++ if (glu->nb[0] != sizeof(float) || down->nb[0] != sizeof(float)) { ++ return false; ++ } ++ if (glu->nb[1] != (size_t) (glu->ne[0] * (int64_t) sizeof(float)) || ++ glu->nb[2] != (size_t) (glu->ne[1] * (int64_t) glu->nb[1])) { ++ return false; ++ } ++ if (down->nb[1] != (size_t) (down->ne[0] * (int64_t) sizeof(float)) || ++ down->nb[2] != (size_t) (down->ne[1] * (int64_t) down->nb[1])) { ++ return false; ++ } ++ ++ const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc; ++ return ggml_cuda_should_use_mmq(down_w->type, cc, glu->ne[2], down_w->ne[2]); ++} ++ ++static __global__ void moe_swiglu_nvfp4_quant_kernel( ++ const float * __restrict__ gate, ++ const float * __restrict__ up, ++ const int32_t * __restrict__ ids_src1, ++ void * __restrict__ vy, ++ int64_t n_ff, ++ int64_t n_ff_padded, ++ int64_t n_rows, ++ int64_t n_used, ++ int64_t gate_s1, ++ int64_t gate_s2, ++ int64_t up_s1, ++ int64_t up_s2) { ++#if defined(BLACKWELL_MMA_AVAILABLE) ++ const int64_t i0_base = ((int64_t) blockDim.x * blockIdx.y + threadIdx.x) * QK_NVFP4_SUB; 
++ if (i0_base >= n_ff_padded) { ++ return; ++ } ++ ++ const int64_t row = blockIdx.x; ++ const int64_t src_row = ids_src1[row]; ++ const int64_t token = src_row / n_used; ++ const int64_t used = src_row - token * n_used; ++ const int64_t k_block = i0_base / QK_K; ++ const int64_t blocks_per_col = (n_ff_padded + QK_K - 1) / QK_K; ++ if (k_block >= blocks_per_col) { ++ return; ++ } ++ ++ block_fp4_mmq * y = (block_fp4_mmq *) vy; ++ block_fp4_mmq * yb = y + k_block * n_rows + row; ++ const int sub = (i0_base % QK_K) / QK_NVFP4_SUB; ++ ++ float vals_raw[QK_NVFP4_SUB]; ++ float amax_raw = 0.0f; ++#pragma unroll ++ for (int k = 0; k < QK_NVFP4_SUB; k++) { ++ const int64_t i0 = i0_base + k; ++ if (i0 < n_ff) { ++ const float g = gate[token * gate_s2 + used * gate_s1 + i0]; ++ const float u = up[token * up_s2 + used * up_s1 + i0]; ++ const float v = ggml_cuda_op_silu_single(g) * u; ++ vals_raw[k] = v; ++ amax_raw = fmaxf(amax_raw, fabsf(v)); ++ } else { ++ vals_raw[k] = 0.0f; ++ } ++ } ++ ++ static constexpr int test_offsets[5] = { 0, -1, 1, -2, 2 }; ++ const int first_fp8_code = (int) ggml_cuda_fp32_to_ue4m3(amax_raw / 6.0f); ++ ++ float best_err = FLT_MAX; ++ uint8_t fp8_code = 0; ++ float subblock_scale = 0.0f; ++ ++#pragma unroll ++ for (int i = 0; i < 5; i++) { ++ const int test_code = first_fp8_code + test_offsets[i]; ++ if (test_code < 0 || test_code > 0x7e) { ++ continue; ++ } ++ const uint8_t code = (uint8_t) test_code; ++ const float test_scale = ggml_cuda_ue4m3_to_fp32(code); ++ const float test_inv_scale = test_scale > 0.0f ? 
0.5f / test_scale : 0.0f; ++ float cur_err = 0.0f; ++#pragma unroll ++ for (int k = 0; k < QK_NVFP4_SUB; ++k) { ++ const float v = vals_raw[k]; ++ const uint8_t q = ggml_cuda_float_to_fp4_e2m1(v, test_inv_scale); ++ const float err_diff = fabsf(v) - fabsf(kvalues_mxfp4[q & 0x7]) * test_scale; ++ cur_err = fmaf(err_diff, err_diff, cur_err); ++ } ++ ++ if (cur_err < best_err) { ++ best_err = cur_err; ++ fp8_code = test_code; ++ subblock_scale = test_scale; ++ } ++ } ++ ++ const float inv_scale = subblock_scale > 0.0f ? 0.5f / subblock_scale : 0.0f; ++ uint32_t q0 = 0; ++ uint32_t q1 = 0; ++#pragma unroll ++ for (int k = 0; k < QK_NVFP4_SUB / 4; ++k) { ++ q0 |= (uint32_t) ggml_cuda_float_to_fp4_e2m1(vals_raw[k + 0], inv_scale) << (8 * k); ++ q0 |= (uint32_t) ggml_cuda_float_to_fp4_e2m1(vals_raw[k + 8], inv_scale) << (8 * k + 4); ++ q1 |= (uint32_t) ggml_cuda_float_to_fp4_e2m1(vals_raw[k + 4], inv_scale) << (8 * k); ++ q1 |= (uint32_t) ggml_cuda_float_to_fp4_e2m1(vals_raw[k + 12], inv_scale) << (8 * k + 4); ++ } ++ ++ uint32_t * yqs = reinterpret_cast(yb->qs); ++ yqs[2 * sub + 0] = q0; ++ yqs[2 * sub + 1] = q1; ++ reinterpret_cast(yb->d4)[sub] = fp8_code; ++#else ++ NO_DEVICE_CODE; ++#endif ++} ++ ++static bool ggml_cuda_moe_routed_ffn_fused_quant( ++ ggml_backend_cuda_context & ctx, ++ ggml_tensor * gate, ++ ggml_tensor * up, ++ ggml_tensor * glu, ++ ggml_tensor * down) { ++ if (!ggml_cuda_moe_routed_ffn_fused_quant_enabled()) { ++ return false; ++ } ++ if (!ggml_cuda_moe_routed_ffn_down_supported(glu, down)) { ++ return false; ++ } ++ const ggml_tensor * down_w = down->src[0]; ++ const ggml_tensor * ids = down->src[2]; ++ if (down_w->type != GGML_TYPE_NVFP4) { ++ return false; ++ } ++ if (gate == nullptr || up == nullptr || gate->type != GGML_TYPE_F32 || up->type != GGML_TYPE_F32) { ++ return false; ++ } ++ if (gate->ne[0] != glu->ne[0] || gate->ne[1] != glu->ne[1] || gate->ne[2] != glu->ne[2] || gate->ne[3] != glu->ne[3]) { ++ return false; ++ } ++ if (up->ne[0] != 
glu->ne[0] || up->ne[1] != glu->ne[1] || up->ne[2] != glu->ne[2] || up->ne[3] != glu->ne[3]) { ++ return false; ++ } ++ if (gate->nb[0] != sizeof(float) || up->nb[0] != sizeof(float)) { ++ return false; ++ } ++ ++ const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc; ++ if (!blackwell_mma_available(cc)) { ++ return false; ++ } ++ ++ const int64_t n_ff = glu->ne[0]; ++ const int64_t n_ff_padded = GGML_PAD(n_ff, MATRIX_ROW_PADDING); ++ if (n_ff % QK_NVFP4 != 0) { ++ return false; ++ } ++ ++ const int64_t n_expert_used = ids->ne[0]; ++ const int64_t n_tokens = glu->ne[2]; ++ const int64_t n_experts = down_w->ne[2]; ++ const int64_t ne_get_rows = n_tokens * n_expert_used; ++ ++ ggml_cuda_mmq_ids_meta ids_meta(ctx.pool(), ne_get_rows, n_experts); ++ const int64_t sis1 = glu->nb[2] / glu->nb[1]; ++ ids_meta.build(ids, n_experts, n_tokens, n_expert_used, glu->ne[1], sis1, ctx.stream()); ++ ++ const size_t nbytes_src1_q = ne_get_rows * n_ff_padded * sizeof(block_fp4_mmq) / QK_K + ++ get_mmq_x_max_host(cc) * sizeof(block_q8_1_mmq); ++ ggml_cuda_pool_alloc src1_q(ctx.pool(), nbytes_src1_q); ++ ++ constexpr int nvfp4_block_size = 128; ++ const int64_t block_num_y = (n_ff_padded + QK_NVFP4_SUB * nvfp4_block_size - 1) / (QK_NVFP4_SUB * nvfp4_block_size); ++ const dim3 block_size(nvfp4_block_size, 1, 1); ++ const dim3 num_blocks(ne_get_rows, block_num_y, 1); ++ moe_swiglu_nvfp4_quant_kernel<<>>( ++ (const float *) gate->data, (const float *) up->data, ids_meta.ids_src1.get(), src1_q.get(), ++ n_ff, n_ff_padded, ne_get_rows, n_expert_used, ++ gate->nb[1] / sizeof(float), gate->nb[2] / sizeof(float), ++ up->nb[1] / sizeof(float), up->nb[2] / sizeof(float)); ++ CUDA_CHECK(cudaGetLastError()); ++ ++ ggml_cuda_mul_mat_q_moe_quantized( ++ ctx, down_w, src1_q.get(), down, ++ ids_meta.ids_dst.get(), ids_meta.expert_bounds.get(), ++ n_tokens, n_expert_used, n_experts, n_ff_padded); ++ return true; ++} ++ ++bool ggml_cuda_moe_routed_ffn_poc( ++ ggml_backend_cuda_context & 
ctx, ++ ggml_tensor * gate_up, ++ ggml_tensor * gate, ++ ggml_tensor * up, ++ ggml_tensor * glu, ++ ggml_tensor * down) { ++ if (!ggml_cuda_compute_forward(ctx, gate_up)) { ++ return false; ++ } ++ if (ggml_cuda_moe_routed_ffn_fused_quant(ctx, gate, up, glu, down)) { ++ return true; ++ } ++ if (!ggml_cuda_compute_forward(ctx, glu)) { ++ return false; ++ } ++ if (!ggml_cuda_compute_forward(ctx, down)) { ++ return false; ++ } ++ ++ return true; ++} +diff --git a/ggml/src/ggml-cuda/moe-ffn.cuh b/ggml/src/ggml-cuda/moe-ffn.cuh +new file mode 100644 +index 000000000..ce385d31a +--- /dev/null ++++ b/ggml/src/ggml-cuda/moe-ffn.cuh +@@ -0,0 +1,24 @@ ++#pragma once ++ ++#include "common.cuh" ++ ++bool ggml_cuda_compute_forward(ggml_backend_cuda_context & ctx, ggml_tensor * dst); ++ ++bool ggml_cuda_moe_routed_ffn_poc_enabled(); ++ ++bool ggml_cuda_moe_routed_ffn_poc_should_engage( ++ const ggml_tensor * gate_up, ++ const ggml_tensor * gate, ++ const ggml_tensor * up, ++ const ggml_tensor * glu, ++ const ggml_tensor * down, ++ const ggml_tensor * ids, ++ int cc); ++ ++bool ggml_cuda_moe_routed_ffn_poc( ++ ggml_backend_cuda_context & ctx, ++ ggml_tensor * gate_up, ++ ggml_tensor * gate, ++ ggml_tensor * up, ++ ggml_tensor * glu, ++ ggml_tensor * down); +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0053-feat-paged-P1-bf16-stream-residual-segment-executor-.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0053-feat-paged-P1-bf16-stream-residual-segment-executor-.patch new file mode 100644 index 000000000000..7f78a8b70e57 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0053-feat-paged-P1-bf16-stream-residual-segment-executor-.patch @@ -0,0 +1,922 @@ +From 1271488fc47d7db2319163d0b34601dd30d49250 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Thu, 2 Jul 2026 16:29:07 +0200 +Subject: [PATCH 53/55] feat(paged): P1 bf16-stream residual-segment executor + + norm-bf16 kernels + +Additive, default-off 
(LLAMA_BF16_STREAM=1) bf16-resident activation stream +for the q36 residual path, targeting prefill bucket 3 (the convert/glue tax). + +- norm-bf16.cu/.cuh: rms_norm, the 0042 pre_add_mul, and the 0044 gate_mul + norms templated on output dtype, bit-faithful to the f32 kernels up to the + __float2bfloat16 store. +- One additive clause in ggml_cuda_try_fuse detects a residual-stream + norm-producer whose consumers are all large-M cuBLAS-bf16 projections, runs + the norm into a bf16 pool buffer, executes the owned span inline through a + bf16 view, then skips it. Strict all-consumers-are-ours guard keeps the f32 + norm un-materialised and bails to the stock f32 path otherwise (small-M, + decode, MMQ, native-FP4, multi-consumer). +- LLAMA_BF16_CUBLAS_F32_OUT plank: owned projections write f32 directly from + bf16 tensor-core compute, skipping the bf16 dst pool + bf16->f32 convert; + the F32_OUT else-branch is byte-identical to the original cuBLAS path. + +Default md5 stays canonical with the code present-but-off and env-on (small-M +prompts bail): MoE 8cb0ce23777bf55f92f63d0292c756b0, dense 5951a5b4d624ce891e22ab5fca9bc439. 
+ +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + ggml/src/ggml-cuda/ggml-cuda.cu | 297 +++++++++++++++++-- + ggml/src/ggml-cuda/norm-bf16.cu | 483 +++++++++++++++++++++++++++++++ + ggml/src/ggml-cuda/norm-bf16.cuh | 37 +++ + 3 files changed, 793 insertions(+), 24 deletions(-) + create mode 100644 ggml/src/ggml-cuda/norm-bf16.cu + create mode 100644 ggml/src/ggml-cuda/norm-bf16.cuh + +diff --git a/ggml/src/ggml-cuda/ggml-cuda.cu b/ggml/src/ggml-cuda/ggml-cuda.cu +index ef1bdc3b4..5626ccebb 100644 +--- a/ggml/src/ggml-cuda/ggml-cuda.cu ++++ b/ggml/src/ggml-cuda/ggml-cuda.cu +@@ -36,6 +36,7 @@ + #include "ggml-cuda/mmvf.cuh" + #include "ggml-cuda/mmvq.cuh" + #include "ggml-cuda/norm.cuh" ++#include "ggml-cuda/norm-bf16.cuh" + #include "ggml-cuda/opt-step-adamw.cuh" + #include "ggml-cuda/opt-step-sgd.cuh" + #include "ggml-cuda/out-prod.cuh" +@@ -1628,12 +1629,29 @@ static const cublas_force_compute_type & ggml_cuda_cublas_get_force_compute_type + return compute_type; + } + ++// [P1 bf16-stream] LLAMA_BF16_CUBLAS_F32_OUT plank. When set (by the bf16-stream ++// segment executor around an owned projection, or globally via the env), the cuBLAS ++// bf16/nvfp4 GEMM writes f32 directly from the bf16 tensor-core compute, skipping the ++// bf16 dst pool buffer + the bf16->f32 output convert_dtype. The result is the full ++// f32 GEMM accumulation (the current path rounds it to bf16 then widens back), so this ++// is a strictly-more-precise dtype change gated on the opt-in KL path, never md5. 
++static thread_local bool g_bf16_stream_f32_out = false; ++static bool ggml_cuda_bf16_cublas_f32_out_env() { ++ static const bool e = [] { ++ const char * s = getenv("LLAMA_BF16_CUBLAS_F32_OUT"); ++ return s != nullptr && atoi(s) != 0; ++ }(); ++ return e; ++} ++ + static void ggml_cuda_op_mul_mat_cublas( + ggml_backend_cuda_context & ctx, + const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst, const char * src0_dd_i, const float * src1_ddf_i, + const char * src1_ddq_i, float * dst_dd_i, const int64_t row_low, const int64_t row_high, const int64_t src1_ncols, + const int64_t src1_padded_row_size, cudaStream_t stream) { + ++ const bool bf16_stream_f32_out = g_bf16_stream_f32_out || ggml_cuda_bf16_cublas_f32_out_env(); ++ + GGML_ASSERT(src0_dd_i != nullptr); + GGML_ASSERT(src1_ddf_i != nullptr); + GGML_ASSERT(dst_dd_i != nullptr); +@@ -1686,23 +1704,34 @@ static void ggml_cuda_op_mul_mat_cublas( + } + const nv_bfloat16 * src1_ptr = src1->type == GGML_TYPE_BF16 ? (const nv_bfloat16 *) src1_ddf_i : src1_as_bf16.get(); + const nv_bfloat16 * src0_ptr = src0_as_bf16.get(); +- ggml_cuda_pool_alloc<nv_bfloat16> dst_bf16(ctx.pool(id), row_diff*src1_ncols); + + const float alpha_f32 = 1.0f; + const float beta_f32 = 0.0f; + + CUBLAS_CHECK(cublasSetStream(ctx.cublas_handle(id), stream)); +- CUBLAS_CHECK( +- cublasGemmEx(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, +- row_diff, src1_ncols, ne10, +- &alpha_f32, src0_ptr, CUDA_R_16BF, ne00, +- src1_ptr, CUDA_R_16BF, ne10, +- &beta_f32, dst_bf16.get(), CUDA_R_16BF, ldc, +- CUBLAS_COMPUTE_32F, +- CUBLAS_GEMM_DEFAULT_TENSOR_OP)); +- +- const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(GGML_TYPE_BF16); +- to_fp32_cuda(dst_bf16.get(), dst_dd_i, row_diff*src1_ncols, stream); ++ if (bf16_stream_f32_out) { ++ // [P1 bf16-stream] write f32 directly, skip the bf16 dst pool + convert.
++ CUBLAS_CHECK( ++ cublasGemmEx(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, ++ row_diff, src1_ncols, ne10, ++ &alpha_f32, src0_ptr, CUDA_R_16BF, ne00, ++ src1_ptr, CUDA_R_16BF, ne10, ++ &beta_f32, dst_dd_i, CUDA_R_32F, ldc, ++ CUBLAS_COMPUTE_32F, ++ CUBLAS_GEMM_DEFAULT_TENSOR_OP)); ++ } else { ++ ggml_cuda_pool_alloc<nv_bfloat16> dst_bf16(ctx.pool(id), row_diff*src1_ncols); ++ CUBLAS_CHECK( ++ cublasGemmEx(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, ++ row_diff, src1_ncols, ne10, ++ &alpha_f32, src0_ptr, CUDA_R_16BF, ne00, ++ src1_ptr, CUDA_R_16BF, ne10, ++ &beta_f32, dst_bf16.get(), CUDA_R_16BF, ldc, ++ CUBLAS_COMPUTE_32F, ++ CUBLAS_GEMM_DEFAULT_TENSOR_OP)); ++ const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(GGML_TYPE_BF16); ++ to_fp32_cuda(dst_bf16.get(), dst_dd_i, row_diff*src1_ncols, stream); ++ } + } else if (supports_bf16 && src0->type == GGML_TYPE_BF16 && ggml_is_contiguous(src0) && row_diff == src0->ne[1]) { + ggml_cuda_pool_alloc<nv_bfloat16> src1_as_bf16(ctx.pool(id)); + if (src1->type != GGML_TYPE_BF16) { +@@ -1714,23 +1743,34 @@ static void ggml_cuda_op_mul_mat_cublas( + } + const nv_bfloat16 * src1_ptr = src1->type == GGML_TYPE_BF16 ?
(const nv_bfloat16 *) src1_ddf_i : src1_as_bf16.get(); + const nv_bfloat16 * src0_ptr = (const nv_bfloat16 *)src0_dd_i; +- ggml_cuda_pool_alloc<nv_bfloat16> dst_bf16(ctx.pool(id), row_diff*src1_ncols); + + const float alpha_f32 = 1.0f; + const float beta_f32 = 0.0f; + + CUBLAS_CHECK(cublasSetStream(ctx.cublas_handle(id), stream)); +- CUBLAS_CHECK( +- cublasGemmEx(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, +- row_diff, src1_ncols, ne10, +- &alpha_f32, src0_ptr, CUDA_R_16BF, ne00, +- src1_ptr, CUDA_R_16BF, ne10, +- &beta_f32, dst_bf16.get(), CUDA_R_16BF, ldc, +- CUBLAS_COMPUTE_32F, +- CUBLAS_GEMM_DEFAULT_TENSOR_OP)); +- +- const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(GGML_TYPE_BF16); +- to_fp32_cuda(dst_bf16.get(), dst_dd_i, row_diff*src1_ncols, stream); ++ if (bf16_stream_f32_out) { ++ // [P1 bf16-stream] write f32 directly, skip the bf16 dst pool + convert. ++ CUBLAS_CHECK( ++ cublasGemmEx(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, ++ row_diff, src1_ncols, ne10, ++ &alpha_f32, src0_ptr, CUDA_R_16BF, ne00, ++ src1_ptr, CUDA_R_16BF, ne10, ++ &beta_f32, dst_dd_i, CUDA_R_32F, ldc, ++ CUBLAS_COMPUTE_32F, ++ CUBLAS_GEMM_DEFAULT_TENSOR_OP)); ++ } else { ++ ggml_cuda_pool_alloc<nv_bfloat16> dst_bf16(ctx.pool(id), row_diff*src1_ncols); ++ CUBLAS_CHECK( ++ cublasGemmEx(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, ++ row_diff, src1_ncols, ne10, ++ &alpha_f32, src0_ptr, CUDA_R_16BF, ne00, ++ src1_ptr, CUDA_R_16BF, ne10, ++ &beta_f32, dst_bf16.get(), CUDA_R_16BF, ldc, ++ CUBLAS_COMPUTE_32F, ++ CUBLAS_GEMM_DEFAULT_TENSOR_OP)); ++ const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(GGML_TYPE_BF16); ++ to_fp32_cuda(dst_bf16.get(), dst_dd_i, row_diff*src1_ncols, stream); ++ } + } else if (fast_fp16_hardware_available(cc) && use_fp16) { + // convert src0 and src1 to fp16, multiply as fp16, convert dst to fp32 + ggml_cuda_pool_alloc<half> src0_as_f16(ctx.pool(id)); +@@ -4706,6 +4746,215 @@ static int ggml_cuda_try_fuse(ggml_backend_cuda_context * cuda_ctx, ggml_cgraph + return 2;
+ } + ++ // [P1 bf16-stream] Generalized additive segment executor (LLAMA_BF16_STREAM=1, ++ // default off). ONE clause; the residual-stream segment is detected inside it. ++ // Owns any norm-producer whose consumers are ALL large-M cuBLAS-bf16 projections and ++ // runs that norm into a bf16 pool buffer so every projection reads the bf16 ++ // activation directly - no per-op f32->bf16 convert_dtype glue. Two live q36 kinds: ++ // * plain rms_norm+mul {RMS_NORM,MUL} -> BF16 q/k/v / GDN in_proj (may be ++ // multi-consumer: q,k,v share it) ++ // * 0044 gated-DeltaNet output norm {SILU,RMS_NORM,MUL,MUL} -> ssm_out (the P0 seg) ++ // (The 0042 {ADD,RMS_NORM,MUL} residual-fused norm is handled by its f32 clause below ++ // and, on q36, feeds the NVFP4-MMQ experts, so a bf16 stream there would bail; its ++ // bf16 variant lives in norm-bf16.cu for op-set completeness.) ++ // ++ // Correctness: strict all-consumers-are-ours guard - the f32 norm output is never ++ // materialised, so every node that transitively reads it must be one of our owned ++ // projections (as src1); any other reader, or an unrelated compute node inside the ++ // skipped span, bails and the f32 fused-norm path runs unchanged. Each projection is ++ // executed inline through a bf16 view of the shared buffer; the whole owned span ++ // (norm nodes + intervening pure-view no-ops + the projections) is then skipped. The ++ // LLAMA_BF16_CUBLAS_F32_OUT plank additionally makes the owned projections write f32 ++ // directly (skipping the dst convert). Env-off path and decode/small-M md5 untouched. ++ static const bool bf16_stream = [] { ++ const char * e = getenv("LLAMA_BF16_STREAM"); ++ return e != nullptr && atoi(e) != 0; ++ }(); ++ static const int bf16_stream_trace = [] { ++ const char * e = getenv("LLAMA_BF16_STREAM_TRACE"); ++ return e != nullptr ? 
atoi(e) : 0; ++ }(); ++ static const bool bf16_stream_f32_out_default = [] { ++ const char * e = getenv("LLAMA_BF16_CUBLAS_F32_OUT"); ++ return e == nullptr || atoi(e) != 0; // plank ON by default when a segment engages ++ }(); ++ if (bf16_stream) { ++ // ---- detect the norm-producer kind + the f32 activation tensor + node span ---- ++ int kind = 0; // 1=plain rms+mul, 2=gated-DeltaNet output norm ++ int norm_span = 0; ++ const char * seg_kind = nullptr; ++ ggml_tensor * k_rms = nullptr, * k_mul = nullptr, * k_silu = nullptr; ++ ggml_tensor * norm_out = nullptr; ++ if (ggml_cuda_can_fuse(cgraph, i, { GGML_OP_UNARY, GGML_OP_RMS_NORM, GGML_OP_MUL, GGML_OP_MUL }, { GGML_UNARY_OP_SILU })) { ++ kind = 2; k_silu = cgraph->nodes[i]; k_rms = cgraph->nodes[i + 1]; k_mul = cgraph->nodes[i + 2]; ++ norm_out = cgraph->nodes[i + 3]; norm_span = 4; seg_kind = "gate_norm"; ++ } else if (ggml_cuda_can_fuse(cgraph, i, { GGML_OP_RMS_NORM, GGML_OP_MUL }, {})) { ++ kind = 1; k_rms = cgraph->nodes[i]; k_mul = cgraph->nodes[i + 1]; ++ norm_out = cgraph->nodes[i + 1]; norm_span = 2; seg_kind = "rms_norm"; ++ } ++ ++ if (kind != 0) { ++ const int cc = ggml_cuda_info().devices[ggml_cuda_get_device()].cc; ++ const int norm_end = i + norm_span; ++ ++ // follow a pure view/reshape chain up to norm_out ++ auto roots_at = [](const ggml_tensor * t, const ggml_tensor * root) -> bool { ++ const ggml_tensor * c = t; ++ for (int d = 0; d < 8 && c != nullptr; ++d) { ++ if (c == root) return true; ++ if (c->view_src) { c = c->view_src; continue; } ++ if ((c->op == GGML_OP_RESHAPE || c->op == GGML_OP_VIEW || c->op == GGML_OP_PERMUTE || ++ c->op == GGML_OP_TRANSPOSE || c->op == GGML_OP_CONT) && c->src[0]) { c = c->src[0]; continue; } ++ break; ++ } ++ return false; ++ }; ++ // Metadata-only no-ops (match the stock capture loop's skip set). 
CONT is ++ // NOT here: it materializes a contiguous copy, so a CONT of norm_out must fall ++ // through to the roots_at check below and bail (it would need the f32 norm). ++ auto is_pure_view = [](const ggml_tensor * t) -> bool { ++ return t->op == GGML_OP_RESHAPE || t->op == GGML_OP_VIEW || t->op == GGML_OP_PERMUTE || ++ t->op == GGML_OP_TRANSPOSE || t->op == GGML_OP_NONE; ++ }; ++ // ownable large-M cuBLAS-bf16 projection whose src1 is the FULL norm output ++ auto is_owned_proj = [&](const ggml_tensor * p) -> bool { ++ if (p->op != GGML_OP_MUL_MAT) return false; ++ const ggml_tensor * w = p->src[0]; ++ const ggml_tensor * x1 = p->src[1]; ++ if (!w || !x1) return false; ++ if (!(x1 == norm_out || x1->view_src == norm_out || ++ (x1->op == GGML_OP_RESHAPE && x1->src[0] == norm_out))) return false; ++ if (ggml_nelements(x1) != ggml_nelements(norm_out)) return false; // full, offset 0 ++ return (w->type == GGML_TYPE_BF16 || w->type == GGML_TYPE_NVFP4) && ggml_is_contiguous(w) && ++ p->type == GGML_TYPE_F32 && ++ x1->ne[2] == 1 && x1->ne[3] == 1 && ++ x1->ne[1] >= 128 && ++ !ggml_cuda_fp4_prefill_should_engage(w, x1, const_cast<ggml_tensor *>(p), cc) && ++ !ggml_cuda_should_use_mmq(w->type, cc, x1->ne[1], /*n_experts=*/0); ++ }; ++ ++ // Scan the rest of the graph: collect our projections, enforce that every ++ // consumer of norm_out is one of them, and that the skipped span holds only ++ // pure views / our projections. ++ bool ok = true; ++ int n_proj = 0; ++ int max_proj_idx = -1; ++ const char * miss_reason = "unknown"; ++ int miss_node = -1; ++ const char * miss_op = ""; ++ ggml_tensor * projs[16]; ++ for (int j = norm_end; j < cgraph->n_nodes && ok; ++j) { ++ ggml_tensor * nj = cgraph->nodes[j]; ++ // Pure view/reshape no-ops are part of the src1 view chain (or unrelated ++ // metadata ops): they carry no kernel and are re-expressed by the inline ++ // bf16 src1, so they never force f32 materialization.
A *real* downstream ++ // consumer that reads norm_out through such a view is still caught below, ++ // because roots_at() climbs the view chain to norm_out. ++ if (is_pure_view(nj)) { ++ continue; ++ } ++ if (is_owned_proj(nj)) { ++ if (n_proj < 16) { projs[n_proj] = nj; } ++ n_proj++; ++ max_proj_idx = j; ++ continue; ++ } ++ // any (non-view, non-projection) reader of norm_out disqualifies the segment ++ for (int s = 0; s < GGML_MAX_SRC; ++s) { ++ if (nj->src[s] && roots_at(nj->src[s], norm_out)) { ++ ok = false; miss_reason = "nonproj_consumer"; miss_node = j; miss_op = ggml_op_name(nj->op); break; ++ } ++ } ++ } ++ // require projections, room in the fixed buffer, and a bounded span. The ++ // span [norm_end, max_proj_idx] may hold non-projection compute (q36 QK-norm / ++ // scale on the projection outputs); those never read norm_out (enforced above) ++ // so the whole span is executed inline in graph order below - owned projections ++ // through the bf16 buffer, everything else via the stock per-node executor - ++ // and then skipped as one unit. ++ const int span_len = max_proj_idx - norm_end; ++ if (!(n_proj >= 1 && n_proj <= 16 && span_len <= 96)) { ++ if (ok) { miss_reason = (n_proj == 0) ? "no_owned_proj" : (n_proj > 16 ? "too_many_proj" : "span_too_long"); } ++ ok = false; ++ } ++ ++ if (ok) { ++ const int64_t ne_tot = ggml_nelements(norm_out); ++ ggml_cuda_pool_alloc<nv_bfloat16> norm_bf16(cuda_ctx->pool(), ne_tot); ++ if (kind == 2) { ++ ggml_cuda_rms_norm_gate_mul_bf16out(*cuda_ctx, k_rms, k_mul, k_silu, norm_out, norm_bf16.get()); ++ } else { ++ ggml_cuda_rms_norm_mul_bf16out(*cuda_ctx, k_rms, k_mul, norm_bf16.get()); ++ } ++ ++ // Execute the whole owned span inline, in graph order (mirrors the stock ++ // capture loop's per-node handling for the non-owned nodes).
++ for (int j = norm_end; j <= max_proj_idx; ++j) { ++ ggml_tensor * nj = cgraph->nodes[j]; ++ ++ bool mine = false; ++ for (int p = 0; p < n_proj; ++p) { if (projs[p] == nj) { mine = true; break; } } ++ ++ if (mine) { ++ ggml_tensor * proj_src1 = nj->src[1]; ++ ggml_tensor src1_bf16 = *proj_src1; ++ src1_bf16.type = GGML_TYPE_BF16; ++ src1_bf16.data = norm_bf16.get(); ++ src1_bf16.view_src = nullptr; ++ src1_bf16.view_offs = 0; ++ src1_bf16.nb[0] = sizeof(nv_bfloat16); ++ src1_bf16.nb[1] = src1_bf16.nb[0] * src1_bf16.ne[0]; ++ src1_bf16.nb[2] = src1_bf16.nb[1] * src1_bf16.ne[1]; ++ src1_bf16.nb[3] = src1_bf16.nb[2] * src1_bf16.ne[2]; ++ ++ ggml_tensor * saved_src1 = nj->src[1]; ++ nj->src[1] = &src1_bf16; ++ g_bf16_stream_f32_out = bf16_stream_f32_out_default; ++ const bool okc = ggml_cuda_compute_forward(*cuda_ctx, nj); ++ g_bf16_stream_f32_out = false; ++ nj->src[1] = saved_src1; ++ GGML_ASSERT(okc); ++ continue; ++ } ++ ++ // non-owned span node: mirror the stock loop (skip metadata no-ops, ++ // run the rest through the per-node executor) ++ if (ggml_is_empty(nj) || nj->op == GGML_OP_RESHAPE || nj->op == GGML_OP_TRANSPOSE || ++ nj->op == GGML_OP_VIEW || nj->op == GGML_OP_PERMUTE || nj->op == GGML_OP_NONE) { ++ continue; ++ } ++ if ((nj->flags & GGML_TENSOR_FLAG_COMPUTE) == 0) { ++ continue; ++ } ++ const bool okn = ggml_cuda_compute_forward(*cuda_ctx, nj); ++ GGML_ASSERT(okn); ++ } ++ ++ static std::atomic<int> bf16_stream_engage_count{0}; ++ const int ec = bf16_stream_engage_count.fetch_add(1, std::memory_order_relaxed); ++ if (bf16_stream_trace > 0 && ec < bf16_stream_trace) { ++ const ggml_tensor * w0 = projs[0]->src[0]; ++ fprintf(stderr, ++ "[LLAMA_BF16_STREAM] engaged seg=%s node=%d n_proj=%d last_proj=%d " ++ "M=%" PRId64 " N=%" PRId64 " K=%" PRId64 " f32out=%d skip=%d\n", ++ seg_kind, i, n_proj, max_proj_idx, projs[0]->src[1]->ne[1], w0->ne[1], w0->ne[0], ++ bf16_stream_f32_out_default ?
1 : 0, max_proj_idx - i); ++ } ++ return max_proj_idx - i; // skip norm nodes + intervening views + all owned projections ++ } ++ ++ if (bf16_stream_trace > 0) { ++ static std::atomic<int> bf16_stream_miss_count{0}; ++ const int mc = bf16_stream_miss_count.fetch_add(1, std::memory_order_relaxed); ++ if (mc < bf16_stream_trace) { ++ fprintf(stderr, ++ "[LLAMA_BF16_STREAM] miss seg=%s node=%d n_proj=%d reason=%s miss_node=%d miss_op=%s\n", ++ seg_kind, i, n_proj, miss_reason, miss_node, miss_op); ++ } ++ } ++ } ++ } ++ + // Fused gated RMS norm: RMS norm + weight multiply + SiLU-gated multiply + // (bit-exact). The Qwen3.6 gated-DeltaNet output norm. Default ON; set + // LLAMA_FUSE_GATE_RMSNORM=0 for a clean A/B against the unfused path. +diff --git a/ggml/src/ggml-cuda/norm-bf16.cu b/ggml/src/ggml-cuda/norm-bf16.cu +new file mode 100644 +index 000000000..77ccbf80e +--- /dev/null ++++ b/ggml/src/ggml-cuda/norm-bf16.cu +@@ -0,0 +1,483 @@ ++#include "norm-bf16.cuh" ++ ++#include ++ ++// [P1 bf16-stream] bf16-output variants of the residual-stream norms. Same reduction ++// and FP order as the f32 kernels in norm.cu; only the final store is narrowed. Kept ++// bit-faithful to the f32 norms up to the __float2bfloat16 store so the opt-in stream ++// stays a pure dtype (KL-benign) change, not an algorithmic one. ++ ++// Output-store policy: identity for float, round-to-nearest bf16 for nv_bfloat16.
++template <typename T> struct bf16stream_store; ++template <> struct bf16stream_store<float> { ++ static __device__ __forceinline__ float store(float v) { return v; } ++}; ++template <> struct bf16stream_store<nv_bfloat16> { ++ static __device__ __forceinline__ nv_bfloat16 store(float v) { return __float2bfloat16(v); } ++}; ++ ++// --------------------------------------------------------------------------- ++// plain rms_norm + weight multiply -> Tdst (mirrors rms_norm_f32) ++// --------------------------------------------------------------------------- ++template <int block_size, typename Tdst> ++static __global__ void rms_norm_mul_out(const float * x, ++ Tdst * dst, ++ const int ncols, ++ const int64_t stride_row, ++ const int64_t stride_channel, ++ const int64_t stride_sample, ++ const float eps, ++ const float * mul, ++ const int64_t mul_stride_row, ++ const int64_t mul_stride_channel, ++ const int64_t mul_stride_sample, ++ const uint3 mul_ncols_packed, ++ const uint3 mul_nrows_packed, ++ const uint3 mul_nchannels_packed, ++ const uint3 mul_nsamples_packed) { ++ ggml_cuda_pdl_lc(); ++ const int nrows = gridDim.x; ++ const int nchannels = gridDim.y; ++ ++ const int row = blockIdx.x; ++ const int channel = blockIdx.y; ++ const int sample = blockIdx.z; ++ const int tid = threadIdx.x; ++ ++ x += sample*stride_sample + channel*stride_channel + row*stride_row; ++ dst += ((sample*nchannels + channel)*nrows + row)*ncols; ++ ++ { ++ const uint32_t mul_row = fastmodulo(row, mul_nrows_packed); ++ const uint32_t mul_channel = fastmodulo(channel, mul_nchannels_packed); ++ const uint32_t mul_sample = fastmodulo(sample, mul_nsamples_packed); ++ mul += mul_sample * mul_stride_sample + mul_channel * mul_stride_channel + mul_row * mul_stride_row; ++ } ++ ++ float tmp = 0.0f; ++ ++ ggml_cuda_pdl_sync(); ++ for (int col = tid; col < ncols; col += block_size) { ++ const float xi = x[col]; ++ tmp += xi * xi; ++ } ++ ++ extern __shared__ float s_sum[]; ++ tmp = block_reduce<block_reduce_method::SUM, block_size>(tmp, s_sum); ++ ++ const float mean = tmp / ncols; ++ const float scale =
rsqrtf(mean + eps); ++ ++ for (int col = tid; col < ncols; col += block_size) { ++ const int mul_col = fastmodulo(col, mul_ncols_packed); ++ dst[col] = bf16stream_store<Tdst>::store(scale * x[col] * mul[mul_col]); ++ } ++} ++ ++// --------------------------------------------------------------------------- ++// 0042 residual-add + rms_norm + weight multiply -> f32 h_out + Tdst dst ++// (mirrors rms_norm_pre_add_mul_f32; h_out stays f32 so the next ++// residual add reads the same f32 residual stream) ++// --------------------------------------------------------------------------- ++template <int block_size, typename Tdst> ++static __global__ void rms_norm_pre_add_mul_out(const float * a, ++ const float * b, ++ float * h_out, ++ Tdst * dst, ++ const int ncols, ++ const int64_t stride_row, ++ const int64_t stride_channel, ++ const int64_t stride_sample, ++ const float eps, ++ const float * mul, ++ const int64_t mul_stride_row, ++ const int64_t mul_stride_channel, ++ const int64_t mul_stride_sample, ++ const uint3 mul_ncols_packed, ++ const uint3 mul_nrows_packed, ++ const uint3 mul_nchannels_packed, ++ const uint3 mul_nsamples_packed) { ++ ggml_cuda_pdl_lc(); ++ const int nrows = gridDim.x; ++ const int nchannels = gridDim.y; ++ ++ const int row = blockIdx.x; ++ const int channel = blockIdx.y; ++ const int sample = blockIdx.z; ++ const int tid = threadIdx.x; ++ ++ const int64_t row_offset = sample*stride_sample + channel*stride_channel + row*stride_row; ++ a += row_offset; ++ b += row_offset; ++ h_out += row_offset; ++ dst += ((sample*nchannels + channel)*nrows + row)*ncols; ++ ++ { ++ const uint32_t mul_row = fastmodulo(row, mul_nrows_packed); ++ const uint32_t mul_channel = fastmodulo(channel, mul_nchannels_packed); ++ const uint32_t mul_sample = fastmodulo(sample, mul_nsamples_packed); ++ mul += mul_sample * mul_stride_sample + mul_channel * mul_stride_channel + mul_row * mul_stride_row; ++ } ++ ++ float tmp = 0.0f; ++ ++ ggml_cuda_pdl_sync(); ++ for (int col = tid; col < ncols; col += block_size) { ++
const float hi = a[col] + b[col]; ++ h_out[col] = hi; // publish the f32 residual stream for the next add ++ tmp += hi * hi; ++ } ++ ++ extern __shared__ float s_sum[]; ++ tmp = block_reduce<block_reduce_method::SUM, block_size>(tmp, s_sum); ++ ++ const float mean = tmp / ncols; ++ const float scale = rsqrtf(mean + eps); ++ ++ for (int col = tid; col < ncols; col += block_size) { ++ const float hi = h_out[col]; ++ const int mul_col = fastmodulo(col, mul_ncols_packed); ++ dst[col] = bf16stream_store<Tdst>::store(scale * hi * mul[mul_col]); ++ } ++} ++ ++// --------------------------------------------------------------------------- ++// 0044 gated-DeltaNet output norm scale*x*w*silu(z) -> Tdst (the P0 segment) ++// --------------------------------------------------------------------------- ++template <int block_size, typename Tdst> ++static __global__ void rms_norm_gate_mul_out(const float * x, ++ Tdst * dst, ++ const int ncols, ++ const int64_t stride_row, ++ const int64_t stride_channel, ++ const int64_t stride_sample, ++ const float eps, ++ const float * mul, ++ const int64_t mul_stride_row, ++ const int64_t mul_stride_channel, ++ const int64_t mul_stride_sample, ++ const uint3 mul_ncols_packed, ++ const uint3 mul_nrows_packed, ++ const uint3 mul_nchannels_packed, ++ const uint3 mul_nsamples_packed, ++ const float * gate, ++ const int64_t gate_stride_row, ++ const int64_t gate_stride_channel, ++ const int64_t gate_stride_sample, ++ const uint3 gate_ncols_packed, ++ const uint3 gate_nrows_packed, ++ const uint3 gate_nchannels_packed, ++ const uint3 gate_nsamples_packed) { ++ ggml_cuda_pdl_lc(); ++ const int nrows = gridDim.x; ++ const int nchannels = gridDim.y; ++ ++ const int row = blockIdx.x; ++ const int channel = blockIdx.y; ++ const int sample = blockIdx.z; ++ const int tid = threadIdx.x; ++ ++ x += sample*stride_sample + channel*stride_channel + row*stride_row; ++ dst += ((sample*nchannels + channel)*nrows + row)*ncols; ++ ++ { ++ const uint32_t mul_row = fastmodulo(row, mul_nrows_packed); ++ const uint32_t mul_channel =
fastmodulo(channel, mul_nchannels_packed); ++ const uint32_t mul_sample = fastmodulo(sample, mul_nsamples_packed); ++ mul += mul_sample * mul_stride_sample + mul_channel * mul_stride_channel + mul_row * mul_stride_row; ++ } ++ { ++ const uint32_t gate_row = fastmodulo(row, gate_nrows_packed); ++ const uint32_t gate_channel = fastmodulo(channel, gate_nchannels_packed); ++ const uint32_t gate_sample = fastmodulo(sample, gate_nsamples_packed); ++ gate += gate_sample * gate_stride_sample + gate_channel * gate_stride_channel + gate_row * gate_stride_row; ++ } ++ ++ float tmp = 0.0f; ++ ++ ggml_cuda_pdl_sync(); ++ for (int col = tid; col < ncols; col += block_size) { ++ const float xi = x[col]; ++ tmp += xi * xi; ++ } ++ ++ extern __shared__ float s_sum[]; ++ tmp = block_reduce<block_reduce_method::SUM, block_size>(tmp, s_sum); ++ ++ const float mean = tmp / ncols; ++ const float scale = rsqrtf(mean + eps); ++ ++ for (int col = tid; col < ncols; col += block_size) { ++ const int mul_col = fastmodulo(col, mul_ncols_packed); ++ const int gate_col = fastmodulo(col, gate_ncols_packed); ++ const float zi = gate[gate_col]; ++ const float silu_z = zi / (1.0f + expf(-zi)); ++ dst[col] = bf16stream_store<Tdst>::store(scale * x[col] * mul[mul_col] * silu_z); ++ } ++} ++ ++// =========================================================================== ++// launchers ++// =========================================================================== ++template <typename Tdst> ++static void rms_norm_mul_out_cuda(const float * x, Tdst * dst, ++ const int ncols, const int nrows, const int nchannels, const int nsamples, ++ const int64_t stride_row, const int64_t stride_channel, const int64_t stride_sample, ++ const float * mul, ++ const int64_t mul_stride_row, const int64_t mul_stride_channel, const int64_t mul_stride_sample, ++ const uint32_t mul_ncols, const uint32_t mul_nrows, const uint32_t mul_nchannels, const uint32_t mul_nsamples, ++ const float eps, cudaStream_t stream) { ++ const dim3 blocks_num(nrows, nchannels, nsamples); ++
GGML_ASSERT(mul != nullptr); ++ const uint3 mc = init_fastdiv_values(mul_ncols); ++ const uint3 mr = init_fastdiv_values(mul_nrows); ++ const uint3 mch = init_fastdiv_values(mul_nchannels); ++ const uint3 ms = init_fastdiv_values(mul_nsamples); ++ if (ncols < 1024) { ++ const dim3 block_dims(256, 1, 1); ++ const ggml_cuda_kernel_launch_params lp{blocks_num, block_dims, block_dims.x > WARP_SIZE ? 32 * sizeof(float) : 0, stream}; ++ ggml_cuda_kernel_launch(rms_norm_mul_out<256, Tdst>, lp, ++ x, dst, ncols, stride_row, stride_channel, stride_sample, eps, ++ mul, mul_stride_row, mul_stride_channel, mul_stride_sample, mc, mr, mch, ms); ++ } else { ++ const dim3 block_dims(1024, 1, 1); ++ const ggml_cuda_kernel_launch_params lp{blocks_num, block_dims, block_dims.x > WARP_SIZE ? 32 * sizeof(float) : 0, stream}; ++ ggml_cuda_kernel_launch(rms_norm_mul_out<1024, Tdst>, lp, ++ x, dst, ncols, stride_row, stride_channel, stride_sample, eps, ++ mul, mul_stride_row, mul_stride_channel, mul_stride_sample, mc, mr, mch, ms); ++ } ++} ++ ++template <typename Tdst> ++static void rms_norm_pre_add_mul_out_cuda(const float * a, const float * b, float * h_out, Tdst * dst, ++ const int ncols, const int nrows, const int nchannels, const int nsamples, ++ const int64_t stride_row, const int64_t stride_channel, const int64_t stride_sample, ++ const float * mul, ++ const int64_t mul_stride_row, const int64_t mul_stride_channel, const int64_t mul_stride_sample, ++ const uint32_t mul_ncols, const uint32_t mul_nrows, const uint32_t mul_nchannels, const uint32_t mul_nsamples, ++ const float eps, cudaStream_t stream) { ++ const dim3 blocks_num(nrows, nchannels, nsamples); ++ GGML_ASSERT(mul != nullptr); ++ const uint3 mc = init_fastdiv_values(mul_ncols); ++ const uint3 mr = init_fastdiv_values(mul_nrows); ++ const uint3 mch = init_fastdiv_values(mul_nchannels); ++ const uint3 ms = init_fastdiv_values(mul_nsamples); ++ if (ncols < 1024) { ++ const dim3 block_dims(256, 1, 1); ++ const ggml_cuda_kernel_launch_params
lp{blocks_num, block_dims, block_dims.x > WARP_SIZE ? 32 * sizeof(float) : 0, stream}; ++ ggml_cuda_kernel_launch(rms_norm_pre_add_mul_out<256, Tdst>, lp, ++ a, b, h_out, dst, ncols, stride_row, stride_channel, stride_sample, eps, ++ mul, mul_stride_row, mul_stride_channel, mul_stride_sample, mc, mr, mch, ms); ++ } else { ++ const dim3 block_dims(1024, 1, 1); ++ const ggml_cuda_kernel_launch_params lp{blocks_num, block_dims, block_dims.x > WARP_SIZE ? 32 * sizeof(float) : 0, stream}; ++ ggml_cuda_kernel_launch(rms_norm_pre_add_mul_out<1024, Tdst>, lp, ++ a, b, h_out, dst, ncols, stride_row, stride_channel, stride_sample, eps, ++ mul, mul_stride_row, mul_stride_channel, mul_stride_sample, mc, mr, mch, ms); ++ } ++} ++ ++template <typename Tdst> ++static void rms_norm_gate_mul_out_cuda(const float * x, Tdst * dst, ++ const int ncols, const int nrows, const int nchannels, const int nsamples, ++ const int64_t stride_row, const int64_t stride_channel, const int64_t stride_sample, ++ const float * mul, ++ const int64_t mul_stride_row, const int64_t mul_stride_channel, const int64_t mul_stride_sample, ++ const uint32_t mul_ncols, const uint32_t mul_nrows, const uint32_t mul_nchannels, const uint32_t mul_nsamples, ++ const float * gate, ++ const int64_t gate_stride_row, const int64_t gate_stride_channel, const int64_t gate_stride_sample, ++ const uint32_t gate_ncols, const uint32_t gate_nrows, const uint32_t gate_nchannels, const uint32_t gate_nsamples, ++ const float eps, cudaStream_t stream) { ++ const dim3 blocks_num(nrows, nchannels, nsamples); ++ GGML_ASSERT(mul != nullptr); ++ GGML_ASSERT(gate != nullptr); ++ const uint3 mc = init_fastdiv_values(mul_ncols); ++ const uint3 mr = init_fastdiv_values(mul_nrows); ++ const uint3 mch = init_fastdiv_values(mul_nchannels); ++ const uint3 ms = init_fastdiv_values(mul_nsamples); ++ const uint3 gc = init_fastdiv_values(gate_ncols); ++ const uint3 gr = init_fastdiv_values(gate_nrows); ++ const uint3 gch = init_fastdiv_values(gate_nchannels); ++
const uint3 gs = init_fastdiv_values(gate_nsamples); ++ if (ncols < 1024) { ++ const dim3 block_dims(256, 1, 1); ++ const ggml_cuda_kernel_launch_params lp{blocks_num, block_dims, block_dims.x > WARP_SIZE ? 32 * sizeof(float) : 0, stream}; ++ ggml_cuda_kernel_launch(rms_norm_gate_mul_out<256, Tdst>, lp, ++ x, dst, ncols, stride_row, stride_channel, stride_sample, eps, ++ mul, mul_stride_row, mul_stride_channel, mul_stride_sample, mc, mr, mch, ms, ++ gate, gate_stride_row, gate_stride_channel, gate_stride_sample, gc, gr, gch, gs); ++ } else { ++ const dim3 block_dims(1024, 1, 1); ++ const ggml_cuda_kernel_launch_params lp{blocks_num, block_dims, block_dims.x > WARP_SIZE ? 32 * sizeof(float) : 0, stream}; ++ ggml_cuda_kernel_launch(rms_norm_gate_mul_out<1024, Tdst>, lp, ++ x, dst, ncols, stride_row, stride_channel, stride_sample, eps, ++ mul, mul_stride_row, mul_stride_channel, mul_stride_sample, mc, mr, mch, ms, ++ gate, gate_stride_row, gate_stride_channel, gate_stride_sample, gc, gr, gch, gs); ++ } ++} ++ ++// =========================================================================== ++// host entries ++// =========================================================================== ++void ggml_cuda_rms_norm_mul_bf16out(ggml_backend_cuda_context & ctx, ++ const ggml_tensor * rms_norm_tensor, ++ const ggml_tensor * mul_tensor, ++ void * dst_bf16) { ++ const ggml_tensor * x_src = rms_norm_tensor->src[0]; ++ const ggml_tensor * mul_src = (mul_tensor->src[0] == rms_norm_tensor) ? 
mul_tensor->src[1] : mul_tensor->src[0]; ++ GGML_ASSERT(mul_tensor->src[0] == rms_norm_tensor || mul_tensor->src[1] == rms_norm_tensor); ++ ++ float eps = 0.0f; ++ memcpy(&eps, rms_norm_tensor->op_params, sizeof(float)); ++ GGML_ASSERT(eps >= 0.0f); ++ ++ GGML_ASSERT(x_src->type == GGML_TYPE_F32); ++ GGML_ASSERT(mul_src->type == GGML_TYPE_F32); ++ GGML_ASSERT(rms_norm_tensor->type == GGML_TYPE_F32); ++ GGML_ASSERT(mul_tensor->type == GGML_TYPE_F32); ++ ++ const float * x_d = (const float *) x_src->data; ++ const float * mul_d = (const float *) mul_src->data; ++ nv_bfloat16 * dst_d = (nv_bfloat16 *) dst_bf16; ++ cudaStream_t stream = ctx.stream(); ++ ++ const int64_t ne00 = rms_norm_tensor->ne[0]; ++ const int64_t ne01 = rms_norm_tensor->ne[1]; ++ const int64_t ne02 = rms_norm_tensor->ne[2]; ++ const int64_t ne03 = rms_norm_tensor->ne[3]; ++ ++ const size_t ts0 = ggml_type_size(x_src->type); ++ GGML_ASSERT(x_src->nb[0] == ts0); ++ const int64_t s01 = x_src->nb[1] / ts0; ++ const int64_t s02 = x_src->nb[2] / ts0; ++ const int64_t s03 = x_src->nb[3] / ts0; ++ ++ const size_t ts_mul = ggml_type_size(mul_src->type); ++ GGML_ASSERT(mul_src->nb[0] == ts_mul); ++ const int64_t mul_s01 = mul_src->nb[1] / ts_mul; ++ const int64_t mul_s02 = mul_src->nb[2] / ts_mul; ++ const int64_t mul_s03 = mul_src->nb[3] / ts_mul; ++ ++ rms_norm_mul_out_cuda(x_d, dst_d, ++ ne00, ne01, ne02, ne03, s01, s02, s03, ++ mul_d, mul_s01, mul_s02, mul_s03, ++ mul_src->ne[0], mul_src->ne[1], mul_src->ne[2], mul_src->ne[3], ++ eps, stream); ++} ++ ++void ggml_cuda_rms_norm_pre_add_mul_bf16out(ggml_backend_cuda_context & ctx, ++ const ggml_tensor * add_tensor, ++ const ggml_tensor * rms_norm_tensor, ++ const ggml_tensor * mul_tensor, ++ void * dst_bf16) { ++ GGML_ASSERT(rms_norm_tensor->src[0] == add_tensor); ++ ++ const ggml_tensor * a_src = add_tensor->src[0]; ++ const ggml_tensor * b_src = add_tensor->src[1]; ++ ++ float eps = 0.0f; ++ memcpy(&eps, rms_norm_tensor->op_params, sizeof(float)); ++ 
GGML_ASSERT(eps >= 0.0f); ++ ++ const ggml_tensor * mul_src = (mul_tensor->src[0] == rms_norm_tensor) ? mul_tensor->src[1] : mul_tensor->src[0]; ++ GGML_ASSERT(mul_tensor->src[0] == rms_norm_tensor || mul_tensor->src[1] == rms_norm_tensor); ++ ++ GGML_ASSERT(a_src->type == GGML_TYPE_F32); ++ GGML_ASSERT(b_src->type == GGML_TYPE_F32); ++ GGML_ASSERT(add_tensor->type == GGML_TYPE_F32); ++ GGML_ASSERT(rms_norm_tensor->type == GGML_TYPE_F32); ++ GGML_ASSERT(mul_tensor->type == GGML_TYPE_F32); ++ GGML_ASSERT(ggml_are_same_shape(a_src, b_src)); ++ ++ const float * a_d = (const float *) a_src->data; ++ const float * b_d = (const float *) b_src->data; ++ float * h_d = (float *) add_tensor->data; // f32 residual stream ++ const float * mul_d = (const float *) mul_src->data; ++ nv_bfloat16 * dst_d = (nv_bfloat16 *) dst_bf16; ++ cudaStream_t stream = ctx.stream(); ++ ++ const int64_t ne00 = add_tensor->ne[0]; ++ const int64_t ne01 = add_tensor->ne[1]; ++ const int64_t ne02 = add_tensor->ne[2]; ++ const int64_t ne03 = add_tensor->ne[3]; ++ ++ const size_t ts0 = ggml_type_size(a_src->type); ++ GGML_ASSERT(a_src->nb[0] == ts0 && b_src->nb[0] == ts0); ++ const int64_t s01 = a_src->nb[1] / ts0; ++ const int64_t s02 = a_src->nb[2] / ts0; ++ const int64_t s03 = a_src->nb[3] / ts0; ++ ++ const size_t ts_mul = ggml_type_size(mul_src->type); ++ GGML_ASSERT(mul_src->nb[0] == ts_mul); ++ const int64_t mul_s01 = mul_src->nb[1] / ts_mul; ++ const int64_t mul_s02 = mul_src->nb[2] / ts_mul; ++ const int64_t mul_s03 = mul_src->nb[3] / ts_mul; ++ ++ rms_norm_pre_add_mul_out_cuda(a_d, b_d, h_d, dst_d, ++ ne00, ne01, ne02, ne03, s01, s02, s03, ++ mul_d, mul_s01, mul_s02, mul_s03, ++ mul_src->ne[0], mul_src->ne[1], mul_src->ne[2], mul_src->ne[3], ++ eps, stream); ++} ++ ++void ggml_cuda_rms_norm_gate_mul_bf16out(ggml_backend_cuda_context & ctx, ++ const ggml_tensor * rms_norm_tensor, ++ const ggml_tensor * mul_tensor, ++ const ggml_tensor * silu_tensor, ++ const ggml_tensor * gate_mul_tensor, ++ 
void * dst_bf16) { ++ GGML_ASSERT(mul_tensor->src[0] == rms_norm_tensor || mul_tensor->src[1] == rms_norm_tensor); ++ GGML_ASSERT(gate_mul_tensor->src[0] == silu_tensor || gate_mul_tensor->src[1] == silu_tensor); ++ ++ const ggml_tensor * x_src = rms_norm_tensor->src[0]; ++ const ggml_tensor * w_src = (mul_tensor->src[0] == rms_norm_tensor) ? mul_tensor->src[1] : mul_tensor->src[0]; ++ const ggml_tensor * gate_src = silu_tensor->src[0]; ++ ++ float eps = 0.0f; ++ memcpy(&eps, rms_norm_tensor->op_params, sizeof(float)); ++ GGML_ASSERT(eps >= 0.0f); ++ ++ const float * x_d = (const float *) x_src->data; ++ const float * w_d = (const float *) w_src->data; ++ const float * gate_d = (const float *) gate_src->data; ++ nv_bfloat16 * dst_d = (nv_bfloat16 *) dst_bf16; ++ cudaStream_t stream = ctx.stream(); ++ ++ GGML_ASSERT(x_src->type == GGML_TYPE_F32); ++ GGML_ASSERT(w_src->type == GGML_TYPE_F32); ++ GGML_ASSERT(gate_src->type == GGML_TYPE_F32); ++ GGML_ASSERT(rms_norm_tensor->type == GGML_TYPE_F32); ++ GGML_ASSERT(mul_tensor->type == GGML_TYPE_F32); ++ GGML_ASSERT(silu_tensor->type == GGML_TYPE_F32); ++ ++ const int64_t ne00 = rms_norm_tensor->ne[0]; ++ const int64_t ne01 = rms_norm_tensor->ne[1]; ++ const int64_t ne02 = rms_norm_tensor->ne[2]; ++ const int64_t ne03 = rms_norm_tensor->ne[3]; ++ ++ const size_t ts0 = ggml_type_size(x_src->type); ++ GGML_ASSERT(x_src->nb[0] == ts0); ++ const int64_t s01 = x_src->nb[1] / ts0; ++ const int64_t s02 = x_src->nb[2] / ts0; ++ const int64_t s03 = x_src->nb[3] / ts0; ++ ++ const size_t ts_mul = ggml_type_size(w_src->type); ++ GGML_ASSERT(w_src->nb[0] == ts_mul); ++ const int64_t mul_s01 = w_src->nb[1] / ts_mul; ++ const int64_t mul_s02 = w_src->nb[2] / ts_mul; ++ const int64_t mul_s03 = w_src->nb[3] / ts_mul; ++ ++ const size_t ts_gate = ggml_type_size(gate_src->type); ++ GGML_ASSERT(gate_src->nb[0] == ts_gate); ++ const int64_t gate_s01 = gate_src->nb[1] / ts_gate; ++ const int64_t gate_s02 = gate_src->nb[2] / ts_gate; ++ const 
int64_t gate_s03 = gate_src->nb[3] / ts_gate; ++ ++ rms_norm_gate_mul_out_cuda(x_d, dst_d, ++ ne00, ne01, ne02, ne03, s01, s02, s03, ++ w_d, mul_s01, mul_s02, mul_s03, ++ w_src->ne[0], w_src->ne[1], w_src->ne[2], w_src->ne[3], ++ gate_d, gate_s01, gate_s02, gate_s03, ++ gate_src->ne[0], gate_src->ne[1], gate_src->ne[2], gate_src->ne[3], ++ eps, stream); ++} +diff --git a/ggml/src/ggml-cuda/norm-bf16.cuh b/ggml/src/ggml-cuda/norm-bf16.cuh +new file mode 100644 +index 000000000..3b52757b8 +--- /dev/null ++++ b/ggml/src/ggml-cuda/norm-bf16.cuh +@@ -0,0 +1,37 @@ ++#include "common.cuh" ++ ++// [P1 bf16-stream] bf16-resident execution-pass helpers (default-off, gated by ++// LLAMA_BF16_STREAM at the ggml_cuda_try_fuse call site). Siblings of the plain ++// rms_norm and the 0042/0044 fused norms in norm.cu; these variants write a bf16 ++// output so the consuming projection GEMM reads the activation directly (no f32->bf16 ++// convert_dtype glue). Each kernel is templated on the output dtype (float or ++// nv_bfloat16) so the file carries the full op-variant set the P1 contract names; ++// the live q36 engage path instantiates the bf16 output only. ++// ++// All three keep the same reduction and FP order as their f32 originals; only the ++// final store is narrowed to bf16 via __float2bfloat16, so the opt-in path stays a ++// pure dtype (KL-benign) change, not an algorithmic one. ++ ++// plain rms_norm + weight multiply -> bf16 (attention input norm, GDN input norm) ++void ggml_cuda_rms_norm_mul_bf16out(ggml_backend_cuda_context & ctx, ++ const ggml_tensor * rms_norm_tensor, ++ const ggml_tensor * mul_tensor, ++ void * dst_bf16); ++ ++// 0042 residual-add + rms_norm + weight multiply -> f32 residual (h_out) + bf16 normed ++// (op-set completeness; on q36 the ffn/moe-input norm feeds MMQ experts so the engage ++// path bails, but the kernel + entry keep the op-variant set whole and the sentinel ++// exercises it). 
++void ggml_cuda_rms_norm_pre_add_mul_bf16out(ggml_backend_cuda_context & ctx, ++ const ggml_tensor * add_tensor, ++ const ggml_tensor * rms_norm_tensor, ++ const ggml_tensor * mul_tensor, ++ void * dst_bf16); ++ ++// 0044 gated-DeltaNet output norm scale*x*w*silu(z) -> bf16 (ssm_out; the P0 segment) ++void ggml_cuda_rms_norm_gate_mul_bf16out(ggml_backend_cuda_context & ctx, ++ const ggml_tensor * rms_norm_tensor, ++ const ggml_tensor * mul_tensor, ++ const ggml_tensor * silu_tensor, ++ const ggml_tensor * gate_mul_tensor, ++ void * dst_bf16); +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0054-feat-paged-P1-bf16-stream-bf16-residual-add-rope-op-.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0054-feat-paged-P1-bf16-stream-bf16-residual-add-rope-op-.patch new file mode 100644 index 000000000000..c60f1c8ec57f --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0054-feat-paged-P1-bf16-stream-bf16-residual-add-rope-op-.patch @@ -0,0 +1,107 @@ +From 91373e1b9ab290eb9df63ce26e7cd17da81970fe Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Thu, 2 Jul 2026 16:29:20 +0200 +Subject: [PATCH 54/55] feat(paged): P1 bf16-stream bf16 residual-add + rope + op-variants + +Round out the op-variant set for the bf16-resident stream: bf16 branches in +binbcast.cu (residual add) and bf16 rope instantiations (asserts widened +only). Standing infra; Option-A keeps f32 at segment boundaries so these are +not on the current measured path. Existing f32 paths untouched. 
+ +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + ggml/src/ggml-cuda/binbcast.cu | 10 +++++++++- + ggml/src/ggml-cuda/rope.cu | 31 ++++++++++++++++++++++++++++--- + 2 files changed, 37 insertions(+), 4 deletions(-) + +diff --git a/ggml/src/ggml-cuda/binbcast.cu b/ggml/src/ggml-cuda/binbcast.cu +index 2e38077bf..135becedb 100644 +--- a/ggml/src/ggml-cuda/binbcast.cu ++++ b/ggml/src/ggml-cuda/binbcast.cu +@@ -413,7 +413,7 @@ static void ggml_cuda_op_bin_bcast( + const ggml_tensor * src0, const ggml_tensor * src1, ggml_tensor * dst, + const void * src0_dd, const void * src1_dd, void * dst_dd, cudaStream_t stream) { + +- GGML_ASSERT(src1->type == GGML_TYPE_F32 || src1->type == GGML_TYPE_F16); ++ GGML_ASSERT(src1->type == GGML_TYPE_F32 || src1->type == GGML_TYPE_F16 || src1->type == GGML_TYPE_BF16); + + if (src0->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F32) { + op()(src0, src1, dst, (const float *)src0_dd, (const float *)src1_dd, (float *)dst_dd, stream); +@@ -423,6 +423,14 @@ static void ggml_cuda_op_bin_bcast( + op()(src0, src1, dst, (const half *) src0_dd, (const float *)src1_dd, (half *) dst_dd, stream); + } else if (src0->type == GGML_TYPE_F16 && dst->type == GGML_TYPE_F32) { + op()(src0, src1, dst, (const half *) src0_dd, (const float *)src1_dd, (float *)dst_dd, stream); ++ // [P1 bf16-stream] bf16 residual-add variants, so a bf16-resident segment can keep ++ // its residual add in bf16 (half the memory traffic) rather than widening to f32. 
++ } else if (src0->type == GGML_TYPE_BF16 && src1->type == GGML_TYPE_BF16 && dst->type == GGML_TYPE_BF16) { ++ op()(src0, src1, dst, (const nv_bfloat16 *) src0_dd, (const nv_bfloat16 *) src1_dd, (nv_bfloat16 *) dst_dd, stream); ++ } else if (src0->type == GGML_TYPE_BF16 && src1->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_BF16) { ++ op()(src0, src1, dst, (const nv_bfloat16 *) src0_dd, (const float *) src1_dd, (nv_bfloat16 *) dst_dd, stream); ++ } else if (src0->type == GGML_TYPE_BF16 && dst->type == GGML_TYPE_F32) { ++ op()(src0, src1, dst, (const nv_bfloat16 *) src0_dd, (const float *) src1_dd, (float *) dst_dd, stream); + } else { + fprintf(stderr, "%s: unsupported types: dst: %s, src0: %s, src1: %s\n", __func__, + ggml_type_name(dst->type), ggml_type_name(src0->type), ggml_type_name(src1->type)); +diff --git a/ggml/src/ggml-cuda/rope.cu b/ggml/src/ggml-cuda/rope.cu +index e20a5cb6b..923032327 100644 +--- a/ggml/src/ggml-cuda/rope.cu ++++ b/ggml/src/ggml-cuda/rope.cu +@@ -528,11 +528,16 @@ void ggml_cuda_op_rope_impl(ggml_backend_cuda_context & ctx, + } + cudaStream_t stream = ctx.stream(); + +- GGML_ASSERT(src0->type == GGML_TYPE_F32 || src0->type == GGML_TYPE_F16); +- GGML_ASSERT( dst->type == GGML_TYPE_F32 || dst->type == GGML_TYPE_F16); ++ // [P1 bf16-stream] bf16 is accepted so a bf16-resident attention segment can rope ++ // its Q/K in bf16 (the norm/neox kernels are float-internal, so the bf16 arms just ++ // add T/D = nv_bfloat16 instantiations below). 
++ GGML_ASSERT(src0->type == GGML_TYPE_F32 || src0->type == GGML_TYPE_F16 || src0->type == GGML_TYPE_BF16); ++ GGML_ASSERT( dst->type == GGML_TYPE_F32 || dst->type == GGML_TYPE_F16 || dst->type == GGML_TYPE_BF16); + // When not fused, src0 and dst types must match + // When fused (ROPE+VIEW+SET_ROWS), src0 may be F32 and dst may be F16 +- GGML_ASSERT(src0->type == dst->type || (src0->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F16)); ++ GGML_ASSERT(src0->type == dst->type || ++ (src0->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_F16) || ++ (src0->type == GGML_TYPE_F32 && dst->type == GGML_TYPE_BF16)); + + const int64_t ne00 = src0->ne[0]; // head dims + const int64_t ne01 = src0->ne[1]; // num heads +@@ -610,6 +615,16 @@ void ggml_cuda_op_rope_impl(ggml_backend_cuda_context & ctx, + s03, s1, s2, s3, n_dims, nr, pos, freq_scale, freq_base, + ext_factor, attn_factor, corr_dims, freq_factors, row_indices, + set_rows_stride, stream); ++ } else if (src0->type == GGML_TYPE_BF16 && dst_type == GGML_TYPE_BF16) { ++ rope_neox_cuda((const nv_bfloat16 *) src0_d, (nv_bfloat16 *) dst_d, ne00, ne01, ne02, s01, s02, ++ s03, s1, s2, s3, n_dims, nr, pos, freq_scale, freq_base, ++ ext_factor, attn_factor, corr_dims, freq_factors, row_indices, ++ set_rows_stride, stream); ++ } else if (src0->type == GGML_TYPE_F32 && dst_type == GGML_TYPE_BF16) { ++ rope_neox_cuda((const float *) src0_d, (nv_bfloat16 *) dst_d, ne00, ne01, ne02, s01, s02, ++ s03, s1, s2, s3, n_dims, nr, pos, freq_scale, freq_base, ++ ext_factor, attn_factor, corr_dims, freq_factors, row_indices, ++ set_rows_stride, stream); + } else { + GGML_ABORT("fatal error"); + } +@@ -653,6 +668,16 @@ void ggml_cuda_op_rope_impl(ggml_backend_cuda_context & ctx, + s03, s1, s2, s3, n_dims, nr, pos, freq_scale, freq_base, + ext_factor, attn_factor, corr_dims, freq_factors, row_indices, + set_rows_stride, stream); ++ } else if (src0->type == GGML_TYPE_BF16 && dst_type == GGML_TYPE_BF16) { ++ rope_norm_cuda((const nv_bfloat16 *) 
src0_d, (nv_bfloat16 *) dst_d, ne00, ne01, ne02, s01, s02, ++ s03, s1, s2, s3, n_dims, nr, pos, freq_scale, freq_base, ++ ext_factor, attn_factor, corr_dims, freq_factors, row_indices, ++ set_rows_stride, stream); ++ } else if (src0->type == GGML_TYPE_F32 && dst_type == GGML_TYPE_BF16) { ++ rope_norm_cuda((const float *) src0_d, (nv_bfloat16 *) dst_d, ne00, ne01, ne02, s01, s02, ++ s03, s1, s2, s3, n_dims, nr, pos, freq_scale, freq_base, ++ ext_factor, attn_factor, corr_dims, freq_factors, row_indices, ++ set_rows_stride, stream); + } else { + GGML_ABORT("fatal error"); + } +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/patches/paged/0055-test-paged-P1-bf16-stream-BF16_STREAM_SEGMENT-sentin.patch b/backend/cpp/llama-cpp-localai-paged/patches/paged/0055-test-paged-P1-bf16-stream-BF16_STREAM_SEGMENT-sentin.patch new file mode 100644 index 000000000000..133365eb11c8 --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/patches/paged/0055-test-paged-P1-bf16-stream-BF16_STREAM_SEGMENT-sentin.patch @@ -0,0 +1,116 @@ +From 653bb2f3d5914872010eac29287863bff67de943 Mon Sep 17 00:00:00 2001 +From: Ettore Di Giacinto +Date: Thu, 2 Jul 2026 16:29:20 +0200 +Subject: [PATCH 55/55] test(paged): P1 bf16-stream BF16_STREAM_SEGMENT + sentinel + +Whole-graph test-backend-ops case (MOE_SWIGLU_DOWN style) that engages both +segment kinds (plain rms_norm multi-consumer and the 0044 gate_mul ssm_out) +under LLAMA_BF16_STREAM. Green default and opt-in (4/4). 
+ +Assisted-by: Claude:opus-4.8 [Claude Code] +Signed-off-by: Ettore Di Giacinto +--- + tests/test-backend-ops.cpp | 79 ++++++++++++++++++++++++++++++++++++++ + 1 file changed, 79 insertions(+) + +diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp +index 8c41ae56a..d0c56521c 100644 +--- a/tests/test-backend-ops.cpp ++++ b/tests/test-backend-ops.cpp +@@ -4532,6 +4532,78 @@ struct test_moe_swiglu_down : public test_case { + } + }; + ++// [P1 bf16-stream] Standing coverage for the bf16-native residual-stream segment ++// executor (LLAMA_BF16_STREAM). Builds a residual-stream norm feeding a large-M ++// cuBLAS-bf16 projection (the exact shape ggml_cuda_try_fuse owns): kind 1 = plain ++// rms_norm+mul -> proj (attention / GDN input norm), kind 2 = 0044 gated-DeltaNet ++// output norm silu(z)*rms(x)*w -> proj (ssm_out). Whole-graph so the fusion pass runs; ++// with the env off it validates the f32 path, and with LLAMA_BF16_STREAM=1 it validates ++// the bf16-activation path (the projection reads the bf16 norm output directly). The ++// weight is BF16 so the projection deterministically routes to the cuBLAS-bf16 branch. ++struct test_bf16_stream_segment : public test_case { ++ const int kind; // 1 = plain rms+mul, 2 = gated-DeltaNet output norm ++ const int64_t n_embd; // K (contraction) ++ const int64_t n_out; // N (projection rows) ++ const int64_t n_tokens; // M (>= 128 so the executor engages) ++ ++ std::string vars() override { ++ return VARS_TO_STR4(kind, n_embd, n_out, n_tokens); ++ } ++ ++ double max_nmse_err() override { ++ // bf16 activation rounding on both the default and opt-in paths (the BF16 weight ++ // GEMM already rounds the activation); generous but tight enough to catch a bug. 
++ return 1e-2; ++ } ++ ++ uint64_t op_flops(ggml_tensor * t) override { ++ GGML_UNUSED(t); ++ return 2 * n_embd * n_out * n_tokens; ++ } ++ ++ test_bf16_stream_segment(int kind = 1, int64_t n_embd = 4096, int64_t n_out = 2048, int64_t n_tokens = 256) ++ : kind(kind), n_embd(n_embd), n_out(n_out), n_tokens(n_tokens) {} ++ ++ ggml_tensor * build_graph(ggml_context * ctx) override { ++ const float eps = 1e-5f; ++ ++ ggml_tensor * x = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_tokens); ++ ggml_set_name(x, "x"); ++ ggml_tensor * w = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_embd); ++ ggml_set_name(w, "rms_w"); ++ ggml_tensor * proj_w = ggml_new_tensor_2d(ctx, GGML_TYPE_BF16, n_embd, n_out); ++ ggml_set_name(proj_w, "proj_w"); ++ ++ ggml_tensor * rms = ggml_rms_norm(ctx, x, eps); ++ ggml_set_name(rms, "rms"); ++ ggml_tensor * mul = ggml_mul(ctx, rms, w); ++ ggml_set_name(mul, "rms_mul"); ++ ++ ggml_tensor * norm_out = mul; ++ if (kind == 2) { ++ ggml_tensor * z = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_embd, n_tokens); ++ ggml_set_name(z, "gate_z"); ++ ggml_tensor * silu = ggml_silu(ctx, z); ++ ggml_set_name(silu, "silu"); ++ // silu as src0 so the topological order is {SILU, RMS_NORM, MUL, MUL}, ++ // matching the fusion-pass pattern (and the real q36 gated-output norm). ++ norm_out = ggml_mul(ctx, silu, mul); ++ ggml_set_name(norm_out, "gate_mul"); ++ } ++ ++ ggml_tensor * out = ggml_mul_mat(ctx, proj_w, norm_out); ++ ggml_set_name(out, "proj"); ++ return out; ++ } ++ ++ bool run_whole_graph() override { return true; } ++ ++ std::string op_desc(ggml_tensor * t) override { ++ GGML_UNUSED(t); ++ return "BF16_STREAM_SEGMENT"; ++ } ++}; ++ + // MoE down projection -> router-weight multiply -> rank-ordered expert add. 
+ struct test_moe_weighted_combine : public test_case { + const ggml_type type_a; +@@ -9043,6 +9115,13 @@ static std::vector> make_test_cases_eval() { + test_cases.emplace_back(new test_moe_swiglu_down(GGML_TYPE_NVFP4, 128, 8, 768, n, 2048)); + } + ++ // [P1 bf16-stream] bf16-native residual-stream segment executor coverage. Small-M ++ // (bails, f32 path) + large-M (engages under LLAMA_BF16_STREAM=1) for both norm kinds. ++ for (int kind : {1, 2}) { ++ test_cases.emplace_back(new test_bf16_stream_segment(kind, 4096, 2048, 64)); ++ test_cases.emplace_back(new test_bf16_stream_segment(kind, 4096, 2048, 256)); ++ } ++ + // [paged Phase 7] MoE down projection -> router-weight multiply -> rank-ordered + // expert add gate for the weighted-combine fusion candidate. + test_cases.emplace_back(new test_moe_weighted_combine(GGML_TYPE_F32, 8, 2, 32, 8, 64)); +-- +2.43.0 + diff --git a/backend/cpp/llama-cpp-localai-paged/run.sh b/backend/cpp/llama-cpp-localai-paged/run.sh new file mode 100755 index 000000000000..93252ff13f7a --- /dev/null +++ b/backend/cpp/llama-cpp-localai-paged/run.sh @@ -0,0 +1,51 @@ +#!/bin/bash +set -ex + +# Get the absolute current dir where the script is located +CURDIR=$(dirname "$(realpath $0)") + +cd / + +echo "CPU info:" +grep -e "model\sname" /proc/cpuinfo | head -1 +grep -e "flags" /proc/cpuinfo | head -1 + +BINARY=llama-cpp-localai-paged-fallback + +# x86/arm64 ship a single llama-cpp-localai-paged-cpu-all built with ggml +# CPU_ALL_VARIANTS: ggml's backend registry dlopens the best libggml-cpu-*.so for +# this host, so no shell-side probing. ROCm ships only the fallback, so fall back +# to it when cpu-all is absent. 
+if [ -e $CURDIR/llama-cpp-localai-paged-cpu-all ]; then + BINARY=llama-cpp-localai-paged-cpu-all +fi + +if [ -n "$LLAMACPP_GRPC_SERVERS" ]; then + if [ -e $CURDIR/llama-cpp-localai-paged-grpc ]; then + BINARY=llama-cpp-localai-paged-grpc + fi +fi + +# Extend ld library path with the dir where this script is located/lib +if [ "$(uname)" == "Darwin" ]; then + export DYLD_LIBRARY_PATH=$CURDIR/lib:$DYLD_LIBRARY_PATH +else + export LD_LIBRARY_PATH=$CURDIR/lib:$LD_LIBRARY_PATH + # Tell rocBLAS where to find TensileLibrary data (GPU kernel tuning files) + if [ -d "$CURDIR/lib/rocblas/library" ]; then + export ROCBLAS_TENSILE_LIBPATH=$CURDIR/lib/rocblas/library + fi +fi + +# If there is a lib/ld.so, use it +if [ -f $CURDIR/lib/ld.so ]; then + echo "Using lib/ld.so" + echo "Using binary: $BINARY" + exec $CURDIR/lib/ld.so $CURDIR/$BINARY "$@" +fi + +echo "Using binary: $BINARY" +exec $CURDIR/$BINARY "$@" + +# We should never reach this point, however just in case we do, run fallback +exec $CURDIR/llama-cpp-localai-paged-fallback "$@" diff --git a/backend/cpp/llama-cpp/Makefile b/backend/cpp/llama-cpp/Makefile index f542ac3b43c0..cefaa95e2b4e 100644 --- a/backend/cpp/llama-cpp/Makefile +++ b/backend/cpp/llama-cpp/Makefile @@ -1,4 +1,9 @@ +# This pin is auto-bumped nightly by .github/workflows/bump_deps.yaml (the stock +# llama-cpp backend is patch-free, so a naive bump is safe). The paged backend +# (backend/cpp/llama-cpp-localai-paged) does NOT inherit this pin: it owns its +# own LLAMA_VERSION because its vendored patch series would break on a naive +# bump and is advanced only by the manual PIN_SYNC process. 
LLAMA_VERSION?=0ed235ea2c17a19fc8238668653946721ed136fd LLAMA_REPO?=https://github.com/ggerganov/llama.cpp @@ -169,7 +174,12 @@ llama.cpp: git remote add origin $(LLAMA_REPO) && \ git fetch --all --tags && \ git checkout -b build $(LLAMA_VERSION) && \ - git submodule update --init --recursive --depth 1 --single-branch + git submodule update --init --recursive --depth 1 --single-branch && \ + for p in $(CURRENT_MAKEFILE_DIR)patches/0*.patch; do \ + [ -e "$$p" ] || continue; \ + echo "applying llama.cpp patch: $$p"; \ + git apply --verbose "$$p" || { echo "patch failed: $$p"; exit 1; }; \ + done llama.cpp/tools/grpc-server: llama.cpp mkdir -p llama.cpp/tools/grpc-server diff --git a/backend/cpp/llama-cpp/grpc-server.cpp b/backend/cpp/llama-cpp/grpc-server.cpp index a02d461f46dd..d9be2310992f 100644 --- a/backend/cpp/llama-cpp/grpc-server.cpp +++ b/backend/cpp/llama-cpp/grpc-server.cpp @@ -763,6 +763,97 @@ static void params_parse(server_context& /*ctx_server*/, const backend::ModelOpt } else if (optval_str == "false" || optval_str == "0" || optval_str == "no" || optval_str == "off" || optval_str == "disabled") { params.kv_unified = false; } + // --- paged KV cache (experimental, off by default) --- + // Enables the on-demand paged KV-cache engine (vendored PagedKVManager + // + paged placement/gather/alloc seams). The engine is gated inside + // llama.cpp by the LLAMA_KV_PAGED env var, evaluated once at first use; + // here we expose it as a per-server model option instead of forcing the + // operator to export a process-wide env. When enabled we set the env + // BEFORE the model/context is created (later in this handler), so the + // engine latches on. When the option is absent we touch nothing, so an + // externally exported LLAMA_KV_PAGED still works as an escape hatch. 
+ // Note: the engine's env check is process-wide and latches on first + // use, so enabling it for one model enables it for the worker process; + // LocalAI runs one model per llama.cpp worker, so this maps cleanly to + // per-server configuration. `kv_paged_debug` turns on the per-slot + // [paged-alloc]/free trace (LLAMA_KV_PAGED_DEBUG). + // + // The continuous-batching serving loop (update_slots) drives paged KV + // transparently through the existing kv-cache seams: each slot's + // sequence allocates paged blocks on arrival (find_slot placement) and + // returns them on slot release (the seq_rm free seam). This is + // token-identical to stock under both the unified and per-sequence + // caches. The per-slot allocate/free capacity benefit, however, only + // materialises with a per-sequence cache, since paged block ownership + // is keyed by stream and the unified cache collapses every slot onto a + // single stream. Operators who want that benefit should pair this with + // `kv_unified:false`; we do NOT flip kv_unified here, to keep the + // default serving behaviour (and the idle-slot prompt cache) unchanged. + } else if (!strcmp(optname, "kv_paged") || !strcmp(optname, "paged_kv") || !strcmp(optname, "paged_attention")) { + if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") { + setenv("LLAMA_KV_PAGED", "1", 1); + } + } else if (!strcmp(optname, "kv_paged_debug") || !strcmp(optname, "paged_kv_debug")) { + if (optval_str == "true" || optval_str == "1" || optval_str == "yes" || optval_str == "on" || optval_str == "enabled") { + setenv("LLAMA_KV_PAGED_DEBUG", "1", 1); + } + // --- chunked-prefill QoS budget (experimental, off by default) --- + // Caps the number of prompt tokens any single slot may prefill per + // update_slots iteration, so a large prompt cannot monopolise the batch + // and freeze the in-flight decoders. 
The serving loop reads this budget + // from the LLAMA_PREFILL_BUDGET env var (set BEFORE context init, like + // kv_paged above) and splits oversized prompts across iterations, + // interleaving decode steps for the other slots. A 6k-token prefill that + // stalled 8 decoders ~3.4s drops to ~780ms at budget=512 (4.8x stall + // cut) with zero TTFT cost and no steady-state regression. Unset or a + // non-positive value leaves the env untouched, so the stock unbounded + // prefill behaviour is preserved (an externally exported + // LLAMA_PREFILL_BUDGET still works as an escape hatch). + } else if (!strcmp(optname, "max_prefill_tokens") || !strcmp(optname, "mpt") || !strcmp(optname, "prefill_budget")) { + if (optval != NULL) { + try { + int budget = std::stoi(optval_str); + if (budget > 0) { + setenv("LLAMA_PREFILL_BUDGET", std::to_string(budget).c_str(), 1); + } + } catch (const std::exception& e) { + // If conversion fails, leave the budget unset (stock behaviour) + } + } + // --- dynamic decode-first prefill budget (patch 0016, continuous-batch P1) --- + // Supersedes max_prefill_tokens (the static patch-0013 cap) with the dynamic + // T - D budget read by update_slots(): a single total per-step token budget T + // (max_batch_tokens / mbt, the vLLM max_num_batched_tokens analogue) of which + // decode claims its live load D first and prefill gets the leftover, plus an + // optional per-slot prompt-chunk cap (prefill_cap, the + // long_prefill_token_threshold analogue). Both are set BEFORE context init, like kv_paged / + // max_prefill_tokens above. Unset leaves the env untouched, so the engine stays + // byte-identical to stock (an externally exported LLAMA_MAX_BATCH_TOKENS / + // LLAMA_PREFILL_CAP still works as an escape hatch). When max_batch_tokens is set + // it takes precedence over max_prefill_tokens: the engine honours the legacy + // LLAMA_PREFILL_BUDGET only when the dynamic knob is unset. 
+ } else if (!strcmp(optname, "max_batch_tokens") || !strcmp(optname, "mbt")) { + if (optval != NULL) { + try { + int mbt = std::stoi(optval_str); + if (mbt > 0) { + setenv("LLAMA_MAX_BATCH_TOKENS", std::to_string(mbt).c_str(), 1); + } + } catch (const std::exception& e) { + // If conversion fails, leave the budget unset (stock behaviour) + } + } + } else if (!strcmp(optname, "prefill_cap")) { + if (optval != NULL) { + try { + int cap = std::stoi(optval_str); + if (cap > 0) { + setenv("LLAMA_PREFILL_CAP", std::to_string(cap).c_str(), 1); + } + } catch (const std::exception& e) { + // If conversion fails, leave the per-slot cap unset (engine default) + } + } } else if (!strcmp(optname, "n_ctx_checkpoints") || !strcmp(optname, "ctx_checkpoints")) { if (optval != NULL) { try { diff --git a/backend/cpp/llama-cpp/prepare.sh b/backend/cpp/llama-cpp/prepare.sh index 4da45ea9d8df..b55e89f0f7dc 100644 --- a/backend/cpp/llama-cpp/prepare.sh +++ b/backend/cpp/llama-cpp/prepare.sh @@ -2,12 +2,18 @@ ## Patches -## Apply patches from the `patches` directory +## Apply the base `patches/` series (top-level *.patch only; *.md/dirs skipped). +## The stock llama-cpp backend is patch-free by default, so this normally does +## nothing. The Makefile `llama.cpp` target already `git apply`s any base patch +## at checkout, so each apply here is `-N` (skip already-applied): re-applying a +## git-format patch with `patch` would fuzzily duplicate hunks. This block only +## does real work if prepare.sh is run against an unpatched checkout. 
if [ -d "patches" ]; then - for patch in $(ls patches); do + for patch in patches/*.patch; do + [ -e "$patch" ] || continue echo "Applying patch $patch" - patch -d llama.cpp/ -p1 < patches/$patch - done + patch -d llama.cpp/ -p1 -N -r - < "$patch" || true + done fi set -e diff --git a/backend/index.yaml b/backend/index.yaml index 1eb02e1ba117..64f8e3788030 100644 --- a/backend/index.yaml +++ b/backend/index.yaml @@ -72,6 +72,37 @@ nvidia-cuda-12: "cuda12-turboquant" nvidia-l4t-cuda-12: "nvidia-l4t-arm64-turboquant" nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-turboquant" +- &llamacpplocalaipaged + name: "llama-cpp-localai-paged" + alias: "llama-cpp-localai-paged" + license: mit + icon: https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png + description: | + LocalAI's paged-attention llama.cpp variant: on-demand paged KV cache plus a + decode-first prefill budget. The SAME upstream llama.cpp grpc-server as the + stock llama-cpp backend, with the LocalAI paged patch series applied + (vendored in this backend). Tuned for NVFP4 dense / MoE on Blackwell / GB10. Reuses the + llama-cpp gRPC server sources; the paged engine is gated at runtime by the + paged_kv / max_batch_tokens model options. + urls: + - https://github.com/ggerganov/llama.cpp + tags: + - text-to-text + - LLM + - GPU + - CUDA + - paged-attention + - nvfp4 + # CUDA-only: the paged patchset's wins (GDN fusions, NVFP4 FP4-MMA) are + # CUDA/Blackwell-specific; off-CUDA they gate off and the backend is + # neutral-to-negative, so non-CUDA users should use the stock llama-cpp + # backend. default points at cuda13 (mirrors faster-qwen3-tts) so the gallery + # entries always resolve to a CUDA variant. 
+ capabilities: + default: "cuda13-llama-cpp-localai-paged" + nvidia: "cuda13-llama-cpp-localai-paged" + nvidia-cuda-13: "cuda13-llama-cpp-localai-paged" + nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-llama-cpp-localai-paged" - &ds4 name: "ds4" alias: "ds4" @@ -1710,6 +1741,13 @@ nvidia-cuda-12: "cuda12-turboquant-development" nvidia-l4t-cuda-12: "nvidia-l4t-arm64-turboquant-development" nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-turboquant-development" +- !!merge <<: *llamacpplocalaipaged + name: "llama-cpp-localai-paged-development" + capabilities: + default: "cuda13-llama-cpp-localai-paged-development" + nvidia: "cuda13-llama-cpp-localai-paged-development" + nvidia-cuda-13: "cuda13-llama-cpp-localai-paged-development" + nvidia-l4t-cuda-13: "cuda13-nvidia-l4t-arm64-llama-cpp-localai-paged-development" - !!merge <<: *ds4 name: "ds4-development" capabilities: @@ -2378,6 +2416,27 @@ uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-turboquant" mirrors: - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-turboquant +## llama-cpp-localai-paged (CUDA-only; see backend/cpp/llama-cpp-localai-paged/README.md section 4c) +- !!merge <<: *llamacpplocalaipaged + name: "cuda13-llama-cpp-localai-paged" + uri: "quay.io/go-skynet/local-ai-backends:latest-gpu-nvidia-cuda-13-llama-cpp-localai-paged" + mirrors: + - localai/localai-backends:latest-gpu-nvidia-cuda-13-llama-cpp-localai-paged +- !!merge <<: *llamacpplocalaipaged + name: "cuda13-llama-cpp-localai-paged-development" + uri: "quay.io/go-skynet/local-ai-backends:master-gpu-nvidia-cuda-13-llama-cpp-localai-paged" + mirrors: + - localai/localai-backends:master-gpu-nvidia-cuda-13-llama-cpp-localai-paged +- !!merge <<: *llamacpplocalaipaged + name: "cuda13-nvidia-l4t-arm64-llama-cpp-localai-paged" + uri: "quay.io/go-skynet/local-ai-backends:latest-nvidia-l4t-cuda-13-arm64-llama-cpp-localai-paged" + mirrors: + - localai/localai-backends:latest-nvidia-l4t-cuda-13-arm64-llama-cpp-localai-paged +- 
!!merge <<: *llamacpplocalaipaged + name: "cuda13-nvidia-l4t-arm64-llama-cpp-localai-paged-development" + uri: "quay.io/go-skynet/local-ai-backends:master-nvidia-l4t-cuda-13-arm64-llama-cpp-localai-paged" + mirrors: + - localai/localai-backends:master-nvidia-l4t-cuda-13-arm64-llama-cpp-localai-paged ## ds4 - !!merge <<: *ds4 name: "cpu-ds4" diff --git a/core/backend/hardware_defaults.go b/core/backend/hardware_defaults.go new file mode 100644 index 000000000000..4c915d69a04d --- /dev/null +++ b/core/backend/hardware_defaults.go @@ -0,0 +1,43 @@ +package backend + +// Hardware-specific backend defaults. +// +// This file centralizes tuning that depends on the *detected hardware* rather +// than on the model config. The model config (explicit `batch:`, `context_size:` +// …) always takes precedence; these helpers only fill values the user left +// unset, so behavior is unchanged unless the matching hardware is present. +// +// Placement note: this runs in the process that builds the gRPC ModelOptions +// sent to every backend (including the C++ llama.cpp grpc-server), so it is the +// one common point that covers all backends. For distributed setups where the +// backend runs on a different host than the orchestrator, worker-side detection +// (e.g. the C++ backend reading cudaGetDeviceProperties) would be more precise; +// this single-host default is the pragmatic common case. + +import ( + "github.com/mudler/LocalAI/pkg/xsysinfo" + "github.com/mudler/xlog" +) + +// BlackwellBatchSize is the physical batch (n_batch/n_ubatch) default on NVIDIA +// Blackwell consumer GPUs (sm_120/121, incl. GB10 / DGX Spark). A larger +// physical batch materially lifts MoE prefill throughput there (per-expert GEMM +// tiles fill better); measured on a GB10 with Qwen3-30B-A3B to lift the prefill +// ceiling ~+10-15% and saturate around 2048. Only applied when the model config +// does not set an explicit `batch:`. 
+const BlackwellBatchSize = 2048 + +// detectBlackwellGPU is a seam over xsysinfo.IsNVIDIABlackwell so tests can +// force the hardware branch deterministically. +var detectBlackwellGPU = xsysinfo.IsNVIDIABlackwell + +// hardwareDefaultBatchSize returns the physical-batch default for the detected +// hardware, falling back to the given value when no hardware-specific tuning +// applies. Used by EffectiveBatchSize only when the config leaves batch unset. +func hardwareDefaultBatchSize(fallback int) int { + if detectBlackwellGPU() { + xlog.Debug("Blackwell GPU detected; defaulting physical batch higher for MoE prefill", "batch", BlackwellBatchSize) + return BlackwellBatchSize + } + return fallback +} diff --git a/core/backend/hardware_defaults_internal_test.go b/core/backend/hardware_defaults_internal_test.go new file mode 100644 index 000000000000..df621cded4dd --- /dev/null +++ b/core/backend/hardware_defaults_internal_test.go @@ -0,0 +1,50 @@ +package backend + +import ( + "github.com/mudler/LocalAI/core/config" + . "github.com/onsi/ginkgo/v2" + . 
"github.com/onsi/gomega" +) + +var _ = Describe("hardware-specific defaults", func() { + var origDetect func() bool + + BeforeEach(func() { + origDetect = detectBlackwellGPU + }) + AfterEach(func() { + detectBlackwellGPU = origDetect + }) + + Describe("hardwareDefaultBatchSize", func() { + It("returns the fallback when not Blackwell", func() { + detectBlackwellGPU = func() bool { return false } + Expect(hardwareDefaultBatchSize(512)).To(Equal(512)) + }) + + It("returns BlackwellBatchSize on Blackwell", func() { + detectBlackwellGPU = func() bool { return true } + Expect(hardwareDefaultBatchSize(512)).To(Equal(BlackwellBatchSize)) + }) + }) + + Describe("EffectiveBatchSize on Blackwell", func() { + threads := 1 + ctx := 4096 + + It("defaults an unset batch to 2048 on Blackwell", func() { + detectBlackwellGPU = func() bool { return true } + cfg := config.ModelConfig{Threads: &threads, LLMConfig: config.LLMConfig{ContextSize: &ctx}} + opts := grpcModelOpts(cfg, "/tmp/models") + Expect(opts.NBatch).To(BeEquivalentTo(BlackwellBatchSize)) + }) + + It("keeps an explicit batch over the Blackwell default", func() { + detectBlackwellGPU = func() bool { return true } + cfg := config.ModelConfig{Threads: &threads, LLMConfig: config.LLMConfig{ContextSize: &ctx}} + cfg.Batch = 256 + opts := grpcModelOpts(cfg, "/tmp/models") + Expect(opts.NBatch).To(BeEquivalentTo(256)) + }) + }) +}) diff --git a/core/backend/options.go b/core/backend/options.go index 528c10e525a6..d3ccb2f423c1 100644 --- a/core/backend/options.go +++ b/core/backend/options.go @@ -191,7 +191,10 @@ func EffectiveBatchSize(c config.ModelConfig) int { if ctx := EffectiveContextSize(c); singlePass && ctx > DefaultBatchSize { return ctx } - return DefaultBatchSize + // Hardware-tuned default when the config leaves batch unset (e.g. a larger + // physical batch lifts MoE prefill on Blackwell). Explicit `batch:` (handled + // above) always overrides this. See hardware_defaults.go. 
+ return hardwareDefaultBatchSize(DefaultBatchSize) } func grpcModelOpts(c config.ModelConfig, modelPath string) *pb.ModelOptions { diff --git a/core/backend/options_internal_test.go b/core/backend/options_internal_test.go index 022d7b1d9ec3..7c5b3dad6843 100644 --- a/core/backend/options_internal_test.go +++ b/core/backend/options_internal_test.go @@ -103,6 +103,18 @@ var _ = Describe("grpcModelOpts NBatch", func() { threads := 1 ctx := 4096 + // Pin the hardware seam off so these baseline expectations are + // deterministic regardless of the host GPU. Blackwell behavior is covered + // in hardware_defaults_internal_test.go. + var origDetect func() bool + BeforeEach(func() { + origDetect = detectBlackwellGPU + detectBlackwellGPU = func() bool { return false } + }) + AfterEach(func() { + detectBlackwellGPU = origDetect + }) + It("defaults to 512 for an ordinary model", func() { cfg := config.ModelConfig{Threads: &threads, LLMConfig: config.LLMConfig{ContextSize: &ctx}} opts := grpcModelOpts(cfg, "/tmp/models") diff --git a/core/gallery/importers/importers_test.go b/core/gallery/importers/importers_test.go index ed808ce37ff9..47b6218362ee 100644 --- a/core/gallery/importers/importers_test.go +++ b/core/gallery/importers/importers_test.go @@ -154,6 +154,19 @@ var _ = Describe("DiscoverModelConfig", func() { Expect(err).ToNot(HaveOccurred()) Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: mlx-vlm")) }) + + It("should use llama-cpp-localai-paged backend when specified as a drop-in", func() { + // The paged variant is a curated AdditionalBackends() drop-in: the + // llama-cpp pipeline matches (the .gguf URI), and the backend + // preference is honoured in the emitted YAML. 
+ uri := "https://example.com/my-model.gguf" + preferences := json.RawMessage(`{"backend": "llama-cpp-localai-paged"}`) + + modelConfig, err := importers.DiscoverModelConfig(uri, preferences) + + Expect(err).ToNot(HaveOccurred()) + Expect(modelConfig.ConfigFile).To(ContainSubstring("backend: llama-cpp-localai-paged")) + }) }) Context("with HuggingFace URI formats", func() { @@ -288,7 +301,7 @@ var _ = Describe("DiscoverModelConfig", func() { names = append(names, e.Name) modalities = append(modalities, e.Modality) } - Expect(names).To(ContainElements("ik-llama-cpp", "turboquant")) + Expect(names).To(ContainElements("ik-llama-cpp", "turboquant", "llama-cpp-localai-paged")) for _, m := range modalities { Expect(m).To(Equal("text")) } diff --git a/core/gallery/importers/llama-cpp.go b/core/gallery/importers/llama-cpp.go index 5797e63526b3..f315d55cbcdf 100644 --- a/core/gallery/importers/llama-cpp.go +++ b/core/gallery/importers/llama-cpp.go @@ -37,6 +37,7 @@ func (i *LlamaCPPImporter) AdditionalBackends() []KnownBackendEntry { return []KnownBackendEntry{ {Name: "ik-llama-cpp", Modality: "text", Description: "GGUF drop-in replacement for llama-cpp with ik-quants"}, {Name: "turboquant", Modality: "text", Description: "GGUF drop-in replacement for llama-cpp with TurboQuant optimizations"}, + {Name: "llama-cpp-localai-paged", Modality: "text", Description: "Paged-attention llama.cpp (on-demand paged KV + decode-first prefill budget), tuned for NVFP4 on Blackwell/GB10"}, } } @@ -130,7 +131,7 @@ func (i *LlamaCPPImporter) Import(details Details) (gallery.ModelConfig, error) backend := "llama-cpp" if b, ok := preferencesMap["backend"].(string); ok { switch b { - case "ik-llama-cpp", "turboquant": + case "ik-llama-cpp", "turboquant", "llama-cpp-localai-paged": backend = b } } diff --git a/core/gallery/importers/llama-cpp_test.go b/core/gallery/importers/llama-cpp_test.go index e3f730945c1c..22420724ea43 100644 --- a/core/gallery/importers/llama-cpp_test.go +++ 
b/core/gallery/importers/llama-cpp_test.go @@ -473,7 +473,7 @@ var _ = Describe("LlamaCPPImporter", func() { }) Context("AdditionalBackends", func() { - It("advertises ik-llama-cpp and turboquant as drop-in replacements", func() { + It("advertises ik-llama-cpp, turboquant and llama-cpp-localai-paged as drop-in replacements", func() { entries := importer.AdditionalBackends() names := make([]string, 0, len(entries)) @@ -482,7 +482,7 @@ var _ = Describe("LlamaCPPImporter", func() { names = append(names, e.Name) byName[e.Name] = e } - Expect(names).To(ConsistOf("ik-llama-cpp", "turboquant")) + Expect(names).To(ConsistOf("ik-llama-cpp", "turboquant", "llama-cpp-localai-paged")) ik := byName["ik-llama-cpp"] Expect(ik.Modality).To(Equal("text")) @@ -491,6 +491,10 @@ var _ = Describe("LlamaCPPImporter", func() { tq := byName["turboquant"] Expect(tq.Modality).To(Equal("text")) Expect(tq.Description).NotTo(BeEmpty()) + + paged := byName["llama-cpp-localai-paged"] + Expect(paged.Modality).To(Equal("text")) + Expect(paged.Description).NotTo(BeEmpty()) }) }) }) diff --git a/docs/content/features/backends.md b/docs/content/features/backends.md index 4b7445a98863..84a6650db2c5 100644 --- a/docs/content/features/backends.md +++ b/docs/content/features/backends.md @@ -125,6 +125,7 @@ For getting started, see the available backends in LocalAI here: https://github. LocalAI supports various types of backends: - **LLM Backends**: For running language models (e.g., llama.cpp, vLLM, SGLang, transformers, MLX) + - **`llama-cpp-localai-paged`**: LocalAI's paged-attention llama.cpp variant - on-demand paged KV cache plus a decode-first prefill budget, tuned for NVFP4 dense/MoE on Blackwell/GB10. Same upstream llama.cpp pin as the stock `llama-cpp` backend, reusing its gRPC server; the paged engine is enabled per-model via the `paged_kv` / `max_batch_tokens` options. 
- **Speech-to-Text Backends**: For transcription (e.g., whisper.cpp, parakeet.cpp, faster-whisper, NeMo) - **Text-to-Speech Backends**: For speech synthesis (e.g., piper, Kokoro, VibeVoice, Qwen3-TTS) - **Sound Generation Backends**: For music and audio generation (e.g., ACE-Step) diff --git a/docs/superpowers/plans/2026-06-30-gb10-parity-reopen.md b/docs/superpowers/plans/2026-06-30-gb10-parity-reopen.md new file mode 100644 index 000000000000..9a31f4e2729b --- /dev/null +++ b/docs/superpowers/plans/2026-06-30-gb10-parity-reopen.md @@ -0,0 +1,633 @@ +# GB10 Parity Reopen Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Reopen the GB10 vLLM-parity investigation with clean provenance, then execute gated W4A16, GDN, MoE fan-in, serving, and glue-fusion workstreams only when their entry criteria are met. + +**Architecture:** The plan is phased. Phase 0 creates trustworthy baseline artifacts and command provenance; later phases are fork-first llama.cpp changes regenerated into the LocalAI patch stack. Every branch has a kill gate, and subagents are used only for independent file or artifact ownership. + +**Tech Stack:** LocalAI docs and patch stack, `mudler/llama.cpp:localai-paged`, ggml CUDA kernels, vLLM 0.23.0 on DGX GB10, CUDA 13, Nsight Systems, LocalAI benchmark artifacts. + +--- + +## File Structure + +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_REOPEN_SPEC.md` + - Keep the high-level scope in sync when Phase 0 changes the evidence. +- Create: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` + - Record Phase 0 commands, preflight state, source SHAs, artifact paths, and baseline numbers. 
+- Modify later, fork-first: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cu` + - W4A16 grouped MoE prefill kernel tuning. +- Modify later, fork-first: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cuh` + - W4A16 API and tuning switches. +- Modify later, fork-first: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu` + - `ggml_cuda_mul_mat_id` dispatch, MoE fan-in fusions, graph behavior. +- Modify later, fork-first: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmq.cu` + - W4A16/FP4 prefill routing thresholds. +- Modify later, fork-first: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` + - GDN M5 follow-up variants. +- Modify later, fork-first: `/home/mudler/_git/llama.cpp/src/llama-graph.cpp` + - MoE weighted fan-in graph shape if a fused op is pursued. +- Modify later: `backend/cpp/llama-cpp-localai-paged/patches/paged/*.patch` + - Generated only from fork commits using `git format-patch`; never edited directly. + +## Task 1: Phase 0 Preflight And Run Directory + +**Files:** +- Create: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` + +- [x] **Step 1: Confirm the current worktree state** + +Run: + +```bash +git status --short --branch +git log --oneline --decorate --max-count=5 +``` + +Expected: + +```text +## worktree-feat+paged-attention...origin/worktree-feat+paged-attention [ahead 2] +?? 
.claude/ +``` + +- [x] **Step 2: Run DGX preflight without starting workloads** + +Run: + +```bash +ssh dgx.casa 'set -e +echo "HOST=$(hostname)" +echo "--- docker ps ---" +docker ps --format "{{.ID}} {{.Names}} {{.Image}} {{.Status}}" || true +echo "--- compute apps ---" +nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader || true +echo "--- gpu lock ---" +if [ -e ~/gpu_bench_lock/owner ]; then cat ~/gpu_bench_lock/owner; else echo NO_OWNER; fi +echo "--- source states ---" +git -C ~/llama-paged-fork status --short --branch 2>/dev/null || true +git -C ~/llama-paged-dev status --short --branch 2>/dev/null || true +' +``` + +Expected: + +```text +docker ps has no running containers +compute apps has no rows +gpu lock is FREE or NO_OWNER +DGX source states are recorded, even if dirty +``` + +- [x] **Step 3: Create the Phase 0 artifact directory on DGX** + +Run: + +```bash +ssh dgx.casa 'set -e +mkdir -p ~/bench/reopen_phase0 +date -u +%Y-%m-%dT%H:%M:%SZ > ~/bench/reopen_phase0/created_utc.txt +hostname > ~/bench/reopen_phase0/hostname.txt +docker ps --format "{{.ID}} {{.Names}} {{.Image}} {{.Status}}" > ~/bench/reopen_phase0/docker_ps.txt +nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv,noheader > ~/bench/reopen_phase0/compute_apps.txt || true +if [ -e ~/gpu_bench_lock/owner ]; then cat ~/gpu_bench_lock/owner > ~/bench/reopen_phase0/gpu_lock_owner.txt; else echo NO_OWNER > ~/bench/reopen_phase0/gpu_lock_owner.txt; fi +' +``` + +Expected: + +```text +~/bench/reopen_phase0 exists and contains created_utc.txt, hostname.txt, docker_ps.txt, compute_apps.txt, gpu_lock_owner.txt +``` + +- [x] **Step 4: Write the initial Phase 0 results document from captured values** + +Run: + +```bash +DGX_HOST=$(ssh dgx.casa 'cat ~/bench/reopen_phase0/hostname.txt') +DGX_DOCKER=$(ssh dgx.casa 'if [ -s ~/bench/reopen_phase0/docker_ps.txt ]; then tr "\n" "; " < ~/bench/reopen_phase0/docker_ps.txt; else echo "none"; fi') 
+DGX_COMPUTE=$(ssh dgx.casa 'if [ -s ~/bench/reopen_phase0/compute_apps.txt ]; then tr "\n" "; " < ~/bench/reopen_phase0/compute_apps.txt; else echo "none"; fi') +DGX_LOCK=$(ssh dgx.casa 'cat ~/bench/reopen_phase0/gpu_lock_owner.txt') +LOCALAI_SHA=$(git rev-parse HEAD) +LLAMA_SHA=$(git -C /home/mudler/_git/llama.cpp rev-parse HEAD) +cat > backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md < ~/bench/reopen_phase0/existing_artifact_extract.txt +cat ~/bench/reopen_phase0/existing_artifact_extract.txt +' +``` + +Expected: + +```text +existing_artifact_extract.txt is created and shows CDEF, paged highN, and vLLM highN evidence. +``` + +- [x] **Step 2: Update Phase 0 results with artifact gaps** + +Add: + +```markdown +## Existing Artifact Gap Report + +- CDEF prefill is mixed harness: paged `llama-batched-bench`, vLLM server/h2h. +- Paged high-N difference method has artifact support under `~/highN_prof2`. +- vLLM 1078 t/s true GPU-steady decode is not yet backed by a self-contained + ntg16/ntg64 difference-method artifact in the inspected files. +- CDEF records a dev-tree `GIT_HEAD=a7d439e` while current shipped fork HEAD is + `51168c5ee`; this must be separated from current production-source baselines. +``` + +- [x] **Step 3: Commit Task 3** + +Run: + +```bash +git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md +git commit -m "docs(paged): record phase0 artifact gaps" \ + -m "Record the existing benchmark artifact gaps that must be resolved before accepting the GB10 parity final-state claims." \ + -m "Assisted-by: Codex:gpt-5" +``` + +Expected: + +```text +Commit succeeds. 
+``` + +## Task 4: Clean Build And Canonical Gates + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` + +- [x] **Step 1: Re-run DGX preflight immediately before build** + +Run: + +```bash +ssh dgx.casa 'set -e +test -z "$(docker ps -q)" +test -z "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | grep . || true)" +if [ -e ~/gpu_bench_lock/owner ]; then grep -q "^FREE" ~/gpu_bench_lock/owner; fi +' +``` + +Expected: + +```text +Exit code 0. +``` + +- [x] **Step 2: Start a detached clean build** + +Run: + +```bash +ssh dgx.casa 'set -e +mkdir -p ~/bench/reopen_phase0 +cat > ~/bench/reopen_phase0/build_clean.sh <<'"'"'SH'"'"' +#!/usr/bin/env bash +set -euo pipefail +cd "$HOME" +rm -rf "$HOME/llama-paged-reopen-clean" +git clone git@github.com:mudler/llama.cpp.git "$HOME/llama-paged-reopen-clean" +cd "$HOME/llama-paged-reopen-clean" +git checkout 51168c5eee2e35348d9006f0b2fab3dc6e7c01cc +git status --short --branch > "$HOME/bench/reopen_phase0/build_source_status.txt" +cmake -S . -B build-cuda \ + -DGGML_CUDA=ON \ + -DCMAKE_CUDA_ARCHITECTURES=121 \ + -DCMAKE_BUILD_TYPE=Release \ + -DLLAMA_CURL=OFF +cmake --build build-cuda --target llama-server llama-batched-bench llama-completion test-backend-ops -j"$(nproc)" +git rev-parse HEAD > "$HOME/bench/reopen_phase0/build_git_head.txt" +stat -c "%n %y" build-cuda/bin/llama-server build-cuda/bin/llama-batched-bench build-cuda/bin/llama-completion build-cuda/bin/test-backend-ops > "$HOME/bench/reopen_phase0/build_binary_mtimes.txt" +touch "$HOME/bench/reopen_phase0/build_clean.done" +SH +chmod +x ~/bench/reopen_phase0/build_clean.sh +rm -f ~/bench/reopen_phase0/build_clean.done +nohup ~/bench/reopen_phase0/build_clean.sh > ~/bench/reopen_phase0/build_clean.log 2>&1 & +echo $! > ~/bench/reopen_phase0/build_clean.pid +' +``` + +Expected: + +```text +Command returns quickly and writes build_clean.pid. 
+``` + +- [x] **Step 3: Poll build completion** + +Note: first build attempt started as PID `625392` and failed during CMake +configure because `nvcc` was not on PATH. DGX has +`/usr/local/cuda-13.0/bin/nvcc`; retry uses explicit `CUDACXX`. + +Retry build attempt started as PID `631100` and completed successfully. + +Run: + +```bash +ssh dgx.casa 'for i in $(seq 1 240); do + if [ -f ~/bench/reopen_phase0/build_clean.done ]; then + echo DONE + tail -20 ~/bench/reopen_phase0/build_clean.log + exit 0 + fi + if ! kill -0 "$(cat ~/bench/reopen_phase0/build_clean.pid)" 2>/dev/null; then + echo BUILD_EXITED_WITHOUT_DONE + tail -80 ~/bench/reopen_phase0/build_clean.log + exit 1 + fi + sleep 30 +done +echo BUILD_TIMEOUT +tail -80 ~/bench/reopen_phase0/build_clean.log +exit 2' +``` + +Expected: + +```text +DONE +``` + +- [x] **Step 4: Run canonical md5 gates** + +Run: + +```bash +ssh dgx.casa 'set -e +cd ~/llama-paged-reopen-clean/build-cuda/bin +L="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1" +MOE=/home/mudler/bench/q36-35b-a3b-nvfp4.gguf +DENSE=/home/mudler/bench/q36-27b-nvfp4.gguf +env $L ./llama-completion -m "$MOE" -ngl 99 -fa on -c 4096 --temp 0 --seed 1 -n 48 -p "The capital of France is" ~/bench/reopen_phase0/paged_moe_prefill.txt 2>&1 +env $L ./llama-batched-bench -m "$DENSE" -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32 > ~/bench/reopen_phase0/paged_dense_prefill.txt 2>&1 +grep -E "S_PP|^\\|" ~/bench/reopen_phase0/paged_moe_prefill.txt ~/bench/reopen_phase0/paged_dense_prefill.txt +' +``` + +Expected: + +```text +Both files contain S_PP rows for 512 and 2048. +``` + +- [x] **Step 2: Update Phase 0 results and commit** + +Record exact S_PP rows and artifact paths. + +Run: + +```bash +git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md +git commit -m "docs(paged): record phase0 prefill baseline" \ + -m "Record clean-source MoE and dense prefill baselines for the GB10 parity reopen." 
\ + -m "Assisted-by: Codex:gpt-5" +``` + +Expected: + +```text +Commit succeeds. +``` + +## Task 6: Decode Difference-Method Repro + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` + +- [x] **Step 1: Dispatch a vLLM harness discovery subagent** + +Result: read-only subagent found prior harnesses +`/home/mudler/vllm_moe_nsys.sh` and `/home/mudler/vllm_moe_prof.py`, plus a +concrete `~/highN_vllm_diff` `NSEQ`/`GEN` command sequence using +`nsys profile --cuda-graph-trace=node`. + +Prompt: + +```text +Read-only task. On dgx.casa, inspect existing vLLM high-N profiling scripts/logs under ~/highN_vllm, ~/bench, and the installed vLLM package. Find the exact command sequence needed to produce a graph-node-traced ntg16/ntg64 difference-method decode artifact for vLLM comparable to paged highN_prof2. Do not run vLLM, nsys, servers, builds, or benchmarks. Return commands and artifact paths only. +``` + +Expected: + +```text +Subagent returns a concrete vLLM command sequence or reports that no prior harness exists. +``` + +- [x] **Step 2: Run paged graph-node-traced decode difference-method** + +Run only after DGX preflight passes: + +```bash +ssh dgx.casa 'set -e +test -z "$(docker ps -q)" +test -z "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | grep . 
|| true)" +if [ -e ~/gpu_bench_lock/owner ]; then grep -q "^FREE" ~/gpu_bench_lock/owner; fi +mkdir -p ~/bench/reopen_phase0/paged_decode_nsys +cd ~/llama-paged-reopen-clean/build-cuda/bin +L="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1" +MOE=/home/mudler/bench/q36-35b-a3b-nvfp4.gguf +for NTG in 16 64; do + env $L nsys profile --force-overwrite=true --cuda-graph-trace=node \ + -o ~/bench/reopen_phase0/paged_decode_nsys/paged_moe_n256_ntg${NTG} \ + ./llama-batched-bench -m "$MOE" -c 131072 -b 2048 -ub 512 -ngl 99 -fa on \ + -npp 128 -ntg "$NTG" -npl 256 \ + > ~/bench/reopen_phase0/paged_decode_nsys/paged_moe_n256_ntg${NTG}.bench.log 2>&1 +done +' +``` + +Expected: + +```text +Two `.nsys-rep` files and two `.bench.log` files exist. +``` + +- [x] **Step 3: Run vLLM graph-node-traced decode difference-method** + +Use the exact command sequence from Step 1. Required properties: + +```text +nsys profile uses --cuda-graph-trace=node +N is 128 or 256 +ntg 16 and ntg 64 artifacts are both captured +model is /home/mudler/bench/q36-35b-a3b-nvfp4-vllm +vLLM version is recorded as 0.23.0 or the actual installed value +``` + +Expected: + +```text +Two vLLM graph-node-traced artifacts exist and can be reduced by the difference method. +``` + +- [x] **Step 4: Update Phase 0 results and commit** + +Record paged and vLLM tokens/s using: + +```text +per-token-linear decode throughput = generated token delta / (ntg64 wall - ntg16 wall) +``` + +Run: + +```bash +git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md +git commit -m "docs(paged): record phase0 decode repro" \ + -m "Record graph-node-traced paged and vLLM decode difference-method artifacts for the GB10 parity reopen." \ + -m "Assisted-by: Codex:gpt-5" +``` + +Expected: + +```text +Commit succeeds only after both engines have comparable artifacts. 
+``` + +## Task 7: Phase 1 W4A16 Kill-Gate Plan + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Later fork-first changes in `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cu` + +- [x] **Step 1: Run current W4A16 forced baseline** + +Run: + +```bash +ssh dgx.casa 'set -e +cd ~/llama-paged-reopen-clean/build-cuda/bin +MOE=/home/mudler/bench/q36-35b-a3b-nvfp4.gguf +LBASE="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1" +env $LBASE ./llama-batched-bench -m "$MOE" -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32 > ~/bench/reopen_phase0/w4a16_off.txt 2>&1 +env $LBASE LLAMA_W4A16_PREFILL_M=64 LLAMA_W4A16_DEBUG=1 ./llama-batched-bench -m "$MOE" -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32 > ~/bench/reopen_phase0/w4a16_on_thr64.txt 2>&1 +grep -E "S_PP|^\\||W4A16" ~/bench/reopen_phase0/w4a16_off.txt ~/bench/reopen_phase0/w4a16_on_thr64.txt +' +``` + +Expected: + +```text +Artifacts prove current clean W4A16 delta against FP4-MMQ. +``` + +- [x] **Step 2: Decide first W4A16 implementation target** + +Selected target: Option B, device-side or cached tile metadata. The clean +forced W4A16 run without debug remains about 43-45% slower than default +FP4-MMQ, while the debug artifact shows repeated ragged tile-map construction +with `n_tiles=139..282` and `multi_tile_experts=7..21`. Source inspection shows +host-built `h_tile_expert`, `h_tile_row0`, and `h_tile_rows` copied to device +for each grouped W4A16 launch. + +Use nsys or debug logs to choose exactly one first target: + +```text +Option A: fuse/remove f32->bf16 cast pre-pass +Option B: device-side tile metadata +Option C: 16-byte weight staging/shared-memory layout +Option D: tile-shape retune for ragged expert M +``` + +Expected: + +```text +Only one implementation target is selected for the first fork commit. 
+``` + +- [x] **Step 3: Stop before kernel edits if Phase 0 is incomplete** + +Phase 0 Tasks 1-6 are complete. No kernel edits were made during Phase 0. + +Expected: + +```text +No W4A16 code edit begins unless Tasks 1-6 are complete or explicitly waived by the maintainer. +``` diff --git a/docs/superpowers/plans/2026-06-30-serving-nsys-phase6.md b/docs/superpowers/plans/2026-06-30-serving-nsys-phase6.md new file mode 100644 index 000000000000..d63c1092ff7e --- /dev/null +++ b/docs/superpowers/plans/2026-06-30-serving-nsys-phase6.md @@ -0,0 +1,221 @@ +# Phase 6: Serving nsys Gap Classifier + +**Status:** Completed. Phase 6 kept no source changes. + +**Scope:** Measurement-first. Do not edit llama.cpp source in this phase unless +the serving profiles identify a small, bit-exact, fork-first patch candidate. +Every candidate must pass the md5 and op gates before it can be mirrored into +LocalAI patches. + +**Goal:** Classify the remaining GB10 MoE serving gap against vLLM by profiling a +steady serving window for both engines, then pick exactly one next lever from +measured evidence. + +## Safety Gates + +- Canonical paged MoE greedy md5 must stay `8cb0ce23777bf55f92f63d0292c756b0`. +- Canonical dense greedy md5 must stay `5951a5b4d624ce891e22ab5fca9bc439`. +- If a patch touches W4A16, forced `bm32` and `base` md5 must both stay + `07db32c2bcb78d17a43ed18bc22705cd`. +- If a patch touches `MUL_MAT_ID` routing or CUDA MoE kernels, run + `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1` and require `806/806`. +- Patch promotion threshold: no semantic gate regression, no generated patch + hand-editing, and at least one measured serving bucket improvement that explains + a material share of the vLLM gap. +- Inference-safety rule: a candidate that changes CUDA routing, sampler inputs, + graph construction, or MoE kernels is not kept unless the md5 gates are rerun + from the clean candidate binary and still match the canonical values above. 
+ Performance-only evidence is insufficient. + +## Checklist + +- [x] Confirm DGX is idle before running GPU work. + - Docker containers: `0`. + - Compute PIDs: `0`. + - Lock: `FREE released-by-claude-fp4norm-profile 1782828229`. + - GPU util: `0%`. +- [x] Locate/reuse the existing llama.cpp and vLLM serving harnesses. + - Both-engine h2h harness: `/home/mudler/bench/combined_definitive.sh`. + - Current OpenAI completions load client: `/home/mudler/bench/h2h_cli3.py`. + - Paged serving command shape: `llama-server -c 262144 --parallel 256 -b 2048 + -ub 512 -ngl 99 -fa on --port 8090 --no-webui` with + `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1`. + - vLLM serving command shape: `vllm serve + /home/mudler/bench/q36-35b-a3b-nvfp4-vllm --served-model-name q36 + --gpu-memory-utilization 0.85 --max-model-len 4096 --max-num-seqs 256 + --port 8000 --tensor-parallel-size 1`. + - Existing static high-N nsys harnesses: + `/home/mudler/highN_nsys.sh`, `/home/mudler/vllm_moe_nsys.sh`, and + `/home/mudler/vllm_moe_prof.py`. +- [x] Inspect `MUL_MAT_ID` fallback predicates before patching. + - `LLAMA_MOE_FORCE_GRAPHS=1` is used by harnesses but is not an implemented + hard-force predicate in the inspected CUDA path. + - The host fallback still has stream synchronizations after device-to-host ids + copy and after sorted-id upload. + - Highest-risk condition to verify by nsys: NVFP4 Blackwell MoE with token + count above `LLAMA_FP4_PREFILL_M` or `LLAMA_W4A16_PREFILL_M` can route away + from grouped MMQ into the host fallback. +- [x] Build exact fork head on DGX for Phase 6 profiling. + - Source mirror: `/home/mudler/llama-phase6-source`. + - Fork head: `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`. + - Build config: CUDA Release, `CMAKE_CUDA_COMPILER=/usr/local/cuda-13.0/bin/nvcc`, + `CMAKE_CUDA_ARCHITECTURES=121`. + - Built targets: `llama-server`, `llama-batched-bench`, `llama-completion`, + `test-backend-ops`. 
+- [x] Run canonical md5 gates before serving profiling. + - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`. + - Note: an older `-no-cnv` command shape produced different text hashes; the + Phase 0 canonical command without `-no-cnv` matches the recorded gates. +- [x] Capture a llama-server steady serving nsys window with + `--cuda-graph-trace=node`. +- [x] Capture a comparable vLLM steady serving nsys window with + `--cuda-graph-trace=node`. +- [x] Reduce both profiles into kernel/API buckets: + `MUL_MAT_ID`, FA decode, gated_delta_net, bf16 projections, activation-quant, + sampling/logits, and CUDA API sync/memcpy. +- [x] Count `cudaStreamSynchronize` and host copies between `MUL_MAT_ID` launches + to confirm or reject the host-sync fallback risk. +- [x] Compare serving-narrow vs static-wide vs vLLM and select one next lever: + H1 MoE GEMM collapse/fallback, H2 paged FA ragged imbalance, H3 GDN narrow + occupancy, H4 projection/ragged batch efficiency, H5 sampling/logits, or H6 + activation quant. +- [x] If a source change is justified, implement fork-first in + `/home/mudler/_git/llama.cpp`, keep it stacked as one incremental commit, then + mirror with `git format-patch` into LocalAI. +- [x] Run the safety gates and update this file plus + `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`. + +## Current Decision + +W4A16 prefill was not the highest-leverage path for Phase 6. The accepted Phase 1-4 +changes improved forced W4A16 from roughly `1314/1339` to `1466/1495` S_PP, but +default FP4-MMQ remains around `2303/2423`. The next evidence gate is serving +nsys, because the committed lever map says the residual gap is in real +continuous serving, not the static wide decode kernel regime. + +## Expected Output + +Artifacts should land under a dated `~/bench/phase6_serving_nsys/` directory on +DGX. 
For each engine, keep: + +- server command and client command logs, +- nsys `.nsys-rep` files, +- exported `cuda_gpu_kern_sum`, `cuda_api_sum`, and any trace table needed to + count syncs between MoE kernels, +- reduced CSV/markdown bucket summary, +- md5/op gate logs for any patch candidate. + +## Results + +Artifacts: + +- llama.cpp serving nsys: + `/home/mudler/bench/phase6_serving_nsys/llama_server_n128/`. +- vLLM serving nsys: + `/home/mudler/bench/phase6_serving_nsys/vllm_server_n128/`. +- rejected sampler short-circuit gates: + `/home/mudler/bench/phase6_serving_nsys/sampler_shortcircuit_gates/`. +- rejected sampler short-circuit serving A/B: + `/home/mudler/bench/phase6_serving_nsys/sampler_shortcircuit_ab/`. + +Serving result at 128 clients, `ptok=128`, `gen=128`: + +| Engine | decode tok/s/seq | decode agg tok/s | prefill tok/s | +|--------|------------------|------------------|---------------| +| llama.cpp under nsys | 4.05 | 591.0 | 1567.4 | +| vLLM under nsys | 6.95 | 961.1 | 5073.6 | + +llama.cpp nsys top buckets: + +- `gated_delta_net_cuda`: 33.7% GPU kernel time, 10.21s. +- NVFP4 `mul_mat_q`: 24.3% + 5.5% for the largest grouped variants, 9.04s + combined. +- `quantize_mmq_nvfp4`: 2.7%, 0.81s. +- `flash_attn_tile`: 1.3%, 0.38s. +- CUDA API: `cudaStreamSynchronize` 76.5% API time, 23.66s over 106585 calls. + 8028 synchronizes followed `cudaMemcpyAsync` and summed 21.41s. + +vLLM nsys top buckets: + +- `fused_recurrent_gated_delta_rule_packed_decode_kernel`: 16.6%, 8.95s. +- `marlin_moe_wna16::Marlin`: 11.9% plus smaller Marlin-MoE variants. +- `flash_fwd_splitkv_kernel`: 0.6% + 0.1% visible split-K FA decode rows. +- CUDA API has startup/module-load noise in the delayed profile, so use the + kernel buckets and h2h result as the primary comparison. 
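The serving table and the CUDA API bucket above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the Phase 6 harness: the dictionary layout and the helper names `ratio` and `sync_breakdown` are hypothetical, and the only inputs are the figures already reported in this section.

```python
# Figures copied from the Phase 6 serving table and llama.cpp CUDA API bucket.
# The structure and helper names below are illustrative assumptions.
SERVING = {
    "llama.cpp": {"decode_per_seq": 4.05, "decode_agg": 591.0, "prefill": 1567.4},
    "vLLM": {"decode_per_seq": 6.95, "decode_agg": 961.1, "prefill": 5073.6},
}

# llama.cpp cudaStreamSynchronize totals under nsys.
SYNC_TOTAL_S = 23.66
SYNC_CALLS = 106585
SYNC_AFTER_MEMCPY_S = 21.41
SYNC_AFTER_MEMCPY_CALLS = 8028


def ratio(metric: str) -> float:
    """Return the vLLM / llama.cpp ratio for one serving metric."""
    return SERVING["vLLM"][metric] / SERVING["llama.cpp"][metric]


def sync_breakdown() -> dict:
    """Split synchronize time into 'after cudaMemcpyAsync' and 'all other calls'."""
    other_s = SYNC_TOTAL_S - SYNC_AFTER_MEMCPY_S
    other_calls = SYNC_CALLS - SYNC_AFTER_MEMCPY_CALLS
    return {
        "after_memcpy_time_share": SYNC_AFTER_MEMCPY_S / SYNC_TOTAL_S,
        "after_memcpy_call_share": SYNC_AFTER_MEMCPY_CALLS / SYNC_CALLS,
        "after_memcpy_mean_ms": 1e3 * SYNC_AFTER_MEMCPY_S / SYNC_AFTER_MEMCPY_CALLS,
        "other_mean_ms": 1e3 * other_s / other_calls,
    }


if __name__ == "__main__":
    for metric in ("decode_per_seq", "decode_agg", "prefill"):
        print(f"{metric}: vLLM/llama.cpp = {ratio(metric):.2f}x")
    for key, value in sync_breakdown().items():
        print(f"{key}: {value:.4f}")
```

Read this way, roughly 90% of the synchronize time sits in about 7.5% of the synchronize calls, namely the ones that follow a `cudaMemcpyAsync`. That concentration is only a pointer to where to look; it does not by itself say which uploads are responsible.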
+ +Decision: + +- The sync-heavy path is real, but source inspection shows the initial + `MUL_MAT_ID` host-fallback hypothesis is incomplete for this run: + `LLAMA_FP4_PREFILL_M` and `LLAMA_W4A16_PREFILL_M` were unset, so grouped MMQ + should stay enabled. The largest sync signature instead matches thousands of + small synchronous tensor uploads, especially backend sampler inputs. +- A fork-first sampler short-circuit experiment skipped backend distribution + sampling when prior backend filters collapsed the candidate set to one token + (`temperature=0` path). It passed gates: + - MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense md5 `5951a5b4d624ce891e22ab5fca9bc439`. + - `MUL_MAT_ID`: `806/806` on CUDA0. +- The sampler experiment was rejected: no-nsys serving reps were `4.19` and + `3.55` tok/s/seq, not a material improvement over the known baseline band. + The fork patch was reverted; no commit and no LocalAI patch were created. + +Next lever: + +- H3/H1 combined, but with H3 checked first: llama.cpp spends 33.7% of GPU time + in GDN decode versus vLLM's 16.6%, and vLLM's aggregate decode remains 1.63x + faster in the same serving shape. + +## Follow-up: GDN Env Grid + +Artifact: `/home/mudler/bench/phase6_serving_nsys/gdn_grid/`. + +Shape: `n=128`, `ptok=128`, `gen=64`. + +| Setting | decode tok/s/seq | decode agg tok/s | Decision | +|---------|------------------|------------------|----------| +| default | 3.91 | 647.9 | baseline | +| `GDN_NW=4 GDN_CPW=1` | 3.80 | 628.9 | reject | +| `GDN_NW=8 GDN_CPW=2` | 3.94 | 624.5 | reject | +| `GDN_NW=8 GDN_CPW=4` | 3.91 | 647.6 | reject | +| `GDN_NW=8 GDN_CPW=8` | 4.00 | 636.9 | no material win | +| `GDN_NW=16 GDN_CPW=4` | 3.85 | 637.5 | reject | +| `GDN_NW=16 GDN_CPW=8` | 3.96 | 652.0 | no material win | + +Result: rejected as an env-only lever. Existing GDN geometry variants are too +close in the serving gate to justify a source change. 
Next focus is the largest +remaining differentiating bucket: llama.cpp NVFP4 grouped `mul_mat_q` versus +vLLM Marlin-MoE. + +## Follow-up: MoE MMQ Tile Env Grid + +Artifact: `/home/mudler/bench/phase6_serving_nsys/mmq_grid/`. + +Shape: `n=128`, `ptok=128`, `gen=64`. + +| Setting | decode tok/s/seq | decode agg tok/s | Decision | +|---------|------------------|------------------|----------| +| default | 3.90 | 645.3 | baseline | +| `LLAMA_MOE_AUTO_TILE=0` | 3.90 | 655.3 | tied/no material win | +| `LLAMA_MOE_DECODE_TILE=32` | 3.82 | 635.9 | reject | +| `LLAMA_MOE_DECODE_TILE=48` | 3.81 | 637.3 | reject | +| `LLAMA_MOE_DECODE_TILE=96` | 3.84 | 642.8 | reject | +| `LLAMA_MOE_DECODE_TILE=128` | 3.84 | 640.6 | reject | +| `LLAMA_MOE_MMQ_X=32` | 3.76 | 642.0 | reject; prefill worsened | + +Result: rejected as an env-only lever. Existing grouped-MMQ tile knobs do not +materially close the serving gap, so a selector-only source patch is not +justified. + +## Completion + +Phase 6 completed as a classifier, not as a source patch phase: + +- Accepted source patches before Phase 6 remained intact through fork head + `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3`. +- The sampler short-circuit candidate passed inference gates but failed the + serving performance gate, so it was reverted and not mirrored. +- GDN and grouped-MMQ env grids did not clear the material-improvement threshold. +- No LocalAI patch was generated for Phase 6. The next phase must start from a + clean fork and keep the same md5/op gates before any source candidate is kept. diff --git a/docs/superpowers/plans/2026-06-30-serving-source-phase7.md b/docs/superpowers/plans/2026-06-30-serving-source-phase7.md new file mode 100644 index 000000000000..6ca880cf9d4a --- /dev/null +++ b/docs/superpowers/plans/2026-06-30-serving-source-phase7.md @@ -0,0 +1,390 @@ +# Phase 7: Serving Source Candidate Scope + +**Status:** Test-gate patches landed. 
Two production CUDA fusion candidates +rejected after DGX gates and serving A/B. + +**Goal:** Select one maintainable source candidate for the remaining GB10 MoE +serving gap, then implement only if it can be gated for inference correctness and +measured against a bucket that Phase 6 proved relevant. + +## Entry State + +- llama.cpp fork: `/home/mudler/_git/llama.cpp` +- Required branch: `localai-paged` +- Required clean head: `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3` +- LocalAI patch mirror count before Phase 7: `41`, through patch `0050` +- DGX mirror used by Phase 6: `/home/mudler/llama-phase6-source` + +## Required Safety Gates + +- Before DGX work: + - `docker ps -q | wc -l` must be `0`. + - `nvidia-smi --query-compute-apps=pid --format=csv,noheader` must be empty. + - `~/gpu_bench_lock/owner` must be absent or start with `FREE`. + - No `local-ai-worker` container may be running. +- Before keeping any source patch: + - MoE greedy md5 must be `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense greedy md5 must be `5951a5b4d624ce891e22ab5fca9bc439`. + - If W4A16 is touched, forced `bm32` and `base` md5 must both be + `07db32c2bcb78d17a43ed18bc22705cd`. + - If `MUL_MAT_ID` routing or CUDA MoE kernels are touched, run + `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1` and require `806/806`. +- Patch handling: + - Source changes are fork-first in `/home/mudler/_git/llama.cpp`. + - Keep each patch incremental and additive, with helper functions preferred + over invasive rewrites. + - Regenerate LocalAI patches with `git format-patch`; do not hand-edit + generated patch files. + +## Candidate Tracks + +### Track A: Structural MoE Decode Kernel + +Phase 6 evidence: grouped NVFP4 `mul_mat_q` accounts for roughly 30% of llama.cpp +GPU kernel time under serving, while vLLM's Marlin-MoE bucket is materially +smaller in the same workload class. 
+ +The candidate must identify a bounded change in the current `MUL_MAT_ID` or +grouped-MMQ path that reduces actual serving bucket time. Selector-only tile +retuning is rejected unless new evidence differs from the Phase 6 MMQ grid. + +Selected first candidate: + +- Add a batched CUDA path that fuses MoE SWIGLU with the NVFP4 activation + quantization feeding the **down** `MUL_MAT_ID`. +- Current graph shape: + `ffn_moe_gate_up` `MUL_MAT_ID` -> gate/up views -> `ggml_swiglu_split` -> + `ffn_moe_down` `MUL_MAT_ID`. +- Target: remove or reduce the separate f32 SWIGLU intermediate write/read and + `quantize_mmq_nvfp4` pass for the down projection while preserving the existing + grouped-MMQ kernel and accumulation order. +- Keep scope to CUDA, Blackwell native FP4, `GGML_TYPE_NVFP4`, merged gate/up + MoE, down projection only, no bias/clamp/OAI/GEGLU. + +Important finding: + +- Existing CUDA `MUL_MAT_ID + GLU` fusion is vector-only. The fusion predicates + reject `MUL_MAT_ID` when `dst->ne[2] != 1`, so it does not cover the Phase 6 + multi-token serving shape. +- Existing `MUL_MAT_ID_FUSION` tests cover add/mul after `MUL_MAT_ID`, not the + gate_up/SWIGLU/down chain. Do not treat them as sufficient for this candidate. + +Initial files to inspect: + +- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu` +- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmq.cu` +- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cu` +- vLLM Marlin-MoE implementation files in the local vLLM checkout/package. + +### Track B: Serving Input And Sampler Synchronization (Deferred) + +Phase 6 evidence: `cudaStreamSynchronize` dominates CUDA API time, and many +syncs follow small `cudaMemcpyAsync` calls. The greedy sampler short-circuit +passed correctness gates but did not improve serving, so this track needs a +workload where sampler/input upload cost is proven relevant before patching. 
+ +Initial files to inspect: + +- `/home/mudler/_git/llama.cpp/src/llama-sampling.cpp` +- `/home/mudler/_git/llama.cpp/src/llama-context.cpp` +- `/home/mudler/_git/llama.cpp/ggml/src/ggml-backend.cpp` +- CUDA backend tensor-set paths under `/home/mudler/_git/llama.cpp/ggml/src/`. + +Selected secondary candidate: + +- Cache backend logit-bias tensor uploads in + `/home/mudler/_git/llama.cpp/src/llama-sampler.cpp` + `llama_sampler_logit_bias_backend_set_input()`. +- Today the sampler rebuilds and uploads `logit_bias` and `logit_idxs` every + decode step. Those uploads hit the CUDA tensor-set path with immediate + `cudaStreamSynchronize`. +- This is narrow and maintainable, but it is not the default greedy parity + lever. Only promote it if a non-greedy backend-sampling workload with non-empty + `logit_bias` proves the sync bucket is material. + +Challenge result: + +- Demoted from default-parity scope. Default serving does not enable + backend sampling, and no-bias greedy requests do not hit the logit-bias upload + path. +- The path is real but niche: `llama_sampler_logit_bias_backend_set_input()` + rebuilds and uploads `logit_bias` and `logit_idxs` every active decode step, + and CUDA `ggml_backend_tensor_set()` synchronizes after each upload. +- Only pursue as a separate backend-sampling feature when the workload is + `backend_sampling=true` with non-empty `logit_bias` or `ignore_eos=true`, + preferably with a large synthetic bias list and nsys proof that the small H2D + syncs are material. + +### Track C: Deterministic MoE Weighted-Combine Fusion + +Selected next candidate after rejecting Track A's SWIGLU-down shortcut and +demoting Track B: + +- Fuse the post-down MoE combine in `src/llama-graph.cpp`: + `ffn_moe_down` -> optional `ggml_mul(experts, weights)` -> + rank-ordered `ggml_add` fan-in. 
+- Target tensor shapes: + - experts: `[n_embd, n_expert_used, n_tokens]` + - weights: `[1, n_expert_used, n_tokens]` + - output: `[n_embd, n_tokens]` +- Preserve exact current arithmetic order: + 1. compute each rank's f32 product as the existing `ggml_mul` would, + 2. accumulate ranks in order `0, 1, ..., n_expert_used - 1`, + 3. avoid atomics and expert-sorted accumulation. +- File targets: + - `/home/mudler/_git/llama.cpp/tests/test-backend-ops.cpp`: add + `MOE_WEIGHTED_COMBINE` whole-graph gate first. + - `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu`: add a + narrow fusion recognizer only after the test gate exists. + - CUDA helper file under `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/` + for the fused weighted-combine kernel if the graph-fusion hook is viable. + +Why this is safer than Track A's rejected shortcut: + +- It does not touch SWIGLU, NVFP4 packing, activation quantization, or + `MUL_MAT_ID` K-reduction. +- vLLM's Marlin-MoE path also keeps GEMM1, activation, GEMM2, then reduce; it + does not validate a SWIGLU-down quantization shortcut. +- The lever is structural: fewer launches and memory passes around existing f32 + post-GEMM tensors, with exact rank-order arithmetic as the md5 target. + +Required gates: + +- `test-backend-ops test -b CUDA0 -o MOE_WEIGHTED_COMBINE -j 1`. +- `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1` remains `806/806`. +- Canonical md5 gates remain: + - MoE `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense `5951a5b4d624ce891e22ab5fca9bc439`. +- Serving A/B only after gates, using `n=128`, `ptok=128`, `gen=64`, + `/v1/completions`, `--no-cache`, plus nsys evidence that + `ffn_moe_weighted`/add fan-in time drops. + +Required workload: + +- Include a non-greedy serving shape if the patch targets sampler randomness or + probability upload behavior. +- Preserve the canonical greedy md5 gates even if the optimization targets + non-greedy serving. 
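The rank-ordered arithmetic that Track C must preserve can be sketched as a small host-side reference. This is an illustration under stated assumptions, not the fork's kernel: `f32` and `weighted_combine` are hypothetical helper names, nested lists are indexed as `experts[token][rank][embd]` (the reverse of the ggml `[n_embd, n_expert_used, n_tokens]` notation), and f32 rounding is emulated with `struct`.

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE-754 binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def weighted_combine(experts, weights, rank_order=None):
    """Hypothetical reference: out[t][e] = sum over ranks of experts * weights.

    Each rank's product is rounded to f32 first, then ranks are accumulated
    one at a time in `rank_order` (default 0, 1, ..., n_used - 1), rounding
    to f32 after every add, mirroring a chain of mul followed by adds.
    """
    out = []
    for t, token_experts in enumerate(experts):
        n_used = len(token_experts)
        order = list(range(n_used)) if rank_order is None else rank_order
        row = []
        for e in range(len(token_experts[0])):
            acc = None
            for r in order:
                prod = f32(f32(token_experts[r][e]) * f32(weights[t][r]))
                acc = prod if acc is None else f32(acc + prod)
            row.append(acc)
        out.append(row)
    return out

# One token, three ranks, n_embd = 1. Values chosen so f32 rounding is visible.
experts = [[[1.0e8], [-1.0e8], [1.0]]]
weights = [[1.0, 1.0, 1.0]]

in_order = weighted_combine(experts, weights)
reordered = weighted_combine(experts, weights, rank_order=[0, 2, 1])
print(in_order, reordered)
```

The two calls differ only in accumulation order, yet they produce different f32 results. That is the reason the plan rules out atomics and expert-sorted accumulation: any reordering can change low-order bits and therefore the md5 gate.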
+ +## Decision Gate + +Only one track may enter implementation at a time. Promote a candidate from scope +to implementation when all are true: + +- It has an exact file/function target. +- It is additive enough to minimize upstream conflicts. +- It has a direct measurement bucket from Phase 6 or a fresh bounded profile. +- It has a clear rollback path. +- It passes the md5/op gates before any performance result is accepted. + +## Checklist + +- [x] Close remaining Phase 6 explorer agents or capture their final findings. +- [x] Reconfirm DGX idle state before any new benchmark. + - Docker containers: `0`. + - `local-ai-worker`: `0`. + - Compute PIDs: `0`. + - Lock: `FREE released-by-codex-phase6-mmq-grid 1782860601`. +- [x] Pick Track A or Track B from concrete code evidence. + - Primary: Track A, batched MoE SWIGLU -> NVFP4 down-input quantization. + - Secondary: Track B, backend logit-bias upload cache for non-greedy workloads. +- [x] Run baseline gates from the clean candidate build. + - Artifact: `/home/mudler/bench/phase7_source_scope/`. + - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`. + - Baseline `MUL_MAT_ID`: `806/806`. +- [x] Implement one fork-first incremental patch. + - Fork commit: `cd56cf037` (`test(paged): cover MoE swiglu down chain`). + - LocalAI patch: `0051-test-paged-cover-MoE-swiglu-down-chain.patch`. + - Scope: test gate only; no production inference path changed. +- [x] Run md5/op gates before serving A/B. + - `MOE_SWIGLU_DOWN`: `7/7` on CUDA0. + - Serving A/B is not applicable to this test-only patch. +- [x] Keep only if the serving bucket and h2h result improve materially. + - Rejected candidate: opt-in SWIGLU-down NVFP4 quantization fusion. + - Default path was protected behind `GGML_CUDA_FUSE_SWIGLU_DOWN_MMQ=1`. + - Default md5 gates stayed canonical, but the opt-in paged-MoE md5 changed + to the non-paged namespace (`07db32c2bcb78d17a43ed18bc22705cd`). 
+ - Serving A/B was flat: default `decode_agg_tps=657.1`, + `decode_perseq_tps=3.92`, `prefill_tps=1456.0`; opt-in + `decode_agg_tps=667.4`, `decode_perseq_tps=3.88`, `prefill_tps=1462.9`. +- [x] Regenerate LocalAI patch stack and update docs if kept. + - No production patch kept; only docs updated for the rejected candidate. +- [x] Challenge remaining Track B and vLLM evidence before starting the next + patch. + - Track B is deferred to a backend-sampling/logit-bias workload. + - vLLM confirms GEMM1 -> activation -> GEMM2 -> reduce; no SWIGLU-down + shortcut to copy. + - Next candidate: deterministic post-down MoE weighted-combine fusion. +- [x] Add `MOE_WEIGHTED_COMBINE` test gate in the fork before production code. + - Fork commit: `3ef7eb9e4` (`test(paged): cover MoE weighted combine chain`). + - LocalAI patch: `0052-test-paged-cover-MoE-weighted-combine-chain.patch`. + - DGX gate: `MOE_WEIGHTED_COMBINE` `7/7` on CUDA0. +- [x] Implement weighted-combine fusion only if the test gate is stable. + - Implemented as a fork-first candidate, then rejected after serving A/B. + - Rejected diff saved at + `/home/mudler/bench/phase7_source_scope/rejected-phase7-moe-weighted-combine-fusion.diff`. +- [x] Run op/md5 gates before serving A/B. + - `MOE_WEIGHTED_COMBINE`: `7/7`. + - `MUL_MAT_ID`: `806/806`. + - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`. + - Nsight proof: enabled run showed `110` launches of + `k_moe_weighted_combine`; disabled run showed none. + - Serving A/B was flat: disabled `decode_agg_tps=417.5`, + fused `decode_agg_tps=417.0`. + +## Required Tests Before Track A Source Patch + +- Add or extend a whole-graph op test for the batched MoE gate_up/SWIGLU/down + chain. Shapes must include `type_a=NVFP4`, `n_mats=128`, `n_used=8`, + `m=768`, `k=2048`, and `n in {16, 33, 64, 128, 130, 200}`. + - Done in fork commit `cd56cf037`. 
+- Run `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1` and require `806/806` + until a more specific op name is available. + - Baseline done before the test-gate patch. +- Run canonical MoE and dense greedy md5 gates before serving A/B: + - MoE `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense `5951a5b4d624ce891e22ab5fca9bc439`. + - Baseline done before the test-gate patch. +- Run a mixed prompt/decode md5 gate (`ptok=512`, `gen=32`) because graph reuse + can hide bugs that a decode-only gate misses. + +## Patch 0051 Result + +Patch `0051` adds a whole-graph test named `MOE_SWIGLU_DOWN`. It covers the +merged MoE gate_up -> SWIGLU -> down projection chain and includes: + +- one small F32 wiring case, +- NVFP4 Qwen-style cases with `n_mats=128`, `n_used=8`, `n_ff=768`, + `n_embd=2048`, and `n_tokens in {16, 33, 64, 128, 130, 200}`. + +The first run used the inherited single-FP4-op tolerance (`2e-2`) and failed +consistently at roughly `0.0213-0.0218` NMSE. Root cause: this whole-graph gate +compounds two native-FP4 `MUL_MAT_ID` ops with SWIGLU between them, so the test +uses `2.5e-2` for Blackwell native-FP4 backends and keeps the F32 wiring case at +the stricter default tolerance. + +DGX result after the adjustment: + +- `test-backend-ops test -b CUDA0 -o MOE_SWIGLU_DOWN -j 1`: `7/7`. +- Patch mirror applies cleanly to base pin `0ed235ea2c17a19fc8238668653946721ed136fd` + and tree-matches fork head `cd56cf037`. +- Mirrored tree hash: `623b7cb008a929455ca3d9deae35494c02622fef`. + +## Rejected Production Candidate: SWIGLU-Down MMQ Fusion + +Attempted a fork-first CUDA patch that fused `GGML_OP_GLU(SWIGLU)` into the +NVFP4 activation quantization feeding the down-projection `MUL_MAT_ID`. The +patch kept the existing grouped-MMQ kernel and only replaced the separate f32 +SWIGLU write/read plus down-input quantize pass. 
+ +Root-cause note from the first failed op gate: the fused quantizer initially used +the compact GLU output strides to read the split `gate`/`up` views. Those views +stride over the original merged gate/up tensor, so the NVFP4 cases read wrong +rows and failed at roughly `2.0` NMSE. Switching the fused quantizer to the +source-view strides fixed the focused op gate. + +Final DGX artifacts live under `/home/mudler/bench/phase7_source_scope/`: + +- Forced fusion op gate: + `GGML_CUDA_FUSE_SWIGLU_DOWN_MMQ=1 test-backend-ops test -b CUDA0 -o MOE_SWIGLU_DOWN -j 1` + -> `7/7`. +- Broad default op gate: + `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1` -> `806/806`. +- Default inference md5 after protecting the fusion behind + `GGML_CUDA_FUSE_SWIGLU_DOWN_MMQ=1`: + - MoE: `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense: `5951a5b4d624ce891e22ab5fca9bc439`. +- Opt-in fusion inference md5: + - MoE: `07db32c2bcb78d17a43ed18bc22705cd` (not the canonical paged-MoE md5). + - Dense: `5951a5b4d624ce891e22ab5fca9bc439`. +- Serving A/B, `n=128`, `ptok=128`, `gen=64`, `/v1/completions`, + `--no-cache`: + - default: `decode_agg_tps=657.1`, `decode_perseq_tps=3.92`, + `prefill_tps=1456.0`. + - opt-in: `decode_agg_tps=667.4`, `decode_perseq_tps=3.88`, + `prefill_tps=1462.9`. + +Verdict: reject the production patch. The opt-in path is not md5-safe for +paged-MoE and the bounded serving A/B is effectively flat. Do not spend more +time on this exact activation-quant fusion unless a future KL gate explicitly +allows a new paged-MoE md5 namespace and a profile shows a material bucket win. + +## Required Tests Before Track B Source Patch + +- Establish fixed-seed baseline output md5 and token-id parity for a + backend-sampling request with non-empty `logit_bias`. +- Include the canonical greedy MoE and dense md5 gates even though the workload + target is non-greedy. +- Run existing server completion tests covering backend sampling probabilities + and logit-bias behavior. 
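The md5 gates referenced throughout this plan amount to hashing a run's output and comparing it with a recorded value. The sketch below shows that comparison only. It assumes the gate hashes the raw output bytes of the completion run, which this document does not spell out, and `check_gate` is a hypothetical helper rather than an existing script.

```python
import hashlib

# Canonical hashes quoted in this plan.
CANONICAL_MD5 = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}

def check_gate(name, output_bytes, expected=CANONICAL_MD5):
    """Return (passed, actual_md5) for one gate.

    Assumption: the gate is an md5 over the captured output bytes.
    """
    actual = hashlib.md5(output_bytes).hexdigest()
    return actual == expected[name], actual

# Demo with a placeholder payload; a real gate would pass the captured
# transcript of the greedy run instead.
demo_expected = {"demo": hashlib.md5(b"hello").hexdigest()}
demo_ok, demo_md5 = check_gate("demo", b"hello", demo_expected)
moe_ok, moe_md5 = check_gate("moe", b"hello")
print(demo_ok, moe_ok)
```

Returning the actual hash alongside the verdict makes it easy to log a mismatch, for example when an opt-in path lands in a different md5 namespace.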
+ +## Patch 0052 Result + +Patch `0052` adds a whole-graph test named `MOE_WEIGHTED_COMBINE`. It covers the +post-down MoE combine candidate: + +`down MUL_MAT_ID -> router-weight ggml_mul -> rank-ordered expert views/adds`. + +Coverage: + +- one small F32 wiring case, +- NVFP4 Qwen-style cases with `n_mats=128`, `n_used=8`, `n_ff=768`, + `n_embd=2048`, and `n_tokens in {16, 33, 64, 128, 130, 200}`. + +DGX result: + +- `test-backend-ops test -b CUDA0 -o MOE_WEIGHTED_COMBINE -j 1`: `7/7`. + +This is a test-only patch and does not change the production inference path. + +## Rejected Production Candidate: MoE Weighted-Combine Fusion + +Attempted a fork-first CUDA fusion for the post-down MoE combine: + +`ffn_moe_down -> ggml_mul(experts, weights) -> VIEW ranks -> ADD fan-in`. + +The candidate added a narrow graph recognizer and a CUDA kernel that computes +each rank's f32 product and accumulates ranks in the same `0..n_used-1` order as +the existing add chain. It was default-on with +`LLAMA_MOE_NO_WEIGHTED_COMBINE_FUSION=1` as the rollback switch during +validation. + +Important debugging result: + +- The first serving profile did not show the new kernel. Root cause: the + recognizer's `ggml_can_fuse_subgraph()` op vector was interleaved as + `MUL, VIEW, VIEW, ADD...`, while the real graph order is + `MUL, VIEW..., ADD...`. +- After fixing the op vector, Nsight showed the enabled completion run launched + `k_moe_weighted_combine` `110` times and the disabled run launched it `0` + times. + +Final DGX artifacts live under `/home/mudler/bench/phase7_source_scope/`: + +- Focused gate: + `test_backend_ops_moe_weighted_combine_orderfix.txt` -> `7/7`. +- Broad MoE routing gate: + `test_backend_ops_mul_mat_id_weighted_combine_orderfix.txt` -> `806/806`. +- Canonical transcript md5 gates: + `weighted_combine_orderfix_gates_chat/` + - MoE: `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense: `5951a5b4d624ce891e22ab5fca9bc439`. 
+- Nsight completion proof: + `weighted_combine_orderfix_nsys_completion/` + - disabled: no `k_moe_weighted_combine` kernels. + - fused: `110` `k_moe_weighted_combine` kernels. +- Serving A/B: + `weighted_combine_orderfix_serving_ab/` + - disabled: `decode_agg_tps=417.5`, `decode_perseq_tps=2.63`, + `prefill_tps=1345.2`. + - fused: `decode_agg_tps=417.0`, `decode_perseq_tps=2.63`, + `prefill_tps=1346.9`. + +Verdict: reject the production patch. It is md5-safe and demonstrably fires, but +the bounded serving result is flat, so the extra default CUDA path is not worth +the upstream conflict and maintenance cost. Keep patch `0052` as coverage for +future structural MoE work, but do not retry this exact post-down fan-in fusion +unless a profile shows `ffn_moe_weighted`/add fan-in as a material bucket under a +new workload. diff --git a/docs/superpowers/plans/2026-06-30-w4a16-kernel-shape-phase2.md b/docs/superpowers/plans/2026-06-30-w4a16-kernel-shape-phase2.md new file mode 100644 index 000000000000..2ecda4086fb3 --- /dev/null +++ b/docs/superpowers/plans/2026-06-30-w4a16-kernel-shape-phase2.md @@ -0,0 +1,105 @@ +# W4A16 Kernel Shape Phase 2 Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development or superpowers:executing-plans. Keep checkboxes current while executing. + +**Goal:** Attack the remaining W4A16 prefill gap at the grouped kernel body, not metadata. + +**Scope:** Fork-first in `/home/mudler/_git/llama.cpp`; LocalAI patch series is regenerated only after the fork commit is validated. Keep W4A16 default-off unless `LLAMA_W4A16_PREFILL_M > 0`. + +## Task 1: Profile-Guided Target Selection + +- [x] Run `nsys` for default FP4-MMQ and forced W4A16 at `npp=512`. +- [x] Compare kernel attribution for metadata/cast/body costs. +- [x] Decide next implementation target from measured cost, not speculation. + +Result: `w4a16_grouped_kernel` is the dominant forced-W4A16 cost (`5231.667 ms`, `47.8%` of profiled GPU kernel time). 
`w4a16_cast_act_f32_bf16` is visible but much smaller (`517.195 ms`, `4.7%`). Phase 2 targets grouped-kernel tile shape/body first. + +## Task 2: Runtime Shape Selector + +**Files:** +- Modify fork-first: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cu` + +- [x] Add a small runtime selector for W4A16 grouped-kernel shape experiments. +- [x] Preserve the current `64x128` shape as the default path. +- [x] Add multiple candidate specializations behind an environment selector: a vLLM-inspired wider-`N` candidate, a ragged-M candidate, an occupancy candidate, and a deeper pipeline candidate. +- [x] Keep launch and shared-memory calculations template-safe for each specialization. + +## Task 3: DGX Validation And Kill Gate + +- [x] Build the fork on DGX from the updated source snapshot. +- [x] Run canonical paged MoE and dense greedy md5 gates after the final code change. +- [x] Confirm gate hashes match the established inferencing references before committing. +- [x] Run forced W4A16 A/B for default shape and candidate shape at `npp=512,2048`. +- [x] Run forced W4A16 `MUL_MAT_ID` op checks for selected `bm32` and old `base`. +- [x] Profile the winning candidate if it improves enough to understand the new bottleneck. +- [x] Record whether the candidate improves, regresses, or is neutral. + +Initial candidates: + +- `default` / `64x128`: current Phase 1 shape. +- `bn256`: wider N reuse, inspired by vLLM large-batch Marlin config. +- `bm32`: smaller M tiles for ragged MoE expert tails. +- `bn64`: smaller N tiles to test occupancy/latency limits. +- `stages3`: current tile shape with deeper `cp.async` pipeline. + +Kill gate: keep a shape candidate as the new default only if it improves forced W4A16 prefill throughput by at least 3% at either `npp=512` or `npp=2048` without regressing the other by more than 1%. Otherwise revert or leave it as an off-by-env diagnostic only if it is useful for future sweeps. 
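The kill gate above can be expressed as a small predicate. This is a minimal sketch, assuming the gate is evaluated on `S_PP` throughput at the two `npp` sizes; `pct_change` and `passes_kill_gate` are hypothetical names, and the sample numbers are taken from this plan's Phase 2 result table.

```python
def pct_change(new, old):
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0

def passes_kill_gate(base, cand, min_gain_pct=3.0, max_regress_pct=1.0):
    """Pass if some npp gains >= min_gain_pct and no other npp regresses
    by more than max_regress_pct. `base` and `cand` map npp -> S_PP t/s."""
    changes = {npp: pct_change(cand[npp], base[npp]) for npp in base}
    for npp, gain in changes.items():
        others_ok = all(
            c >= -max_regress_pct for other, c in changes.items() if other != npp
        )
        if gain >= min_gain_pct and others_ok:
            return True, changes
    return False, changes

base = {512: 1308.02, 2048: 1339.46}       # `base` / 64x128
bm32 = {512: 1442.99, 2048: 1475.65}
bn64 = {512: 1334.80, 2048: 1362.55}
stages3 = {512: 1271.01, 2048: 1295.96}

for name, cand in [("bm32", bm32), ("bn64", bn64), ("stages3", stages3)]:
    ok, changes = passes_kill_gate(base, cand)
    summary = ", ".join(f"npp={n}: {c:+.2f}%" for n, c in changes.items())
    print(f"{name}: {'keep' if ok else 'reject'} ({summary})")
```

With these inputs only `bm32` clears the 3% bar. `bn64` is positive but below threshold, which is consistent with it being recorded as diagnostic only.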
+ +## Task 4: Mirror And Document + +- [x] Commit the accepted fork-first result with `Assisted-by: Codex:gpt-5`. +- [x] Regenerate only the new LocalAI patch mirror entry. +- [x] Verify the full LocalAI patch mirror applies to the base pin and matches fork HEAD. +- [x] Update `PARITY_HANDOFF.md` and phase results with artifact paths and decision. +- [x] Commit the LocalAI mirror/docs result with `Assisted-by: Codex:gpt-5`. + +Artifacts: + +- Profile directory: `~/bench/w4a16_phase1/profile` +- Candidate build directory: `~/llama-w4a16-phase2` +- Candidate benchmark directory: `~/bench/w4a16_phase2` + +Result: + +| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision | +|-------|--------------|---------------|----------| +| `base` / `64x128` | 1308.02 | 1339.46 | old baseline | +| `bn256` | 1286.99 | 1311.56 | rejected | +| `bm32` / `32x128` | 1442.99 | 1475.65 | selected | +| `bn64` | 1334.80 | 1362.55 | diagnostic only | +| `stages3` | 1271.01 | 1295.96 | rejected | +| `bn256x16` | 1084.66 | 1100.95 | rejected | + +Only `bm32` and the old `base` selector are shipped in patch `0049`. The other +candidate shapes were benchmarked in the Phase 2 build and then deliberately +left out to keep the upstream conflict surface small. + +Follow-up default verification with `LLAMA_W4A16_SHAPE` unset: + +| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s | +|----|----|---|------|--------|----------|--------|----------|-----|-------| +| 512 | 4 | 32 | 16512 | 11.360 | 1442.28 | 0.321 | 397.00 | 11.682 | 1413.43 | +| 2048 | 4 | 32 | 65664 | 44.529 | 1471.77 | 0.331 | 386.06 | 44.860 | 1463.75 | + +Profile: + +- `bm32` `w4a16_grouped_kernel`: `4107.355 ms` (`41.7%`) at profiled `npp=512`. +- Phase 1 `64x128` `w4a16_grouped_kernel`: `5231.667 ms` (`47.8%`) at profiled `npp=512`. 
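The two profile lines above can be cross-checked with a little arithmetic. The sketch below derives the implied total GPU kernel time from each reported share. These totals are estimates, since the percentages are rounded to one decimal place, and the variable names are illustrative.

```python
# Values copied from the profile bullets above (npp=512).
bm32_kernel_ms, bm32_share = 4107.355, 0.417
base_kernel_ms, base_share = 5231.667, 0.478

kernel_delta_ms = base_kernel_ms - bm32_kernel_ms
kernel_reduction_pct = kernel_delta_ms / base_kernel_ms * 100.0

# Implied totals: kernel time divided by its reported share.
bm32_total_ms = bm32_kernel_ms / bm32_share
base_total_ms = base_kernel_ms / base_share
total_delta_ms = base_total_ms - bm32_total_ms

print(f"grouped-kernel time reduction: {kernel_reduction_pct:.1f}%")
print(f"implied total kernel time: {base_total_ms:.0f} ms -> {bm32_total_ms:.0f} ms")
print(f"kernel delta {kernel_delta_ms:.0f} ms vs total delta {total_delta_ms:.0f} ms")
```

The drop in implied total time is close to the drop in `w4a16_grouped_kernel` time, which suggests the `bm32` gain comes from the grouped kernel itself rather than from a shift in other buckets.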
+ +Canonical post-change gates: + +- MoE command: `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 ./llama-completion -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -ngl 99 -fa on -c 4096 --temp 0 --seed 1 -n 48 -p "The capital of France is" **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development or superpowers:executing-plans. Keep checkboxes current while executing. + +**Goal:** Test the first W4A16 kill-gate target selected by Phase 0: reduce host-built tile metadata overhead in the grouped W4A16 MoE prefill path. + +**Scope:** Fork-first in `/home/mudler/_git/llama.cpp`; LocalAI patch series is regenerated only after the fork commit is validated. + +## Task 1: Packed Tile Descriptor + +**Files:** +- Modify fork-first: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cu` +- Modify fork-first if needed: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cuh` + +- [x] Replace `h_tile_expert`, `h_tile_row0`, and `h_tile_rows` with one packed tile descriptor vector. +- [x] Replace three device pool allocations and three `cudaMemcpyAsync` calls with one descriptor allocation and one H2D copy. +- [x] Keep default-off behavior unchanged when `LLAMA_W4A16_PREFILL_M=0`. + +## Task 2: Fork Build And Gates + +- [x] Build the fork on DGX from a clean source snapshot. +- [x] Run canonical MoE and dense md5 gates. +- [x] Run W4A16 off/on prefill A/B at `npp=512,2048`. +- [x] Record whether packed descriptors improve, regress, or do not materially change W4A16. + +Result: packed descriptors passed md5 gates and improved forced W4A16 by only +`+0.39%` at `npp=512` and `+0.48%` at `npp=2048`; W4A16 remains `-42.9%` and +`-44.7%` behind default FP4-MMQ respectively. + +## Task 3: Next Decision + +- [x] If W4A16 improves materially, continue metadata work toward device-side/cached descriptor generation. 
+- [x] If W4A16 does not improve materially, keep the patch only if it simplifies the path and choose the next target from the observed bottleneck. +- [x] Commit fork-first result, regenerate LocalAI patches, verify mirror invariant, and update LocalAI results docs. + +Decision: keep the packed descriptor patch as a simplification, but do not spend +the next iteration on metadata alone. The remaining gap is dominated elsewhere; +next target should be the activation cast or MMA/dequant tile body. diff --git a/docs/superpowers/plans/2026-06-30-w4a16-scale-broadcast-phase3.md b/docs/superpowers/plans/2026-06-30-w4a16-scale-broadcast-phase3.md new file mode 100644 index 000000000000..432d1d597af9 --- /dev/null +++ b/docs/superpowers/plans/2026-06-30-w4a16-scale-broadcast-phase3.md @@ -0,0 +1,56 @@ +# W4A16 Scale Broadcast Phase 3 Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development or superpowers:executing-plans. Keep checkboxes current while executing. + +**Goal:** Test a minimal W4A16 grouped-kernel body optimization after Phase 2 selected `bm32`. + +**Scope:** Fork-first in `/home/mudler/_git/llama.cpp`; mirror into LocalAI only after build, md5, op, perf, and mirror gates pass. Keep patch `0050` incremental on top of `0049`, and keep the source diff small. + +## Task 1: Implement Scale Broadcast + +- [x] In `ggml/src/ggml-cuda/w4a16-gemm.cu`, replace per-lane duplicate `ggml_cuda_ue4m3_to_fp32` scale conversion with one conversion per 4-lane `n_local` group plus `__shfl_sync`. +- [x] Keep the existing dequant and MMA order unchanged. +- [x] Do not add broad diagnostic variants or extra launch shapes. + +## Task 2: Gates + +- [x] Build `llama-batched-bench`, `llama-completion`, and `test-backend-ops` on DGX. +- [x] Run canonical default-off paged MoE and dense greedy md5 gates. +- [x] Run forced W4A16 `bm32` vs `base` md5 gates on the canonical prompt. 
+- [x] Run forced W4A16 `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1`. +- [x] Run W4A16 default `bm32` A/B against Phase 2 at `npp=512,2048`. + +## Task 3: Disposition + +- [x] Keep only if it improves W4A16 prefill by at least 1% at either `npp=512` or `npp=2048` without regressing the other by more than 1%. +- [x] If kept, commit fork-first with `Assisted-by: Codex:gpt-5`, generate patch `0050`, verify mirror tree hash, update docs, and commit LocalAI. Not taken: perf gate failed. +- [x] If rejected, revert the fork experiment and record the result without adding a patch. + +Result: rejected, no fork commit and no LocalAI patch `0050`. + +Artifacts: + +- Build: `~/llama-w4a16-phase3` +- Logs: `~/bench/w4a16_phase3` + +Gates: + +- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`. +- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`. +- Forced W4A16 `bm32` md5: `07db32c2bcb78d17a43ed18bc22705cd`. +- Forced W4A16 `base` md5: `07db32c2bcb78d17a43ed18bc22705cd`. +- Forced W4A16 `MUL_MAT_ID`: `806/806` on CUDA0. + +Performance: + +| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision | +|-------|--------------|---------------|----------| +| Phase 2 `bm32` | 1442.28 | 1471.77 | baseline | +| Phase 3 scale-broadcast `bm32` | 1392.46 | 1422.74 | rejected | +| Phase 2 `base` | 1310.13 | 1336.02 | baseline | +| Phase 3 scale-broadcast `base` | 1201.69 | 1221.25 | rejected | + +Disposition: + +- Reverted local fork experiment in `/home/mudler/_git/llama.cpp`. +- Do not retry this exact scale-broadcast approach; shuffle overhead and/or compiler scheduling cost exceeds saved FP8 scale conversion on GB10. 
diff --git a/docs/superpowers/plans/2026-06-30-w4a16-shmem-pad-phase4.md b/docs/superpowers/plans/2026-06-30-w4a16-shmem-pad-phase4.md new file mode 100644 index 000000000000..9c3d5aab8685 --- /dev/null +++ b/docs/superpowers/plans/2026-06-30-w4a16-shmem-pad-phase4.md @@ -0,0 +1,58 @@ +# W4A16 Shared-Memory Padding Phase 4 Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development or superpowers:executing-plans. Keep checkboxes current while executing. + +**Goal:** Test whether padding the grouped W4A16 A tile in shared memory reduces bank conflicts after Phase 2 selected `bm32`. + +**Scope:** Fork-first experiment only. Keep the patch small, preserve math order, and ship no patch unless it passes md5/op gates and improves prefill. + +## Task 1: Implement A-Tile Padding + +- [x] Add a small shared-memory row-stride constant for `sA`. +- [x] Pad `sA` rows by 4 `uint32_t` slots while keeping 16-byte chunk alignment. +- [x] Update only A-copy and `ldmatrix` indexing; do not change W loads, dequant, MMA order, metadata, or launch shape. + +## Task 2: Gates + +- [x] Build `llama-batched-bench`, `llama-completion`, and `test-backend-ops` on DGX. +- [x] Run canonical default-off paged MoE and dense greedy md5 gates. +- [x] Run forced W4A16 `bm32` vs `base` md5 gates. +- [x] Run forced W4A16 `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1`. +- [x] Run W4A16 default `bm32` A/B against Phase 2 at `npp=512,2048`. + +## Task 3: Disposition + +- [x] Keep only if it improves W4A16 prefill by at least 1% at either `npp=512` or `npp=2048` without regressing the other by more than 1%. +- [x] If kept, commit fork-first with `Assisted-by: Codex:gpt-5`, generate patch `0050`, verify mirror tree hash, update docs, and commit LocalAI. +- [x] If rejected, revert the fork experiment and record the result without adding a patch. Not taken: the patch was kept. 
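Why padding `sA` rows by 4 `uint32_t` slots can help is easiest to see with a simplified bank model. The sketch below is an illustration under stated assumptions, not the kernel's real layout: it assumes 32 four-byte shared-memory banks, an `ldmatrix`-style access in which 8 rows each read one 16-byte segment, and a hypothetical unpadded row stride of 32 words (the actual tile width in `w4a16-gemm.cu` is not stated here).

```python
NUM_BANKS = 32           # 4-byte shared-memory banks (assumed)
WORDS_PER_ROW_READ = 4   # one 16-byte segment per row address
ROWS_PER_PHASE = 8       # rows touched together by one 8x8 b16 tile load

def max_bank_conflict(row_stride_words):
    """Worst-case number of accesses landing on the same bank when
    ROWS_PER_PHASE rows, `row_stride_words` apart, are read together."""
    hits = [0] * NUM_BANKS
    for row in range(ROWS_PER_PHASE):
        base = row * row_stride_words
        for w in range(WORDS_PER_ROW_READ):
            hits[(base + w) % NUM_BANKS] += 1
    return max(hits)

UNPADDED_STRIDE = 32                  # hypothetical: 128-byte rows
PADDED_STRIDE = UNPADDED_STRIDE + 4   # rows padded by 4 uint32_t slots

unpadded = max_bank_conflict(UNPADDED_STRIDE)
padded = max_bank_conflict(PADDED_STRIDE)
still_aligned = (PADDED_STRIDE * 4) % 16 == 0   # 16-byte chunk alignment

print(f"unpadded: {unpadded}-way, padded: {padded}-way, aligned: {still_aligned}")
```

In this model every unpadded row starts on the same bank, so the 8 row segments serialize, while the padded stride shifts each row by 4 banks and spreads the 8 segments across all 32 banks. The padded stride is still a multiple of 16 bytes, matching the alignment requirement in Task 1.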
+ +Result: kept as fork commit `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3` and LocalAI patch `0050-feat-paged-pad-W4A16-A-shared-tile-stride.patch`. + +Artifacts: + +- Build: `~/llama-w4a16-phase4` +- Logs: `~/bench/w4a16_phase4` + +Gates: + +- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`. +- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`. +- Forced W4A16 `bm32` md5: `07db32c2bcb78d17a43ed18bc22705cd`. +- Forced W4A16 `base` md5: `07db32c2bcb78d17a43ed18bc22705cd`. +- Forced W4A16 `MUL_MAT_ID`: `806/806` on CUDA0. + +Performance: + +| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision | +|-------|--------------|---------------|----------| +| Phase 2 `bm32` | 1442.28 | 1471.77 | baseline | +| Phase 4 A-pad `bm32` | 1466.62 | 1495.93 | selected | +| Phase 2 `base` | 1310.13 | 1336.02 | baseline | +| Phase 4 A-pad `base` | 1337.88 | 1364.98 | positive diagnostic | + +Mirror verification: + +- Applying all 41 `patches/paged/*.patch` files to base pin + `0ed235ea2c17a19fc8238668653946721ed136fd` reproduces fork HEAD + `d9b9be0bee3d7239132bfca05d5b057ff4ee4cc3` by tree hash: + `8fcb151e0620fd0fc82b80c04318e5c34320b087`. diff --git a/docs/superpowers/plans/2026-06-30-w4a16-wq-pad-phase5.md b/docs/superpowers/plans/2026-06-30-w4a16-wq-pad-phase5.md new file mode 100644 index 000000000000..15ad199373f3 --- /dev/null +++ b/docs/superpowers/plans/2026-06-30-w4a16-wq-pad-phase5.md @@ -0,0 +1,56 @@ +# W4A16 Wq Shared-Memory Padding Phase 5 Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development or superpowers:executing-plans. Keep checkboxes current while executing. + +**Goal:** Test whether padding the grouped W4A16 quantized-weight shared-memory row stride improves the post-`0050` kernel. + +**Scope:** Fork-first experiment on top of `0050`. Keep it separate and incremental. Ship no patch unless it passes md5/op gates and improves prefill. 
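Why a 4-slot row pad can reduce bank conflicts is easiest to see with a toy bank model. The following is a minimal sketch, assuming 32 shared-memory banks of 4 bytes each and 8 rows that each read one 16-byte chunk in the same phase (an `ldmatrix`-style pattern). The unpadded stride of 32 `uint32_t` slots is a hypothetical value chosen for illustration, not the kernel's actual tile width.

```python
BANKS = 32          # assumed: 32 banks, each 4 bytes (one uint32_t) wide
WORDS_PER_ROW = 4   # one 16-byte chunk per row, i.e. 4 uint32_t slots


def max_bank_load(row_stride_words, rows=8):
    """Worst-case number of accesses landing in one bank when `rows`
    consecutive rows each read a 16-byte chunk at column 0."""
    load = [0] * BANKS
    for r in range(rows):
        start = r * row_stride_words
        for w in range(WORDS_PER_ROW):
            load[(start + w) % BANKS] += 1
    return max(load)


unpadded = max_bank_load(32)      # hypothetical stride: every row starts at bank 0
padded = max_bank_load(32 + 4)    # same rows, padded by 4 uint32_t slots
```

In this model the unpadded rows all hit banks 0 to 3, while the padded rows spread across all 32 banks, and the padded stride is still a multiple of 16 bytes. Whether the real kernel sees the same effect depends on its actual tile shape and access pattern, which is why the plan gates on measurement.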
+ +## Task 1: Implement Wq Padding + +- [x] Add a Wq shared-memory row-stride constant. +- [x] Pad Wq rows by 4 `uint32_t` slots. +- [x] Update only Wq copy and Wq byte-indexing; do not change A padding, Wd layout, dequant math, MMA order, metadata, or launch shape. + +## Task 2: Gates + +- [x] Build `llama-batched-bench`, `llama-completion`, and `test-backend-ops` on DGX. +- [x] Run canonical default-off paged MoE and dense greedy md5 gates. +- [x] Run forced W4A16 `bm32` vs `base` md5 gates. +- [x] Run forced W4A16 `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1`. +- [x] Run W4A16 default `bm32` A/B against Phase 4 at `npp=512,2048`. + +## Task 3: Disposition + +- [x] Keep only if it improves W4A16 prefill by at least 1% at either `npp=512` or `npp=2048` without regressing the other by more than 1%. +- [x] If kept, commit fork-first with `Assisted-by: Codex:gpt-5`, generate patch `0051`, verify mirror tree hash, update docs, and commit LocalAI. Not taken: perf gate did not clear 1%. +- [x] If rejected, revert the fork experiment and record the result without adding a patch. + +Result: rejected, no fork commit and no LocalAI patch `0051`. + +Artifacts: + +- Build: `~/llama-w4a16-phase5` +- Logs: `~/bench/w4a16_phase5` + +Gates: + +- Canonical paged MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`. +- Canonical dense md5: `5951a5b4d624ce891e22ab5fca9bc439`. +- Forced W4A16 `bm32` md5: `07db32c2bcb78d17a43ed18bc22705cd`. +- Forced W4A16 `base` md5: `07db32c2bcb78d17a43ed18bc22705cd`. +- Forced W4A16 `MUL_MAT_ID`: `806/806` on CUDA0. 
+ +Performance: + +| Shape | 512 S_PP t/s | 2048 S_PP t/s | Decision | +|-------|--------------|---------------|----------| +| Phase 4 A-pad `bm32` | 1466.62 | 1495.93 | baseline | +| Phase 5 Wq-pad `bm32` | 1472.36 | 1504.82 | rejected: below 1% gate | +| Phase 4 A-pad `base` | 1337.88 | 1364.98 | baseline | +| Phase 5 Wq-pad `base` | 1337.70 | 1368.48 | diagnostic | + +Disposition: + +- Reverted local fork experiment in `/home/mudler/_git/llama.cpp`. +- Do not ship Wq padding alone; the measured gain is below the maintenance threshold. diff --git a/docs/superpowers/plans/2026-07-01-admission-budget-sweep-phase53.md b/docs/superpowers/plans/2026-07-01-admission-budget-sweep-phase53.md new file mode 100644 index 000000000000..f20d29d6859b --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-admission-budget-sweep-phase53.md @@ -0,0 +1,99 @@ +# Phase53 Admission Budget Sweep Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Test whether existing default-off scheduler knobs (`LLAMA_MAX_BATCH_TOKENS`, `LLAMA_PREFILL_CAP`) improve dense `n=128` serving enough to pursue a scheduler policy patch. + +**Architecture:** Temporarily apply the Phase51 trace patch to the clean DGX mirror, build the patched server, bracket the sweep with canonical md5/op gates, run dense `n=128`, `ptok=128`, `gen=64` variants, parse h2h plus admission trace rows, then revert the DGX mirror. + +**Tech Stack:** DGX GB10, llama.cpp `build-cuda`, `LLAMA_SERVING_TRACE=1`, `h2h_cli3.py`, `paged-inference-gates.sh`. + +--- + +### Task 1: Prepare patched DGX trace build + +- [x] **Step 1: Check preflight** + +Artifact: `/home/mudler/bench/phase53_dense_admission_budget_sweep/20260701_111915`. 
+Preflight: docker `0`, `local-ai-worker` `0`, compute `0`, owner +`FREE released-by-codex-phase52-dense-admission-trace-clean 1782897309`. + +- [x] **Step 2: Apply Phase51 patch and build** + +Applied `/tmp/phase51-serving-admission-trace.patch` to +`~/llama-phase6-source`. Built `llama-server`, `llama-completion`, and +`test-backend-ops` in `build-cuda`. + +### Task 2: Gate before sweep + +- [x] **Step 1: Run canonical pre-sweep gate** + +Observed: + +- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0` +- dense md5 `5951a5b4d624ce891e22ab5fca9bc439` +- `MUL_MAT` `1146/1146` +- `MUL_MAT_ID` `806/806` + +### Task 3: Run budget variants + +- [x] **Step 1: Run `T=1536`, `cap=512`** + +Environment: `LLAMA_MAX_BATCH_TOKENS=1536 LLAMA_PREFILL_CAP=512`. + +Result: + +```text +agg=134.4 decode_agg=376.7 perseq=1.82 prefill=607.0 ttft=22263.7 wall=60.968 +steps=81 decode_only_steps=0 prompt_tokens=23809 max_waiting_prompt_slots=26 prefill_budget_step=1535 prefill_cap_per_slot=512 +``` + +- [x] **Step 2: Run `T=1024`, `cap=512`** + +Environment: `LLAMA_MAX_BATCH_TOKENS=1024 LLAMA_PREFILL_CAP=512`. 
+ +Result: + +```text +agg=130.0 decode_agg=392.4 perseq=1.82 prefill=565.2 ttft=23234.3 wall=63.003 +steps=89 decode_only_steps=0 prompt_tokens=23809 max_waiting_prompt_slots=16 prefill_budget_step=1021 prefill_cap_per_slot=512 +``` + +### Task 4: Parse and decide + +- [x] **Step 1: Write `summary.tsv`** + +Summary: + +| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | wall s | steps | max waiting prompt slots | +|---------|---------|-----------------|-------------|--------------|--------|-------|--------------------------| +| default Phase52 | `139.0` | `360.5` | `629.5` | `23171.5` | `58.921` | `76` | `35` | +| `T=1536 cap=512` | `134.4` | `376.7` | `607.0` | `22263.7` | `60.968` | `81` | `26` | +| `T=1024 cap=512` | `130.0` | `392.4` | `565.2` | `23234.3` | `63.003` | `89` | `16` | + +Decision: simple budget shrinkage trades aggregate/prefill throughput for a +higher h2h decode-agg metric and does not materially solve TTFT. Do not promote +these knobs as a parity lever. The next step should be either per-step histogram +tracing or a more targeted policy that improves first-token admission without +starving prefill throughput. + +### Task 5: Gate after sweep and clean DGX + +- [x] **Step 1: Run canonical post-sweep gate** + +Observed: + +- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0` +- dense md5 `5951a5b4d624ce891e22ab5fca9bc439` +- `MUL_MAT` `1146/1146` +- `MUL_MAT_ID` `806/806` + +- [x] **Step 2: Revert temporary DGX patch** + +Reverted the Phase51 patch from `~/llama-phase6-source`. Final DGX state: +docker `0`, `local-ai-worker` `0`, compute `0`, owner +`FREE released-by-codex-phase53-budget-sweep 1782897825`. + +- [x] **Step 3: Commit docs** + +Commit this plan and parity doc updates with `Assisted-by: Codex:gpt-5`. 
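The `key=value` result lines recorded in Task 3 fold into summary rows with a few lines of parsing. The following is a minimal sketch, assuming whitespace-separated `key=value` tokens. `parse_result_line` is a hypothetical helper, and the real `summary.tsv` layout may differ.

```python
def parse_result_line(line):
    """Split a whitespace-separated `key=value` line into a dict of floats."""
    row = {}
    for token in line.split():
        key, _, value = token.partition("=")
        row[key] = float(value)
    return row


# Result lines copied from the `T=1536`, `cap=512` step.
h2h = parse_result_line(
    "agg=134.4 decode_agg=376.7 perseq=1.82 prefill=607.0 ttft=22263.7 wall=60.968"
)
trace = parse_result_line(
    "steps=81 decode_only_steps=0 prompt_tokens=23809 max_waiting_prompt_slots=26 "
    "prefill_budget_step=1535 prefill_cap_per_slot=512"
)

# Relative change against the default Phase52 row of the summary table.
default_agg, default_decode = 139.0, 360.5
agg_change = h2h["agg"] / default_agg - 1.0
decode_change = h2h["decode_agg"] / default_decode - 1.0
```

For this variant aggregate throughput falls by roughly 3% while decode aggregate rises by roughly 4.5%, which is the trade the Task 4 decision describes.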
diff --git a/docs/superpowers/plans/2026-07-01-admission-histogram-trace-phase54.md b/docs/superpowers/plans/2026-07-01-admission-histogram-trace-phase54.md new file mode 100644 index 000000000000..bed2b83f3fc4 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-admission-histogram-trace-phase54.md @@ -0,0 +1,139 @@ +# Phase54 Admission Histogram Trace Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Extend the Phase51 default-off serving trace with compact per-step histograms so scheduler work can see whether the dense high-N run is dominated by a few very large prompt-admission steps, many small mixed steps, or waiting-slot tails. + +**Architecture:** Keep the trace fork-first and default-off behind `LLAMA_SERVING_TRACE=1`. Add only accumulator buckets and formatter output, then temporarily apply the Phase51+Phase54 stack to the DGX mirror, bracket with canonical md5/op gates, run the Phase52-aligned dense trace, and revert the DGX mirror. + +**Tech Stack:** llama.cpp fork, `tools/server/server-admission-trace.h`, CMake unit test, DGX GB10 `build-cuda`, `h2h_cli.py`, `paged-inference-gates.sh`. + +--- + +### Task 1: Add red histogram assertions + +- [x] **Step 1: Extend the focused unit test** + +Added assertions to `tests/test-server-admission-trace.cpp` requiring: + +- `prompt_hist=0:1,257-512:1` +- `decode_hist=128-255:2` +- `waiting_hist=1-7:2` + +- [x] **Step 2: Verify red** + +Observed failure before implementation: + +```text +missing 'prompt_hist=0:1,257-512:1' +``` + +### Task 2: Implement histogram counters + +- [x] **Step 1: Add bucket counters and formatting** + +Added prompt-token, decode-token, and waiting-slot histograms to +`server_admission_trace_totals`. The formatter emits only nonzero buckets. 
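The "only nonzero buckets" output asserted in Task 1 can be illustrated outside the fork. The following is a minimal sketch in Python rather than the fork's C++. The `65-128` and `129-256` bucket edges and the sample step values are assumptions, since only some bucket labels appear in this plan.

```python
# Inclusive (low, high) ranges; high=None marks the open-ended tail bucket.
# The 65-128 and 129-256 buckets are assumed; the other labels appear in the plan.
PROMPT_BUCKETS = [(0, 0), (1, 64), (65, 128), (129, 256), (257, 512), (513, None)]


def bucket_label(low, high):
    """Render a bucket as `0`, `1-64`, or `513+`."""
    if high is None:
        return f"{low}+"
    return str(low) if low == high else f"{low}-{high}"


def format_hist(name, values, buckets):
    """Count values per bucket and emit only the nonzero buckets."""
    counts = [0] * len(buckets)
    for v in values:
        for i, (low, high) in enumerate(buckets):
            if v >= low and (high is None or v <= high):
                counts[i] += 1
                break
    parts = [f"{bucket_label(*b)}:{c}" for b, c in zip(buckets, counts) if c]
    return f"{name}=" + ",".join(parts)


# Hypothetical per-step prompt-token counts: one idle step, one 300-token step.
line = format_hist("prompt_hist", [0, 300], PROMPT_BUCKETS)
```

With these hypothetical inputs the sketch produces the same string shape as the first assertion in the unit test; the fork's actual test inputs are not shown in this plan.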
+ +- [x] **Step 2: Verify local green** + +Commands: + +```bash +cmake --build build --target test-server-admission-trace -j2 +./build/bin/test-server-admission-trace +ctest --test-dir build -R '^test-server-admission-trace$' --output-on-failure +cmake --build build --target llama-server -j2 +``` + +Observed: focused unit test passed, CTest passed, and `llama-server` built. The +local UI asset build first hit a Node engine mismatch and then recovered through +the repo's downloaded UI bundle path. + +### Task 3: Commit fork patch + +- [x] **Step 1: Commit on the llama.cpp fork** + +Local fork commit: + +```text +bd7b2e952 feat(server): add admission trace histograms +``` + +Fork stack now has two unpushed trace commits: + +- `c6cb8460e feat(server): trace serving admission batches` +- `bd7b2e952 feat(server): add admission trace histograms` + +- [ ] **Step 2: Push fork branch** + +Blocked by policy: ask before every push. Do not push without explicit approval. + +- [ ] **Step 3: Regenerate LocalAI patch series** + +Pending until the fork branch is pushed, per the fork-first mirror invariant. + +### Task 4: Verify on DGX + +- [x] **Step 1: Apply temporary stack and build** + +Applied `/tmp/phase54-admission-trace-stack.patch` to the clean +`~/llama-phase6-source` mirror. Built `test-server-admission-trace`, +`llama-server`, `llama-cli`, and `test-backend-ops` in `build-cuda`. + +DGX CTest passed: + +```bash +ctest --test-dir build-cuda -R '^test-server-admission-trace$' --output-on-failure +``` + +- [x] **Step 2: Run canonical pre/post inference gates** + +Artifact: +`/home/mudler/bench/phase54_admission_hist_trace/20260701_113201`. + +Pre and post gates both matched: + +- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0` +- dense md5 `5951a5b4d624ce891e22ab5fca9bc439` +- `MUL_MAT` `1146/1146` +- `MUL_MAT_ID` `806/806` + +- [x] **Step 3: Run dense histogram trace** + +First diagnostic run used `--ptok 128` and produced `prompt_tok_total=17793`; +kept as `paged_hist/`. 
+ +The Phase52-aligned run used `--ptok 168`, matching the prior prompt envelope: + +```json +{"n": 128, "reqs": 128, "gen_total": 8192, "prompt_tok_total": 22913, "gen_per_req": 64.0, "agg_tps": 138.1, "decode_agg_tps": 360.2, "decode_perseq_tps": 1.92, "prefill_tps": 626.7, "ttft_mean_ms": 23393.2, "ttft_max_ms": 36560.5, "wall_s": 59.303} +``` + +Trace: + +```text +serving admission trace: steps=76 decode_only_steps=0 decode_tokens=8064 prompt_tokens=22913 waiting_prompt_slots=267 max_waiting_prompt_slots=34 started_prompt_slots=128 continued_prompt_slots=139 last_n_batch=2048 last_n_ubatch=512 last_prefill_budget_step=0 last_prefill_cap_per_slot=0 prompt_hist=0:63,1-64:1,513+:12 decode_hist=0:3,1-63:10,64-127:10,128-255:53 waiting_hist=0:63,1-7:1,8-15:2,16-31:9,32-63:1 +``` + +### Task 5: Clean up and decide + +- [x] **Step 1: Revert temporary DGX stack** + +Reverted the temporary patch stack and removed the two untracked trace files it +created on the DGX mirror. Final source tree was clean. + +Final DGX state: + +- Docker containers: `0` +- GPU compute apps: `0` +- Lock: `FREE released-by-codex-phase54-hist 1782898659` + +- [x] **Step 2: Record decision** + +The histogram shows the default scheduler spends `63/76` steps with no prompt +tokens and no waiting prompts, then admits prompt work in a small number of very +large prompt chunks (`prompt_hist=513+:12`). Decode remains mostly full-width +(`decode_hist=128-255:53`) and there are still no pure decode-only steps. Static +budget shrinkage is already rejected; the next scheduler A/B should target +first-token admission or prompt-front loading, not lower global batch budgets. 
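The step fractions quoted in the decision can be re-derived from the trace line. The following is a minimal sketch, assuming the `name=label:count,...` layout shown in the trace. `parse_hist` is a hypothetical helper.

```python
def parse_hist(field):
    """Turn `name=label:count,...` into (name, {label: count})."""
    name, _, body = field.partition("=")
    counts = {}
    for item in body.split(","):
        label, _, count = item.rpartition(":")
        counts[label] = int(count)
    return name, counts


steps = 76  # `steps=76` from the trace line

_, prompt = parse_hist("prompt_hist=0:63,1-64:1,513+:12")
_, decode = parse_hist("decode_hist=0:3,1-63:10,64-127:10,128-255:53")
_, waiting = parse_hist("waiting_hist=0:63,1-7:1,8-15:2,16-31:9,32-63:1")

# For this trace the bucket counts of each histogram sum to the step count.
assert sum(prompt.values()) == steps
assert sum(decode.values()) == steps
assert sum(waiting.values()) == steps

idle_prompt_share = prompt["0"] / steps          # steps with no prompt tokens
full_decode_share = decode["128-255"] / steps    # steps in the widest decode bucket
```

About 83% of steps carry no prompt tokens and about 70% sit in the widest decode bucket, which matches the `63/76` and `128-255:53` figures in the decision.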
diff --git a/docs/superpowers/plans/2026-07-01-audited-current-stack-snapshot-phase26.md b/docs/superpowers/plans/2026-07-01-audited-current-stack-snapshot-phase26.md new file mode 100644 index 000000000000..57673cedd881 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-audited-current-stack-snapshot-phase26.md @@ -0,0 +1,69 @@ +# Audited Current Stack Snapshot Phase 26 Plan + +**Date:** 2026-07-01 +**Phase:** 26 +**Goal:** run the reusable current-stack paged-vs-vLLM serving harness end to +end on the DGX, with hardware and compact inference gates attached to the +artifact, so throughput comparisons cannot hide an inference regression. + +## Context + +Phase 20 refreshed the current-stack serving numbers. Phase 24 added +`hardware.txt`; Phase 25 added `gate_summary.tsv`. Phase 26 is the first full +serving run that uses both audit surfaces in one artifact. + +## Checklist + +- [x] **Step 1: Preflight DGX** + - Verified no running docker containers before launch. + - Verified no `local-ai-worker` container before launch. + - Verified no active GPU compute processes before launch. + - Used the owner-file GPU lock protocol. + +- [x] **Step 2: Launch full current-stack snapshot** + - Ran `paged-current-serving-snapshot.sh` from the LocalAI worktree copy. + - Target source: `dgx:~/llama-phase6-source`. + - Source HEAD: `f2521ab12 feat(server): trace speculative batch shapes`. + - Artifact: `/home/mudler/bench/phase26_audited_snapshot/20260701_053650`. + +- [x] **Step 3: Preserve hardware evidence** + - `hardware.txt` recorded `hardware_class=gb10_or_workstation_blackwell`. + - `hardware.txt` recorded `GPU 0: NVIDIA GB10`. + - Driver: `580.159.03`. + - Compute capability: `12.1`. + +- [x] **Step 4: Gate inferencing before and after serving** + - Pre MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`. + - Pre dense md5: `5951a5b4d624ce891e22ab5fca9bc439`. + - Pre `MUL_MAT_ID`: `806/806`. + - Post MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`. 
+ - Post dense md5: `5951a5b4d624ce891e22ab5fca9bc439`. + - Post `MUL_MAT_ID`: `806/806`. + - `gate_summary.tsv` records all rows as `ok`. + +- [x] **Step 5: Capture same-session serving numbers** + - Paged and vLLM were run in the same artifact with the same h2h client. + - `summary.tsv` records the aggregate, decode, per-sequence, TTFT, and prefill + rows plus ratios. + +- [x] **Step 6: Record results in project docs** + - Updated `README.md` with Phase 26 as the latest current-stack snapshot. + - Updated `GB10_PARITY_PHASE0_RESULTS.md` with the full audited result. + - Updated `PARITY_HANDOFF.md` with the operational handoff result and artifact + index. + - Updated `VLLM_PARITY_LEVER_MAP.md` with the current benchmark baseline. + +## Result + +Phase 26 confirms that the current clean stack still does not reach vLLM serving +parity on GB10, while the inference gates remain green before and after the +serving benchmark. + +| n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg | +|---|------------------|-----------------|-------------------|-----------|----------|----------------| +| 8 | 230.8 | 283.2 | 81.5% | 170.6 | 241.6 | 70.6% | +| 32 | 420.0 | 609.0 | 69.0% | 254.6 | 466.7 | 54.6% | +| 128 | 673.4 | 1025.0 | 65.7% | 324.0 | 656.5 | 49.4% | + +Treat `/home/mudler/bench/phase26_audited_snapshot/20260701_053650` as the +current audit-grade GB10 baseline. diff --git a/docs/superpowers/plans/2026-07-01-bf16-cublas-f32-output-phase67.md b/docs/superpowers/plans/2026-07-01-bf16-cublas-f32-output-phase67.md new file mode 100644 index 000000000000..24e3daa2097c --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-bf16-cublas-f32-output-phase67.md @@ -0,0 +1,244 @@ +# BF16 cuBLAS F32 Output Phase67 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. 
Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Test whether BF16 projection GEMMs can write F32 output directly and remove the BF16-to-F32 conversion kernel without breaking inference. + +**Architecture:** Add a default-off `LLAMA_BF16_CUBLAS_F32_OUT=1` branch in the CUDA BF16 cuBLAS path. The default path remains byte-identical. The opt-in path is accepted only if canonical md5/op gates pass and the measured GPU kernel-time reduction is material. + +**Tech Stack:** llama.cpp CUDA backend, cuBLAS `cublasGemmEx`, DGX GB10, LocalAI parity docs, canonical md5 and backend-op gates. + +--- + +## Guardrails + +- Default behavior must remain unchanged when `LLAMA_BF16_CUBLAS_F32_OUT` is unset. +- The source patch must be small and local to the BF16 cuBLAS path. +- The opt-in path is rejected unless it passes: + - MoE paged md5 `8cb0ce23777bf55f92f63d0292c756b0` + - dense md5 `5951a5b4d624ce891e22ab5fca9bc439` + - `test-backend-ops` `MUL_MAT` +- If the opt-in path changes md5, do not benchmark it as a parity shortcut unless a KL plan is explicitly created later. +- Do not regenerate LocalAI patch files in this phase. +- Do not push without explicit approval. 
+ +## Files + +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu` +- Create: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/docs/superpowers/plans/2026-07-01-bf16-cublas-f32-output-phase67.md` +- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` +- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` + +--- + +### Task 1: Add Default-Off BF16 F32 Output Branch + +- [x] **Step 1: Add env helper** + +Add near the cuBLAS route helpers in `ggml-cuda.cu`: + +```c++ +static inline bool ggml_cuda_bf16_cublas_f32_out_enabled() { + static const bool value = []() { + const char * s = getenv("LLAMA_BF16_CUBLAS_F32_OUT"); + return s != nullptr && atoi(s) != 0; + }(); + + return value; +} +``` + +- [x] **Step 2: Branch BF16 cuBLAS output** + +In the `src0->type == GGML_TYPE_BF16` cuBLAS branch, keep the current BF16 +temporary path as the default. When the env is enabled, call `cublasGemmEx` with +the existing BF16 inputs and `dst_dd_i` as `CUDA_R_32F`, then skip the +`to_fp32_cuda` conversion: + +```c++ +if (ggml_cuda_bf16_cublas_f32_out_enabled()) { + CUBLAS_CHECK(cublasGemmEx(ctx.cublas_handle(id), CUBLAS_OP_T, CUBLAS_OP_N, + row_diff, src1_ncols, ne10, + &alpha_f32, src0_ptr, CUDA_R_16BF, ne00, + src1_ptr, CUDA_R_16BF, ne10, + &beta_f32, dst_dd_i, CUDA_R_32F, ldc, + CUBLAS_COMPUTE_32F, + CUBLAS_GEMM_DEFAULT_TENSOR_OP)); +} else { + // existing BF16 temp plus BF16-to-F32 conversion +} +``` + +- [x] **Step 3: Local diff check** + +Run: + +```bash +git -C /home/mudler/_git/llama.cpp diff --check +``` + +Expected: exit `0`. 
+ +--- + +### Task 2: DGX Build and Default Gates + +- [x] **Step 1: Confirm DGX is idle** + +Run: + +```bash +ssh dgx.casa 'cat /tmp/localai-gb10.lock 2>/dev/null || true; docker ps --format "{{.Names}}" | wc -l; (pgrep -af "[l]ocal-ai-worker" || true) | wc -l; nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv,noheader | wc -l' +``` + +Expected: lock `FREE*`, Docker `0`, worker `0`, compute apps `0`. + +- [x] **Step 2: Acquire lock** + +Run: + +```bash +ssh dgx.casa 'printf "codex-phase67-bf16-f32-out %s\n" "$(date +%s)" > /tmp/localai-gb10.lock' +``` + +- [x] **Step 3: Apply patch and build** + +Apply the patch to `/home/mudler/llama-phase6-source`, then run: + +```bash +ssh dgx.casa 'cd /home/mudler/llama-phase6-source && cmake --build build-cuda --target llama-completion llama-batched-bench test-backend-ops -j $(nproc)' +``` + +Expected: exit `0`. + +- [x] **Step 4: Run default gates** + +Run canonical MoE and dense md5 plus: + +```bash +./test-backend-ops test -o MUL_MAT +``` + +Expected default path: + +```text +MoE md5 8cb0ce23777bf55f92f63d0292c756b0 +dense md5 5951a5b4d624ce891e22ab5fca9bc439 +MUL_MAT 1146/1146 +``` + +Observed artifact: `/home/mudler/bench/phase67_bf16_f32_out/20260701_144909`. + +```text +default MoE md5 8cb0ce23777bf55f92f63d0292c756b0 +default dense md5 5951a5b4d624ce891e22ab5fca9bc439 +default MUL_MAT 1146/1146 +``` + +--- + +### Task 3: Opt-In Correctness Gate + +- [x] **Step 1: Run opt-in md5 gates** + +Run the same MoE and dense commands with: + +```bash +LLAMA_BF16_CUBLAS_F32_OUT=1 +``` + +Expected: exact same md5s. If either md5 differs, reject the source shortcut. + +Observed: + +```text +opt-in MoE md5 8cb0ce23777bf55f92f63d0292c756b0 +opt-in dense md5 5951a5b4d624ce891e22ab5fca9bc439 +``` + +- [x] **Step 2: Run opt-in backend-op gate** + +Run: + +```bash +LLAMA_BF16_CUBLAS_F32_OUT=1 ./test-backend-ops test -o MUL_MAT +``` + +Expected: `1146/1146`. + +Observed: `1146/1146`. 
+ +--- + +### Task 4: Benchmark if Correct + +- [x] **Step 1: Run same-shape prefill A/B** + +Only if Task 3 passes, run baseline and opt-in: + +```bash +./llama-batched-bench -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf \ + -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32 +``` + +with and without `LLAMA_BF16_CUBLAS_F32_OUT=1`. + +Observed MoE A/B: + +| npp | default S_PP | opt-in S_PP | change | +|-----|-------------:|------------:|-------:| +| `512` | `2347.41` | `2402.34` | `+2.34%` | +| `2048` | `2440.18` | `2456.54` | `+0.67%` | + +- [x] **Step 2: Profile opt-in if A/B improves** + +Use nsys kernel summary to verify the BF16-to-F32 conversion rows shrink. + +Observed opt-in `npp=512` profile: + +| row | value | +|-----|------:| +| total GPU kernel time | `7020867757 ns` | +| `convert_unary<__nv_bfloat16, float>` | `0 ns`, `0` instances | +| `convert_unary` | `159651026 ns`, `6840` instances, `2.27%` | + +- [x] **Step 3: Source decision** + +Keep the patch only if opt-in gates pass and it produces a material speedup. +Otherwise revert the source patch locally and record the rejection. + +Decision: keep as a default-off opt-in path. It is correctness-clean and removes +the profiled BF16-to-F32 conversion row for this shape, but the speedup is small +and needs dense plus serving A/B before any default-on decision. + +--- + +### Task 5: Commit and Record + +- [x] **Step 1: Commit source only if accepted as default-off diagnostic or opt-in** + +```bash +git -C /home/mudler/_git/llama.cpp add ggml/src/ggml-cuda/ggml-cuda.cu +git -C /home/mudler/_git/llama.cpp commit -m "feat(cuda): gate BF16 cuBLAS F32 output" -m "Assisted-by: Codex:gpt-5" +``` + +Result: + +- Local fork: `ea0875d14 feat(cuda): gate BF16 cuBLAS F32 output` +- DGX mirror: `14fd69f1e feat(cuda): gate BF16 cuBLAS F32 output` + +- [x] **Step 2: Record LocalAI docs** + +Record artifact path, gates, A/B result, and decision. 
+ +- [x] **Step 3: Commit LocalAI docs** + +```bash +git add -f docs/superpowers/plans/2026-07-01-bf16-cublas-f32-output-phase67.md +git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md +git commit -m "docs(paged): record BF16 cuBLAS F32 output phase" \ + -m "Assisted-by: Codex:gpt-5" +``` diff --git a/docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md b/docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md new file mode 100644 index 000000000000..592f9da54d69 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md @@ -0,0 +1,156 @@ +# BF16 F32 Output Broader Serving Phase70 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** decide whether `LLAMA_BF16_CUBLAS_F32_OUT=1` has enough broader serving evidence to move beyond default-off opt-in status. + +**Architecture:** Do not change source. Reuse the Phase67 DGX mirror and binary, bracket the benchmark with canonical inference gates, then run same-window llama.cpp default, llama.cpp opt-in, and vLLM serving arms across multiple concurrencies. + +**Tech Stack:** llama.cpp CUDA backend, DGX GB10, `llama-server`, vLLM 0.23.0, `h2h_cli3.py`, LocalAI parity docs. + +--- + +## Guardrails + +- Do not change llama.cpp source in Phase70. +- Do not regenerate LocalAI generated patches. +- Do not push any repository. +- Confirm Docker `0`, `local-ai-worker` `0`, and GPU compute apps `0` before taking the DGX lock. +- Bracket serving with md5/op gates so inferencing safety is explicit. 
+- Keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off unless broad serving is consistently flat-to-positive with gates green. + +## Files + +- Create: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md` +- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` +- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` + +--- + +### Task 1: DGX Preflight And Gates + +- [x] **Step 1: Confirm DGX idle** + +Run: + +```bash +ssh dgx.casa 'set -e; cat /tmp/localai-gb10.lock 2>/dev/null || true; docker ps -q | wc -l; (pgrep -af "[l]ocal-ai-worker" || true) | wc -l; nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l' +``` + +Expected: + +```text +FREE... +0 +0 +0 +``` + +- [x] **Step 2: Run pre gates** + +Run canonical gates with default env and opt-in completion env: + +```bash +ssh dgx.casa 'ART=$HOME/bench/phase70_bf16_broader_serving//gate_pre_default OPS=MUL_MAT,MUL_MAT_ID ~/paged-inference-gates.sh' +ssh dgx.casa 'ART=$HOME/bench/phase70_bf16_broader_serving//gate_pre_optin OPS=MUL_MAT EXTRA_ENV="LLAMA_BF16_CUBLAS_F32_OUT=1" ~/paged-inference-gates.sh' +``` + +Expected: + +- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0` +- dense md5 `5951a5b4d624ce891e22ab5fca9bc439` +- op gates green. + +Result: + +- Artifact: `/home/mudler/bench/phase70_bf16_broader_serving/20260701_151500` +- Default pre gates: MoE/dense md5 matched, `MUL_MAT 1146/1146`, + `MUL_MAT_ID 806/806`. +- Opt-in pre gates: MoE/dense md5 matched, `MUL_MAT 1146/1146`. 
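The pre/post gate comparison amounts to matching observed values against the canonical ones. The following is a minimal sketch; `check_gates` and the dict layout are hypothetical and do not reflect the output format of `paged-inference-gates.sh`.

```python
CANONICAL = {
    "moe_md5": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense_md5": "5951a5b4d624ce891e22ab5fca9bc439",
}


def check_gates(observed, op_results):
    """Return a list of failure strings; an empty list means the gate is green.

    `observed` maps gate name -> md5, `op_results` maps op name -> "passed/total".
    """
    failures = []
    for name, expected in CANONICAL.items():
        if observed.get(name) != expected:
            failures.append(f"{name}: expected {expected}, got {observed.get(name)}")
    for op, result in op_results.items():
        passed, total = (int(x) for x in result.split("/"))
        if passed != total:
            failures.append(f"{op}: {passed}/{total}")
    return failures


# Default pre-gate values recorded above.
failures = check_gates(
    {"moe_md5": "8cb0ce23777bf55f92f63d0292c756b0",
     "dense_md5": "5951a5b4d624ce891e22ab5fca9bc439"},
    {"MUL_MAT": "1146/1146", "MUL_MAT_ID": "806/806"},
)
```

Running the same check before and after the serving arms gives one explicit pass or fail signal for each side of the bracket.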
+ +### Task 2: Same-Window Serving Snapshot + +- [x] **Step 1: Acquire lock** + +Use both active lock conventions: + +```bash +ssh dgx.casa 'mkdir -p ~/gpu_bench_lock; echo "codex-phase70-bf16-broader-serving $(date +%s)" > ~/gpu_bench_lock/owner; printf "codex-phase70-bf16-broader-serving %s\n" "$(date +%s)" > /tmp/localai-gb10.lock' +``` + +- [x] **Step 2: Run three serving arms** + +Run: + +- llama.cpp default +- llama.cpp with `LLAMA_BF16_CUBLAS_F32_OUT=1` +- vLLM + +Shape: + +```text +model=MoE q36-35b-a3b-nvfp4 +NPL=8 32 128 +PTOK=128 +GEN=64 +PARALLEL=128 +CTX=131072 +``` + +- [x] **Step 3: Release lock** + +Run: + +```bash +ssh dgx.casa 'echo "FREE released-by-codex-phase70-bf16-broader-serving $(date +%s)" > ~/gpu_bench_lock/owner; printf "FREE released-by-codex-phase70-bf16-broader-serving %s\n" "$(date +%s)" > /tmp/localai-gb10.lock' +``` + +### Task 3: Post Gates And Decision + +- [x] **Step 1: Run post gates** + +Repeat default and opt-in gates after serving. + +- [x] **Step 2: Summarize metrics** + +Capture for each `N`: + +- default vs opt-in aggregate throughput +- default vs opt-in decode aggregate throughput +- default vs opt-in TTFT +- opt-in vs vLLM decode and aggregate ratios + +- [x] **Step 3: Decision** + +Keep default-off if any concurrency materially regresses or if the result is mixed. Consider default-on only if all concurrencies are flat-to-positive, post gates are green, and the opt-in does not widen the vLLM parity gap. + +Result summary: + +| n | default agg | opt-in agg | opt/default agg | default decode | opt-in decode | opt/default decode | +|---:|------------:|-----------:|----------------:|---------------:|--------------:|-------------------:| +| `8` | `178.5` | `158.8` | `0.8896` | `242.6` | `218.3` | `0.8998` | +| `32` | `250.1` | `247.9` | `0.9912` | `418.7` | `417.6` | `0.9974` | +| `128` | `322.5` | `324.8` | `1.0071` | `706.2` | `697.9` | `0.9882` | + +Decision: reject default-on. 
The opt-in materially regressed low-concurrency +serving and slightly widened the vLLM decode gap at `n=32` and `n=128`, despite +green gates. + +### Task 4: Record And Commit + +- [x] **Step 1: Update docs** + +Record artifact path, gates, serving table, ratio table, and decision. + +- [x] **Step 2: Commit docs** + +```bash +git add -f docs/superpowers/plans/2026-07-01-bf16-f32-output-broader-serving-phase70.md +git add backend/cpp/llama-cpp-localai-paged/docs/BENCHMARK.md \ + backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md +git commit -m "docs(paged): record BF16 F32 output broader serving phase" \ + -m "Assisted-by: Codex:gpt-5" +``` diff --git a/docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md b/docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md new file mode 100644 index 000000000000..20726375777e --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md @@ -0,0 +1,132 @@ +# BF16 F32 Output Dense Serving Phase68 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Decide whether `LLAMA_BF16_CUBLAS_F32_OUT=1` has enough dense and serving value to consider a default policy change. + +**Architecture:** Reuse the Phase67 source patch and DGX build. Run dense prefill A/B first because it is fast and directly targets BF16 projections. Run serving A/B only if dense or MoE evidence supports a broader default-on question. + +**Tech Stack:** llama.cpp CUDA backend, DGX GB10, `llama-batched-bench`, optional LocalAI serving snapshot harness, LocalAI parity docs. + +--- + +## Guardrails + +- Do not change source in Phase68. 
+- Do not make `LLAMA_BF16_CUBLAS_F32_OUT=1` default-on from MoE prefill alone. +- Keep DGX lock discipline: lock free, Docker `0`, `local-ai-worker` `0`, compute apps `0`. +- Keep existing md5/op gate evidence from Phase67 as the correctness basis for this exact source commit. +- Record no-go results as explicitly as wins. + +## Files + +- Create: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md` +- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` +- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` + +--- + +### Task 1: Dense Prefill A/B + +- [x] **Step 1: Confirm DGX idle and acquire lock** + +Run: + +```bash +ssh dgx.casa 'cat /tmp/localai-gb10.lock 2>/dev/null || true; docker ps --format "{{.Names}}" | wc -l; (pgrep -af "[l]ocal-ai-worker" || true) | wc -l; nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv,noheader | wc -l' +ssh dgx.casa 'printf "codex-phase68-bf16-dense-serving %s\n" "$(date +%s)" > /tmp/localai-gb10.lock' +``` + +- [x] **Step 2: Run dense prefill default and opt-in** + +Run: + +```bash +./llama-batched-bench -m /home/mudler/bench/q36-27b-nvfp4.gguf \ + -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32 +``` + +with and without `LLAMA_BF16_CUBLAS_F32_OUT=1`. 
+ +- [x] **Step 3: Dense decision** + +Dense improved slightly in the same window and did not regress: + +| npp | default S_PP | opt-in S_PP | change | +|-----|-------------:|------------:|-------:| +| `512` | `973.13` | `975.52` | `+0.25%` | +| `2048` | `1019.88` | `1021.39` | `+0.15%` | + +Decision: run a small MoE serving A/B because Phase67 MoE prefill was positive +and dense did not regress. The dense win is too small to justify default-on by +itself. + +--- + +### Task 2: Serving A/B If Funded + +- [x] **Step 1: Run a small same-window serving A/B** + +Use the current clean source tree and the existing h2h client or snapshot harness. +Compare default versus: + +```bash +LLAMA_BF16_CUBLAS_F32_OUT=1 +``` + +At minimum capture MoE `N=128`, prompt `128`, generation `128` aggregate, +decode aggregate, mean TTFT, wall time, and md5 gate summary. + +- [x] **Step 2: Serving decision** + +Keep default-off unless serving improves or is flat without dense regression. +Do not default-on from prefill-only evidence. + +Serving artifact: + +- `/home/mudler/bench/phase68_bf16_dense_serving/20260701_145710/serving_ab_20260701_150249` + +MoE serving A/B, `N=128`, prompt `128`, generation `128`, `--parallel 128`: + +| metric | default | opt-in | change | +|--------|--------:|-------:|-------:| +| `agg_tps` | `409.8` | `415.0` | `+1.27%` | +| `decode_agg_tps` | `615.3` | `627.2` | `+1.93%` | +| `decode_perseq_tps` | `4.15` | `4.16` | `+0.24%` | +| `prefill_tps` | `1630.2` | `1648.0` | `+1.09%` | +| `ttft_mean_ms` | `8574.7` | `8085.9` | `-5.70%` | +| `wall_s` | `39.978` | `39.480` | `-1.25%` | + +Decision: keep `LLAMA_BF16_CUBLAS_F32_OUT=1` default-off but promoted as a +safe opt-in shortcut candidate. It now has Phase67 MoE md5/op gates, Phase67 +dense md5/op gates, a tiny positive dense prefill result, and a positive small +MoE serving A/B. Do not make it default-on until it is patch-series mirrored and +retested in a broader serving snapshot. 
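The `change` column in the MoE serving A/B table above can be re-derived from the default and opt-in columns. The following is a minimal sketch, assuming the change is the plain relative difference `(opt-in - default) / default`; the dictionary and function names are illustrative and are not part of the h2h client or the snapshot harness.

```python
# Values copied from the Phase68 MoE serving A/B table: (default, opt-in).
serving_ab = {
    "agg_tps": (409.8, 415.0),
    "decode_agg_tps": (615.3, 627.2),
    "decode_perseq_tps": (4.15, 4.16),
    "prefill_tps": (1630.2, 1648.0),
    "ttft_mean_ms": (8574.7, 8085.9),
    "wall_s": (39.978, 39.480),
}


def pct_change(default, opt_in):
    """Relative change of the opt-in arm versus the default arm, in percent."""
    return (opt_in - default) / default * 100.0


# Hypothetical helper output: one rounded percentage per metric.
changes = {name: round(pct_change(d, o), 2) for name, (d, o) in serving_ab.items()}

for name, change in changes.items():
    print(f"{name:18s} {change:+.2f}%")
```

For throughput metrics a positive value favours the opt-in; for `ttft_mean_ms` and `wall_s` a negative value does, so the sign has to be read per metric before drawing a default-on conclusion.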
+ +--- + +### Task 3: Record and Commit + +- [x] **Step 1: Release DGX lock** + +Run: + +```bash +ssh dgx.casa 'printf "FREE released-by-codex-phase68-bf16-dense-serving %s\n" "$(date +%s)" > /tmp/localai-gb10.lock' +``` + +- [x] **Step 2: Record docs** + +Record artifact path, dense A/B, serving A/B if run, and decision. + +- [x] **Step 3: Commit LocalAI docs** + +```bash +git add -f docs/superpowers/plans/2026-07-01-bf16-f32-output-dense-serving-phase68.md +git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md +git commit -m "docs(paged): record BF16 F32 output dense serving phase" \ + -m "Assisted-by: Codex:gpt-5" +``` diff --git a/docs/superpowers/plans/2026-07-01-cublas-name-trace-phase37.md b/docs/superpowers/plans/2026-07-01-cublas-name-trace-phase37.md new file mode 100644 index 000000000000..830f17c18ae7 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-cublas-name-trace-phase37.md @@ -0,0 +1,69 @@ +# Phase 37: cuBLAS Tensor-Name Trace + +**Status:** DONE. + +**Scope:** additive follow-up to patch `0062`. Extend the default-off +`LLAMA_CUBLAS_ROUTE_TRACE=` diagnostic with `src0`, `src1`, and `dst` tensor +names. No route or numeric behavior change. + +## Checklist + +- [x] Add RED/GREEN helper coverage for cuBLAS tensor-name trace fields. +- [x] Wire tensor names from the generic cuBLAS path. +- [x] Build CUDA targets on DGX. +- [x] Run md5 gates with trace off and trace on. +- [x] Run backend op gates with trace off and trace on. +- [x] Capture n128 serving name trace. +- [x] Run post-serving md5/op gates. +- [x] Commit fork and DGX mirror, export LocalAI patch `0063`. + +## Result + +Artifact: `/home/mudler/bench/phase37_cublas_name_trace/20260701_083227`. 
+ +- Local fork commit: `2d590d770 feat(cuda): trace cublas tensor names` +- DGX mirror commit: `2cbb61969 feat(cuda): trace cublas tensor names` +- Local/DGX tree after Phase 37: `dedb1182910eafe9f6875588dc8285bfb544cce5` +- LocalAI patch: `backend/cpp/llama-cpp-localai-paged/patches/paged/0063-feat-cuda-trace-cublas-tensor-names.patch` + +## Gates + +| check | status | actual | +|-------|--------|--------| +| default-off MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| default-off dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| trace-enabled MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| trace-enabled dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| post-serving MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| post-serving dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| `MUL_MAT` | ok | `1146/1146` default, trace, post-serving | +| `MUL_MAT_ID` | ok | `806/806` default, trace, post-serving | + +## Serving Trace + +`LLAMA_CUBLAS_ROUTE_TRACE=4096`, n128 MoE serving: + +| cuBLAS route | count | +|--------------|------:| +| `bf16_tc` | 2884 | +| `sgemm` | 1212 | + +Top named entries were per-layer projections: + +- `bf16_tc type=30 src0=blk.N.attn_gate.weight src1=attn_norm-N dst=z-N` +- `bf16_tc type=30 src0=blk.N.ssm_out.weight src1=final_output-N dst=linear_attn_out-N` +- `sgemm type=0 src0=blk.N.ffn_gate_inp.weight src1=attn_post_norm-N dst=ffn_moe_logits-N` +- `sgemm type=0 src0=blk.N.ffn_gate_inp_shexp.weight src1=attn_post_norm-N dst=shared_expert_gate-N` + +The traced serving run is diagnostic only; stderr tracing still depresses +throughput and can create client-window disconnects. Post-serving md5/op gates +remained green. + +## Decision + +- The Phase 36 F32 SGEMM bucket is not an opaque missed projection. It is mostly + MoE gating and shared-expert gate projection tensors whose weights are F32. +- The next route-policy phase should not blindly force these to BF16. 
First + inspect model-load tensor types for `ffn_gate_inp*` and decide whether a + weight-conversion or graph-build route change is precision-safe. Any change + needs md5/op gates and, if tensor type conversion is involved, KL validation. diff --git a/docs/superpowers/plans/2026-07-01-cublas-route-trace-phase36.md b/docs/superpowers/plans/2026-07-01-cublas-route-trace-phase36.md new file mode 100644 index 000000000000..6c651af7dc75 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-cublas-route-trace-phase36.md @@ -0,0 +1,72 @@ +# Phase 36: cuBLAS Route Trace + +**Status:** DONE. + +**Scope:** llama.cpp fork first, then LocalAI patch `0062`. Instrumentation only; +no route, branch, or numeric behavior change. + +## Checklist + +- [x] Add RED/GREEN helper tests for cuBLAS subroute classification. +- [x] Add default-off `LLAMA_CUBLAS_ROUTE_TRACE=` around generic cuBLAS + `MUL_MAT` dispatch. +- [x] Build CUDA targets on DGX. +- [x] Run md5 gates with trace off and trace on. +- [x] Run backend op gates with trace off and trace on. +- [x] Capture n128 serving route distribution. +- [x] Run post-serving md5/op gates. +- [x] Commit fork and DGX mirror, export LocalAI patch `0062`. + +## Result + +Artifact: `/home/mudler/bench/phase36_cublas_route_trace/20260701_081228`. 
+ +- Local fork commit: `38c4ef2e4 feat(cuda): trace cublas routes` +- DGX mirror commit: `e0224393a feat(cuda): trace cublas routes` +- Local/DGX tree after Phase 36: `208189d119efe27477f1900cc6f7428bd1720449` +- LocalAI patch: `backend/cpp/llama-cpp-localai-paged/patches/paged/0062-feat-cuda-trace-cublas-routes.patch` + +## Gates + +| check | status | actual | +|-------|--------|--------| +| default-off MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| default-off dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| trace-enabled MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| trace-enabled dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| post-serving MoE md5 | ok | `8cb0ce23777bf55f92f63d0292c756b0` | +| post-serving dense md5 | ok | `5951a5b4d624ce891e22ab5fca9bc439` | +| `MUL_MAT` | ok | `1146/1146` default, trace, post-serving | +| `MUL_MAT_ID` | ok | `806/806` default, trace, post-serving | + +## Serving Trace + +`LLAMA_CUBLAS_ROUTE_TRACE=8192`, n128 MoE serving: + +| cuBLAS route | count | +|--------------|------:| +| `bf16_tc` | 5681 | +| `sgemm` | 2511 | + +Top shapes: + +| route | shape | count | +|-------|-------|------:| +| `bf16_tc` | `type=30 row_diff=32 src1_ncols=510 ne00=2048 ne10=2048` | 360 | +| `bf16_tc` | `type=30 row_diff=8192 src1_ncols=510 ne00=2048 ne10=2048` | 240 | +| `bf16_tc` | `type=30 row_diff=2048 src1_ncols=510 ne00=4096 ne10=4096` | 240 | +| `sgemm` | `type=0 row_diff=256 src1_ncols=510 ne00=2048 ne10=2048` | 240 | +| `sgemm` | `type=0 row_diff=1 src1_ncols=510 ne00=2048 ne10=2048` | 240 | + +The traced serving run is diagnostic only: heavy stderr tracing depressed +throughput and the client window reported disconnects at shutdown. The +post-serving md5/op gates above stayed green. + +## Decision + +- Generic cuBLAS serving calls are BF16 tensor-core and F32 SGEMM; the measured + route does not show NVFP4 cuBLAS or batched cuBLAS as the next bucket. 
+- The next projection phase should investigate why the F32 SGEMM shapes remain + `type=0` and whether they are expected glue/projection tensors or a missed + BF16 route. Any route-policy change must be separately gated by the same md5 + and `test-backend-ops` checks before benchmarking. diff --git a/docs/superpowers/plans/2026-07-01-current-serving-harness-phase21.md b/docs/superpowers/plans/2026-07-01-current-serving-harness-phase21.md new file mode 100644 index 000000000000..c4e035a86d19 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-current-serving-harness-phase21.md @@ -0,0 +1,107 @@ +# Current Serving Harness Phase 21 Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use +> superpowers:verification-before-completion before recording the phase result. +> Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** make the Phase 20 current-stack paged-vs-vLLM serving snapshot +repeatable from the LocalAI backend tree. + +**Architecture:** add a standalone shell harness beside the existing paged +inference gate and MTP serving harness. The script targets the clean +`~/llama-phase6-source` mirror, uses the owner-file GPU lock, runs pre/post +inference gates, compares paged and vLLM in one session, and writes ratio +summaries. + +**Tech Stack:** Bash, llama.cpp `llama-server`, vLLM, `h2h_cli3.py`, DGX GB10. + +--- + +## Task 1: Red Check + +- [x] **Step 1: Prove no reusable current-stack harness exists** + + Command: + + ```bash + test -e backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh + ``` + + Result: + + - exited `1` before the patch, as expected. 
+ +## Task 2: Add Harness + +- [x] **Step 1: Create script** + + File: + + - `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + + Features: + + - defaults to `~/llama-phase6-source`, not stale `~/llama-paged-dev`; + - checks docker, `local-ai-worker`, GPU compute processes, and owner-file lock; + - builds `llama-server`, `llama-completion`, and `test-backend-ops`; + - runs pre/post `paged-inference-gates.sh`; + - runs paged and vLLM serving arms with the same h2h client; + - writes `summary.tsv` with paged/vLLM ratios; + - supports `DRY_RUN=1` for path/preflight validation without servers. + +## Task 3: Verify Harness + +- [x] **Step 1: Local syntax/help checks** + + Commands: + + ```bash + test -x backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh + bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh + backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help + ``` + + Result: + + - all passed. + +- [x] **Step 2: DGX dry run** + + Command: + + ```bash + DRY_RUN=1 ART=~/bench/phase21_harness_dryrun/20260701_051757 \ + /tmp/paged-current-serving-snapshot.sh + ``` + + Result: + + - verified `docker=0`, `local_ai_worker=0`, `compute=0`; + - verified owner file was free; + - found current source `f2521ab12`; + - validated required paths and printed the build/paged/vLLM commands without + launching servers. + + Artifact: + + - `/home/mudler/bench/phase21_harness_dryrun/20260701_051757` + +## Task 4: Future Use + +- [x] **Step 1: Prefer this harness for current snapshots** + + Use this script for future current-stack GB10 parity snapshots: + + ```bash + backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh + ``` + + Do not use the stale DGX `~/bench/combined_definitive.sh` without first + porting it to the clean mirror and owner-file lock discipline. + +## Self-Review + +- No llama.cpp source behavior changed. 
+- The harness is repeatable and defaults to the current clean mirror. +- The dry run covered path validation and DGX preflight without consuming GPU + benchmark time. diff --git a/docs/superpowers/plans/2026-07-01-current-stack-serving-snapshot-phase20.md b/docs/superpowers/plans/2026-07-01-current-stack-serving-snapshot-phase20.md new file mode 100644 index 000000000000..20598110b639 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-current-stack-serving-snapshot-phase20.md @@ -0,0 +1,123 @@ +# Current Stack Serving Snapshot Phase 20 Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use +> superpowers:verification-before-completion before recording the phase result. +> Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** refresh the MoE paged-vs-vLLM serving baseline on the current clean +llama.cpp stack after the MTP investigation. + +**Architecture:** benchmark only. Run the current DGX mirror +`~/llama-phase6-source` against vLLM in the same lock window with the same h2h +client, then run canonical pre/post inference gates. Do not change source. + +**Tech Stack:** llama.cpp `llama-server`, vLLM `0.23.0`, DGX GB10, +`h2h_cli3.py`, LocalAI paged patch stack. 
+ +--- + +## Task 1: Run Current-Stack Snapshot + +- [x] **Step 1: Confirm DGX is free** + + Preflight passed: + + - `docker=0` + - `local_ai_worker=0` + - `compute=0` + +- [x] **Step 2: Build current mirror targets** + + Source: + + - `/home/mudler/llama-phase6-source` + - HEAD: `f2521ab12 feat(server): trace speculative batch shapes` + + Build: + + ```bash + cmake --build ~/llama-phase6-source/build-cuda \ + --target llama-server llama-completion test-backend-ops -j8 + ``` + +- [x] **Step 3: Run paged and vLLM serving arms** + + Artifact: + + - `/home/mudler/bench/phase20_current_snapshot/20260701_050621` + + Workload: + + - MoE Qwen3.6-35B-A3B-NVFP4 + - `NPL=8,32,128` + - `PTOK=128` + - `GEN=64` + - h2h OpenAI completions client with fresh nonces + +## Task 2: Verify Inference Gates + +- [x] **Step 1: Pre-gate passed** + + Artifact: + + - `/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_pre` + + Result: + + - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0` + - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439` + - `MUL_MAT_ID`: `806/806` + +- [x] **Step 2: Post-gate passed** + + Artifact: + + - `/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_post` + + Result: + + - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0` + - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439` + - `MUL_MAT_ID`: `806/806` + +## Task 3: Snapshot Result + +- [x] **Step 1: Compare serving throughput** + + | n | paged decode_agg | vLLM decode_agg | paged/vLLM decode | paged agg | vLLM agg | paged/vLLM agg | + |---|------------------|-----------------|-------------------|-----------|----------|----------------| + | 8 | 220.8 | 290.5 | 76.0% | 164.8 | 245.5 | 67.1% | + | 32 | 411.1 | 594.7 | 69.1% | 252.1 | 456.0 | 55.3% | + | 128 | 670.0 | 1022.7 | 65.5% | 322.4 | 662.4 | 48.7% | + +- [x] **Step 2: Compare latency and prefill** + + | n | paged TTFT ms | vLLM TTFT ms | paged/vLLM TTFT | paged prefill_tps | vLLM prefill_tps | + 
|---|---------------|--------------|------------------|--------------------|------------------| + | 8 | 783.6 | 271.8 | 2.88x | 1669.9 | 4371.5 | + | 32 | 2630.6 | 783.8 | 3.36x | 1712.8 | 5358.3 | + | 128 | 7678.7 | 2465.7 | 3.11x | 1660.4 | 5242.9 | + + The current stack remains far from vLLM serving parity in e2e/TTFT because + prefill is still much slower. + +## Task 4: Decision + +- [x] **Step 1: Keep GB10 shortcut closure** + + This snapshot confirms the Phase 19 direction: + + - MTP and scheduling shortcuts should stay closed. + - Current paged serving is still below vLLM on MoE serving throughput. + - The largest user-visible gap is prefill/TTFT: vLLM is roughly 2.6-3.2x + faster on prefill throughput and 2.9-3.4x faster on TTFT in this short snapshot. + - The next credible parity path is not another small GB10 server shortcut; it + is either a new-silicon rerun on datacenter Blackwell or a larger fused + kernel project outside the low-conflict patch stack. + +## Self-Review + +- No source behavior changed. +- Pre/post inference gates passed. +- The result uses the current clean mirror, not the stale `llama-paged-dev` + benchmark tree. diff --git a/docs/superpowers/plans/2026-07-01-dense-admission-trace-phase52.md b/docs/superpowers/plans/2026-07-01-dense-admission-trace-phase52.md new file mode 100644 index 000000000000..81ccea8deb1b --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-dense-admission-trace-phase52.md @@ -0,0 +1,105 @@ +# Phase52 Dense Admission Trace Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Use the Phase51 `LLAMA_SERVING_TRACE=1` fork patch to capture dense `n=128` llama-server admission counters and determine whether high-N serving loss is scheduler/admission-driven. 
+ +**Architecture:** Temporarily apply the Phase51 fork patch to the clean DGX mirror, build the patched server, bracket the traced serving run with canonical md5/op gates, run one dense `n=128`, `ptok=128`, `gen=64` h2h workload, parse the aggregate trace, then revert the DGX mirror. + +**Tech Stack:** DGX GB10, `~/llama-phase6-source/build-cuda`, `h2h_cli3.py`, `paged-inference-gates.sh`, LocalAI parity docs. + +--- + +### Task 1: Prepare patched DGX build + +**Files:** +- DGX artifact: `/home/mudler/bench/phase52_dense_admission_trace/20260701_111017` + +- [x] **Step 1: Check DGX preflight** + +Observed before applying the patch: docker `0`, `local-ai-worker` `0`, +compute `0`, owner `FREE released-by-codex-phase50-dense-true-decode +1782895927`. + +- [x] **Step 2: Apply Phase51 patch and build** + +Applied `/tmp/phase51-serving-admission-trace.patch` to +`~/llama-phase6-source`. Built `llama-server`, `llama-completion`, and +`test-backend-ops` in `build-cuda`. + +### Task 2: Gate before trace + +- [x] **Step 1: Run canonical pre-trace inference gate** + +Observed: + +- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0` +- dense md5 `5951a5b4d624ce891e22ab5fca9bc439` +- `MUL_MAT` `1146/1146` +- `MUL_MAT_ID` `806/806` + +### Task 3: Run dense admission trace + +- [x] **Step 1: Run warm trace** + +First trace included warmup and was kept only as a secondary artifact: +`paged/`. Because `started_prompt_slots=136`, it combined warmup `n=8` and the +target `n=128` request. + +- [x] **Step 2: Run clean `n=128` trace** + +Clean artifact: `paged_clean/`. 
+ +H2H row: + +```json +{"n": 128, "reqs": 128, "gen_total": 8192, "prompt_tok_total": 22785, "gen_per_req": 64.0, "agg_tps": 139.0, "decode_agg_tps": 360.5, "decode_perseq_tps": 1.93, "prefill_tps": 629.5, "ttft_mean_ms": 23171.5, "ttft_max_ms": 36195.3, "wall_s": 58.921} +``` + +Trace row: + +```text +serving admission trace: steps=76 decode_only_steps=0 decode_tokens=8064 prompt_tokens=22785 waiting_prompt_slots=267 max_waiting_prompt_slots=35 started_prompt_slots=128 continued_prompt_slots=139 last_n_batch=2048 last_n_ubatch=512 last_prefill_budget_step=0 last_prefill_cap_per_slot=0 +``` + +Parsed summary: `phase52_summary.json`. + +### Task 4: Gate after trace and clean DGX + +- [x] **Step 1: Run canonical post-trace inference gate** + +Observed: + +- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0` +- dense md5 `5951a5b4d624ce891e22ab5fca9bc439` +- `MUL_MAT` `1146/1146` +- `MUL_MAT_ID` `806/806` + +- [x] **Step 2: Revert temporary DGX patch** + +Reverted `/tmp/phase51-serving-admission-trace.patch` from +`~/llama-phase6-source`. Final DGX state: docker `0`, `local-ai-worker` `0`, +compute `0`, owner `FREE released-by-codex-phase52-dense-admission-trace-clean +1782897309`. + +### Task 5: Record decision + +- [x] **Step 1: Update parity docs** + +Record Phase52 artifact and interpretation: + +- Prompt tokens admitted by the server trace exactly match h2h + `prompt_tok_total`, so the trace maps to the target request. +- `decode_only_steps=0`, so the default scheduler never emits pure decode steps + for this dense high-N serving shape. +- Prompt admission happens in `76` scheduler steps, averaging `299.8` prompt + tokens and `106.11` decode tokens per step, with up to `35` waiting prompt + slots. +- `prefill_budget_step=0` and `prefill_cap_per_slot=0` confirm stock + n-batch-only prompt admission was used. +- Next candidate should be an A/B of a small, default-off admission policy or a + trace extension with per-step histograms, not another immediate kernel rewrite. 
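The per-step averages quoted in the interpretation above follow directly from the aggregate trace row. A minimal sketch of that parsing, assuming the `key=value` layout shown in the trace row; the regular expression and variable names are illustrative and not part of the Phase51 patch or `phase52_summary.json`:

```python
import re

# Aggregate trace row copied from the clean n=128 run.
trace_row = (
    "serving admission trace: steps=76 decode_only_steps=0 decode_tokens=8064 "
    "prompt_tokens=22785 waiting_prompt_slots=267 max_waiting_prompt_slots=35 "
    "started_prompt_slots=128 continued_prompt_slots=139 last_n_batch=2048 "
    "last_n_ubatch=512 last_prefill_budget_step=0 last_prefill_cap_per_slot=0"
)

# prompt_tok_total from the h2h result row for the same request.
h2h_prompt_tok_total = 22785

# Collect every integer counter of the form name=value.
counters = {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", trace_row)}

prompt_per_step = counters["prompt_tokens"] / counters["steps"]
decode_per_step = counters["decode_tokens"] / counters["steps"]
trace_matches_h2h = counters["prompt_tokens"] == h2h_prompt_tok_total

print(f"prompt tokens per step: {prompt_per_step:.1f}")
print(f"decode tokens per step: {decode_per_step:.2f}")
print(f"trace prompt tokens match h2h: {trace_matches_h2h}")
```

The same check is a quick way to reject a contaminated trace: the warm run reported `started_prompt_slots=136`, which would not line up with a single `n=128` request.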
+ +- [x] **Step 2: Commit LocalAI docs** + +Commit this plan and parity doc updates with `Assisted-by: Codex:gpt-5`. diff --git a/docs/superpowers/plans/2026-07-01-dense-serving-snapshot-phase47.md b/docs/superpowers/plans/2026-07-01-dense-serving-snapshot-phase47.md new file mode 100644 index 000000000000..38175c7d487d --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-dense-serving-snapshot-phase47.md @@ -0,0 +1,86 @@ +# Phase47 Dense Serving Snapshot Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Use the newly parameterized harness to collect an audited dense paged-vs-vLLM serving snapshot, without changing inference code. + +**Architecture:** Run `paged-current-serving-snapshot.sh` against the dense GGUF and dense vLLM model with `SERVED_MODEL_NAME=dense-q36`. Keep the standard pre/post paged inference gates and `MUL_MAT,MUL_MAT_ID` op checks. + +**Tech Stack:** Bash serving harness, DGX, LocalAI parity docs. + +--- + +### Task 1: Dry-run dense snapshot inputs + +**Files:** +- Test: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + +- [x] **Step 1: Run DGX dry-run** + +```bash +ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase47_dense_serving_dryrun/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin MODEL=$HOME/bench/q36-27b-nvfp4.gguf VLLM_MODEL=$HOME/bench/q36-27b-nvfp4-vllm SERVED_MODEL_NAME=dense-q36 ART=$ART NPL="1" PARALLEL=1 CTX=4096 PTOK=16 GEN=4 DRY_RUN=1 OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Expected: exit `0`, docker/local-ai-worker/GPU compute all zero, dense model paths validated, and `SERVED_MODEL_NAME=dense-q36` printed. 
+ +### Task 2: Run audited dense serving snapshot + +**Files:** +- Test: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + +- [x] **Step 1: Run full dense snapshot after Phase48 hardening** + +```bash +ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase47_dense_serving/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin MODEL=$HOME/bench/q36-27b-nvfp4.gguf VLLM_MODEL=$HOME/bench/q36-27b-nvfp4-vllm SERVED_MODEL_NAME=dense-q36 ART=$ART NPL="1 8 32 128" PARALLEL=128 CTX=131072 PTOK=128 GEN=64 OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Expected: full run exits `0`, pre/post gates are green, and `summary.tsv` contains paged-vs-vLLM ratios for `n=1/8/32/128`. + +First attempt status: incomplete at +`/home/mudler/bench/phase47_dense_serving/20260701_095151`. Pre-gates and the +paged arm completed, but vLLM startup exceeded the old fixed readiness budget +and produced no vLLM result JSONs. Retry only after Phase48 readiness hardening. + +Retry status: completed at +`/home/mudler/bench/phase47_dense_serving_retry/20260701_100811` after Phase48 +with `VLLM_READY_ATTEMPTS=700`. + +### Task 3: Record dense snapshot result + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` +- Modify: `docs/superpowers/plans/2026-07-01-dense-serving-snapshot-phase47.md` + +- [x] **Step 1: Summarize artifact outputs** + +Record the dry-run artifact, full snapshot artifact, pre/post md5/op gate status, and the ratio rows from `summary.tsv`. + +- [x] **Step 2: Mark completed plan items** + +Mark this plan's checkboxes complete only after the corresponding command or docs update has happened. 
+ +### Task 4: Commit + +**Files:** +- Commit Phase47 docs and plan changes. + +- [x] **Step 1: Run final checks** + +```bash +git diff --check +git status --short +``` + +Expected: no whitespace errors; only intended docs/plan changes plus the pre-existing untracked `.claude/`. + +- [x] **Step 2: Commit** + +```bash +git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +git add -f docs/superpowers/plans/2026-07-01-dense-serving-snapshot-phase47.md +git commit -m "docs(paged): record dense serving snapshot" -m "Assisted-by: Codex:gpt-5" +``` diff --git a/docs/superpowers/plans/2026-07-01-dense-true-decode-phase50.md b/docs/superpowers/plans/2026-07-01-dense-true-decode-phase50.md new file mode 100644 index 000000000000..5e0e84445e31 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-dense-true-decode-phase50.md @@ -0,0 +1,415 @@ +# Phase50 Dense True Decode Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Measure dense Qwen3.6 true steady decode on GB10 for paged llama.cpp and vLLM, separated from h2h TTFT/prefill-overlap accounting, while proving inference output and touched backend ops remain unchanged before and after the run. + +**Architecture:** Do not change inference code. Run canonical pre/post paged inference gates, then collect graph-node-traced nsys profiles for dense paged llama.cpp and dense vLLM using the difference method: `ntg=64 - ntg=16` at the same `npl=128`, `npp=128` shape. Record the result in the parity docs and keep the next code target limited to scheduler/admission tracing only if true steady decode does not explain the Phase47 high-N serving gap. 
+ +**Tech Stack:** DGX GB10 over `ssh dgx.casa`, llama.cpp fork build in `~/llama-phase6-source/build-cuda` for `llama-batched-bench`, vLLM 0.23.0 in `~/vllm-bench`, `nsys --cuda-graph-trace=node`, LocalAI parity docs. + +--- + +### Task 1: Confirm DGX is idle and acquire an artifact directory + +**Files:** +- Create on DGX: `~/bench/phase50_dense_true_decode//preflight.txt` +- Create on DGX: `~/bench/phase50_dense_true_decode//hardware.txt` +- Create on DGX: `~/bench/phase50_dense_true_decode//run.log` +- Modify later: `docs/superpowers/plans/2026-07-01-dense-true-decode-phase50.md` + +- [x] **Step 1: Check the DGX preflight** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail +ART=$HOME/bench/phase50_dense_true_decode/$(date +%Y%m%d_%H%M%S) +mkdir -p "$ART" +{ + printf "docker="; docker ps -q | wc -l + printf "local_ai_worker="; docker ps --format "{{.Names}}" | grep -c local-ai-worker || true + printf "compute="; nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l + printf "owner="; if [ -f "$HOME/gpu_bench_lock/owner" ]; then cat "$HOME/gpu_bench_lock/owner"; else echo FREE-no-lock-file; fi +} | tee "$ART/preflight.txt" +nvidia-smi -L | tee "$ART/hardware.txt" +nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv,noheader | tee -a "$ART/hardware.txt" +echo "$ART"' +``` + +Expected: `docker=0`, `local_ai_worker=0`, `compute=0`, and `owner=FREE...`. + +- [x] **Step 2: Acquire the owner-file lock** + +Run with `ART` set to the printed artifact directory: + +```bash +ssh dgx.casa 'set -euo pipefail +ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP" +mkdir -p "$HOME/gpu_bench_lock" +echo "codex-phase50-dense-true-decode $(date +%s)" > "$HOME/gpu_bench_lock/owner" +cat "$HOME/gpu_bench_lock/owner" | tee -a "$ART/run.log"' +``` + +Expected: owner starts with `codex-phase50-dense-true-decode`. + +Actual artifact: `/home/mudler/bench/phase50_dense_true_decode/20260701_103120`. 
+Preflight was clean: docker `0`, `local-ai-worker` `0`, compute `0`, owner +`FREE released-by-codex-current-serving-snapshot 1782893824`. + +### Task 2: Run pre-profile inference gates + +**Files:** +- Create on DGX: `~/bench/phase50_dense_true_decode//gate_pre/` +- Create on DGX: `~/bench/phase50_dense_true_decode//gate_pre.log` + +- [x] **Step 1: Run the canonical paged gate helper** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail +ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP" +BIN="$HOME/llama-phase6-source/build-phase36/bin" \ +ART="$ART/gate_pre" \ +OPS=MUL_MAT,MUL_MAT_ID \ + "$HOME/paged-inference-gates.sh" 2>&1 | tee "$ART/gate_pre.log"' +``` + +Expected: + +```text +paged inference gates OK +``` + +Required values: +- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0` +- Dense md5: `5951a5b4d624ce891e22ab5fca9bc439` +- `MUL_MAT`: `1146/1146` +- `MUL_MAT_ID`: `806/806` + +Actual: `build-phase36` pre-gate passed, then `build-cuda` pre-gate also +passed because `build-phase36/bin` does not contain `llama-batched-bench`. +The profiled binary set is therefore `~/llama-phase6-source/build-cuda/bin`. 
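The required gate values above lend themselves to a mechanical comparison. This is only a sketch under stated assumptions: the `observed` dictionary and its key names are hypothetical and do not reflect the actual output format of `paged-inference-gates.sh`, which would need its own parsing step.

```python
# Required values copied from the Phase50 pre-profile gate step.
REQUIRED = {
    "moe_md5": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense_md5": "5951a5b4d624ce891e22ab5fca9bc439",
    "MUL_MAT": "1146/1146",
    "MUL_MAT_ID": "806/806",
}


def gate_failures(observed):
    """Return the names of gates whose observed value differs from the required one."""
    return [name for name, want in REQUIRED.items() if observed.get(name) != want]


# Hypothetical observation, as it might be hand-copied from a gate log.
observed = {
    "moe_md5": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense_md5": "5951a5b4d624ce891e22ab5fca9bc439",
    "MUL_MAT": "1146/1146",
    "MUL_MAT_ID": "806/806",
}

failures = gate_failures(observed)
print("paged inference gates OK" if not failures else f"gate mismatch: {failures}")
```

A missing key counts as a failure here, so a gate that silently did not run is not reported as green.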
+ +### Task 3: Profile dense paged llama.cpp true decode + +**Files:** +- Create on DGX: `~/bench/phase50_dense_true_decode//paged_dense_n128_ntg16.nsys-rep` +- Create on DGX: `~/bench/phase50_dense_true_decode//paged_dense_n128_ntg16.bench.log` +- Create on DGX: `~/bench/phase50_dense_true_decode//paged_dense_n128_ntg64.nsys-rep` +- Create on DGX: `~/bench/phase50_dense_true_decode//paged_dense_n128_ntg64.bench.log` + +- [x] **Step 1: Run ntg=16 graph-node profile** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail +ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP" +BIN="$HOME/llama-phase6-source/build-cuda/bin/llama-batched-bench" +MODEL="$HOME/bench/q36-27b-nvfp4.gguf" +REP="$ART/paged_dense_n128_ntg16" +rm -f "$REP.nsys-rep" "$REP.sqlite" +nsys profile --cuda-graph-trace=node --trace=cuda,nvtx --sample=none --cpuctxsw=none \ + --force-overwrite=true -o "$REP" \ + env LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 \ + "$BIN" -m "$MODEL" -c 131072 -b 2048 -ub 512 -ngl 99 -fa on \ + -npp 128 -ntg 16 -npl 128 > "$REP.bench.log" 2>&1 +grep -E "model|\\| *128|llama_perf|error|Error|Traceback" "$REP.bench.log" | tail -40' +``` + +Expected: command exits 0 and writes `paged_dense_n128_ntg16.nsys-rep`. + +Actual: `T_TG=5.754s`, `S_TG=355.93 t/s`; report +`paged_dense_n128_ntg16.nsys-rep` written. 
+ +- [x] **Step 2: Run ntg=64 graph-node profile** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail +ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP" +BIN="$HOME/llama-phase6-source/build-cuda/bin/llama-batched-bench" +MODEL="$HOME/bench/q36-27b-nvfp4.gguf" +REP="$ART/paged_dense_n128_ntg64" +rm -f "$REP.nsys-rep" "$REP.sqlite" +nsys profile --cuda-graph-trace=node --trace=cuda,nvtx --sample=none --cpuctxsw=none \ + --force-overwrite=true -o "$REP" \ + env LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 \ + "$BIN" -m "$MODEL" -c 131072 -b 2048 -ub 512 -ngl 99 -fa on \ + -npp 128 -ntg 64 -npl 128 > "$REP.bench.log" 2>&1 +grep -E "model|\\| *128|llama_perf|error|Error|Traceback" "$REP.bench.log" | tail -40' +``` + +Expected: command exits 0 and writes `paged_dense_n128_ntg64.nsys-rep`. + +Actual: `T_TG=21.768s`, `S_TG=376.33 t/s`; report +`paged_dense_n128_ntg64.nsys-rep` written. + +### Task 4: Profile dense vLLM true decode + +**Files:** +- Create on DGX: `~/bench/phase50_dense_true_decode//vllm_dense_decode_prof.py` +- Create on DGX: `~/bench/phase50_dense_true_decode//vllm_dense_n128_ntg16.nsys-rep` +- Create on DGX: `~/bench/phase50_dense_true_decode//vllm_dense_n128_ntg16.run.log` +- Create on DGX: `~/bench/phase50_dense_true_decode//vllm_dense_n128_ntg64.nsys-rep` +- Create on DGX: `~/bench/phase50_dense_true_decode//vllm_dense_n128_ntg64.run.log` + +- [x] **Step 1: Write the vLLM dense profile driver** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail +ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP" +cat > "$ART/vllm_dense_decode_prof.py" <<'"'"'PY'"'"' +import os, time, torch +os.environ["HF_HUB_OFFLINE"] = "1" +os.environ["VLLM_LOGGING_LEVEL"] = "WARNING" +os.environ["VLLM_ENABLE_V1_MULTIPROCESSING"] = "0" +from vllm import LLM, SamplingParams +from vllm.inputs import TokensPrompt + +MODEL = os.environ.get("MODEL", "/home/mudler/bench/q36-27b-nvfp4-vllm") +NSEQ = int(os.environ.get("NSEQ", "128")) 
+PROMPT_TOKS = int(os.environ.get("PT", "128")) +GEN = int(os.environ.get("GEN", "64")) + +llm = LLM( + model=MODEL, + enforce_eager=False, + max_model_len=4096, + gpu_memory_utilization=0.85, + max_num_seqs=256, + tensor_parallel_size=1, + enable_prefix_caching=False, + disable_log_stats=True, +) +prompts = [ + TokensPrompt(prompt_token_ids=[1000 + (i * 7 + j * 13) % 30000 for j in range(PROMPT_TOKS)]) + for i in range(NSEQ) +] +sp = SamplingParams(temperature=0.0, max_tokens=GEN, ignore_eos=True, min_tokens=GEN) +print(f"dense vLLM NSEQ={NSEQ} PT={PROMPT_TOKS} GEN={GEN} warmup...", flush=True) +llm.generate(prompts, sp, use_tqdm=False) +torch.cuda.synchronize() +print("PROFILED GENERATE START", flush=True) +torch.cuda.cudart().cudaProfilerStart() +t0 = time.time() +outs = llm.generate(prompts, sp, use_tqdm=False) +torch.cuda.synchronize() +t1 = time.time() +torch.cuda.cudart().cudaProfilerStop() +ntok = sum(len(o.outputs[0].token_ids) for o in outs) +print(f"PROFILED END seqs={len(outs)} gen_tok={ntok} wall={t1-t0:.3f}s tok/s={ntok/(t1-t0):.1f} incl_prefill", flush=True) +PY' +``` + +Expected: `vllm_dense_decode_prof.py` exists in the artifact directory. + +Actual: used an equivalent self-contained `python -c` target under nsys instead +of writing a DGX source script. No inference code or repo file was changed. 
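The driver's synthetic prompts are deterministic token-id lists, so no tokenizer or dataset is needed. The standalone sketch below reproduces only that generator (no vLLM import) to show that every id stays inside `[1000, 30999]` and that the 128 sequences all start with different tokens, so no two prompts share a prefix. Whether those ids are valid for the model's vocabulary is an assumption of the driver, not something this sketch verifies.

```python
NSEQ = 128         # number of sequences, as in the driver default
PROMPT_TOKS = 128  # prompt length per sequence


def synthetic_prompt(i, n_tokens=PROMPT_TOKS):
    """Same formula as the driver: 1000 + (i*7 + j*13) % 30000 for position j."""
    return [1000 + (i * 7 + j * 13) % 30000 for j in range(n_tokens)]


prompts = [synthetic_prompt(i) for i in range(NSEQ)]
all_ids = [tok for p in prompts for tok in p]
first_tokens = {p[0] for p in prompts}

print("min id:", min(all_ids), "max id:", max(all_ids))
print("distinct first tokens:", len(first_tokens), "of", NSEQ)
```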
+ +- [x] **Step 2: Run ntg=16 graph-node profile** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail +ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP" +REP="$ART/vllm_dense_n128_ntg16" +rm -f "$REP.nsys-rep" "$REP.sqlite" +PATH="$HOME/vllm-bench/bin:$PATH" HF_HUB_OFFLINE=1 NSEQ=128 PT=128 GEN=16 \ +nsys profile --cuda-graph-trace=node --capture-range=cudaProfilerApi --capture-range-end=stop \ + --trace=cuda --sample=none --cpuctxsw=none --force-overwrite=true -o "$REP" \ + "$HOME/vllm-bench/bin/python" "$ART/vllm_dense_decode_prof.py" > "$REP.run.log" 2>&1 +grep -E "PROFILED|Error|error|Traceback" "$REP.run.log" | tail -20' +``` + +Expected: command exits 0 and writes `vllm_dense_n128_ntg16.nsys-rep`. + +Actual: profiled generate `2048` tokens in `13.041s`; report +`vllm_dense_n128_ntg16.nsys-rep` written. + +- [x] **Step 3: Run ntg=64 graph-node profile** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail +ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP" +REP="$ART/vllm_dense_n128_ntg64" +rm -f "$REP.nsys-rep" "$REP.sqlite" +PATH="$HOME/vllm-bench/bin:$PATH" HF_HUB_OFFLINE=1 NSEQ=128 PT=128 GEN=64 \ +nsys profile --cuda-graph-trace=node --capture-range=cudaProfilerApi --capture-range-end=stop \ + --trace=cuda --sample=none --cpuctxsw=none --force-overwrite=true -o "$REP" \ + "$HOME/vllm-bench/bin/python" "$ART/vllm_dense_decode_prof.py" > "$REP.run.log" 2>&1 +grep -E "PROFILED|Error|error|Traceback" "$REP.run.log" | tail -20' +``` + +Expected: command exits 0 and writes `vllm_dense_n128_ntg64.nsys-rep`. + +Actual: profiled generate `8192` tokens in `27.165s`; report +`vllm_dense_n128_ntg64.nsys-rep` written. 
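Both profiled vLLM runs include prefill in their wall time (the driver tags its summary line `incl_prefill`), so dividing tokens by wall time understates decode speed. A minimal sketch, using only the two results recorded above, shows how subtracting the runs cancels the shared cost. It assumes prefill and other fixed overhead are the same in both runs.

```python
# Recorded results from the two profiled vLLM runs above.
tok16, wall16 = 2048, 13.041   # GEN=16: 128 seqs x 16 tokens
tok64, wall64 = 8192, 27.165   # GEN=64: 128 seqs x 64 tokens

naive16 = tok16 / wall16       # includes prefill time
naive64 = tok64 / wall64       # includes prefill time
# Difference method: the extra tokens divided by the extra wall time.
true_decode = (tok64 - tok16) / (wall64 - wall16)

print(f"naive GEN=16: {naive16:.2f} t/s")
print(f"naive GEN=64: {naive64:.2f} t/s")
print(f"difference-method decode: {true_decode:.2f} t/s")
```

The naive figures differ sharply between the two runs because the fixed cost is amortized over different token counts; the difference-method figure is the one comparable across engines.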
+ +### Task 5: Compute the difference-method summary + +**Files:** +- Create on DGX: `~/bench/phase50_dense_true_decode//summary.tsv` +- Create on DGX: `~/bench/phase50_dense_true_decode//profile_files.txt` + +- [x] **Step 1: Parse paged and vLLM throughput rows** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail +ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP" +python3 - "$ART" <<'"'"'PY'"'"' +import pathlib, re, sys +art = pathlib.Path(sys.argv[1]) + +def paged_ttg(name): + text = (art / f"{name}.bench.log").read_text(errors="replace") + rows = [line for line in text.splitlines() if "| 128 |" in line] + if not rows: + rows = [line for line in text.splitlines() if re.search(r"\|\s*128\s*\|", line)] + if not rows: + raise SystemExit(f"missing paged row in {name}.bench.log") + parts = [p.strip() for p in rows[-1].split("|") if p.strip()] + # columns: PP, TG, B, N_KV, T_PP, S_PP, T_TG, S_TG, T, S + return float(parts[6]), float(parts[7]) + +def vllm_wall(name): + text = (art / f"{name}.run.log").read_text(errors="replace") + m = re.search(r"PROFILED END seqs=(\d+) gen_tok=(\d+) wall=([0-9.]+)s", text) + if not m: + raise SystemExit(f"missing vLLM PROFILED END in {name}.run.log") + return int(m.group(1)), int(m.group(2)), float(m.group(3)) + +p16_ttg, p16_stg = paged_ttg("paged_dense_n128_ntg16") +p64_ttg, p64_stg = paged_ttg("paged_dense_n128_ntg64") +v16_seq, v16_tok, v16_wall = vllm_wall("vllm_dense_n128_ntg16") +v64_seq, v64_tok, v64_wall = vllm_wall("vllm_dense_n128_ntg64") +paged_delta_tokens = 128 * (64 - 16) +paged_delta_wall = p64_ttg - p16_ttg +vllm_delta_tokens = v64_tok - v16_tok +vllm_delta_wall = v64_wall - v16_wall +paged_decode = paged_delta_tokens / paged_delta_wall +vllm_decode = vllm_delta_tokens / vllm_delta_wall +with (art / "summary.tsv").open("w") as f: + f.write("engine\tshape\tntg16_wall_s\tntg64_wall_s\tdelta_tokens\tdelta_wall_s\ttrue_decode_tps\n") +
f.write(f"paged\tdense_n128_pt128\t{p16_ttg:.3f}\t{p64_ttg:.3f}\t{paged_delta_tokens}\t{paged_delta_wall:.3f}\t{paged_decode:.2f}\n") + f.write(f"vllm\tdense_n128_pt128\t{v16_wall:.3f}\t{v64_wall:.3f}\t{vllm_delta_tokens}\t{vllm_delta_wall:.3f}\t{vllm_decode:.2f}\n") + f.write(f"ratio\tpaged_over_vllm\t\t\t\t\t{paged_decode / vllm_decode:.4f}\n") +print((art / "summary.tsv").read_text()) +PY +ls -1 "$ART"/*.nsys-rep "$ART"/*.log > "$ART/profile_files.txt"' +``` + +Expected: `summary.tsv` contains `paged`, `vllm`, and `ratio` rows. + +Actual: + +| engine | ntg16 wall s | ntg64 wall s | delta tokens | delta wall s | true decode t/s | +|--------|--------------|--------------|--------------|--------------|-----------------| +| paged | `5.754` | `21.768` | `6144` | `16.014` | `383.66` | +| vLLM | `13.041` | `27.165` | `6144` | `14.124` | `435.00` | +| ratio | | | | | `0.8820` | + +### Task 6: Run post-profile inference gates and release DGX + +**Files:** +- Create on DGX: `~/bench/phase50_dense_true_decode//gate_post/` +- Create on DGX: `~/bench/phase50_dense_true_decode//gate_post.log` +- Modify later: `docs/superpowers/plans/2026-07-01-dense-true-decode-phase50.md` + +- [x] **Step 1: Run the canonical paged gate helper again** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail +ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP" +BIN="$HOME/llama-phase6-source/build-cuda/bin" \ +ART="$ART/gate_post" \ +OPS=MUL_MAT,MUL_MAT_ID \ + "$HOME/paged-inference-gates.sh" 2>&1 | tee "$ART/gate_post.log"' +``` + +Expected: + +```text +paged inference gates OK +``` + +Actual: `build-cuda` post-gate passed with MoE md5 +`8cb0ce23777bf55f92f63d0292c756b0`, dense md5 +`5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT` `1146/1146`, and +`MUL_MAT_ID` `806/806`. 
+ +- [x] **Step 2: Release the owner-file lock** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail +ART="$HOME/bench/phase50_dense_true_decode/REPLACE_WITH_TIMESTAMP" +echo "FREE released-by-codex-phase50-dense-true-decode $(date +%s)" > "$HOME/gpu_bench_lock/owner" +cat "$HOME/gpu_bench_lock/owner" | tee -a "$ART/run.log"' +``` + +Expected: owner starts with `FREE released-by-codex-phase50-dense-true-decode`. + +Actual: owner `FREE released-by-codex-phase50-dense-true-decode 1782895927`; +docker `0`, `local-ai-worker` `0`, compute `0`. + +### Task 7: Record the result and choose the next code target + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` +- Modify: `docs/superpowers/plans/2026-07-01-dense-true-decode-phase50.md` + +- [x] **Step 1: Mark completed plan steps** + +Update every completed checkbox in this file. Leave failed or skipped steps unchecked and add a short note with the artifact path and failure reason. + +- [x] **Step 2: Add the Phase50 result to the parity docs** + +Record: +- artifact directory +- preflight result +- pre/post gate md5 and op-count values +- paged true decode, vLLM true decode, and ratio from `summary.tsv` +- whether Phase47 high-N serving loss is a true GPU decode gap or mostly scheduler/accounting + +Actual: recorded the artifact, preflight, gates, true-decode table, and +decision in `GB10_PARITY_PHASE0_RESULTS.md`, `VLLM_PARITY_LEVER_MAP.md`, and +`PARITY_HANDOFF.md`. Interpretation: a real dense decode gap remains, but it is +about `12%`; the larger Phase47 high-N serving loss points at +scheduler/admission and prefill-overlap/accounting. 
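The "about `12%`" figure follows directly from the `summary.tsv` values. A worked check, using only numbers recorded in this plan:

```python
delta_tokens = 128 * (64 - 16)          # 6144 extra decode tokens per engine

paged_delta_wall = 16.014               # paged: ntg64 wall minus ntg16 wall
vllm_delta_wall = 14.124                # vLLM: ntg64 wall minus ntg16 wall

paged_tps = delta_tokens / paged_delta_wall
vllm_tps = delta_tokens / vllm_delta_wall
ratio = paged_tps / vllm_tps
gap_pct = (1.0 - ratio) * 100.0         # how far paged trails vLLM

print(f"paged {paged_tps:.2f} t/s, vLLM {vllm_tps:.2f} t/s")
print(f"ratio {ratio:.4f}, gap {gap_pct:.1f}%")
```

Because both engines produce the same number of extra tokens, the ratio reduces to `vllm_delta_wall / paged_delta_wall`.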
+ +- [x] **Step 3: Commit the documentation-only result** + +Run: + +```bash +git status --short +git add docs/superpowers/plans/2026-07-01-dense-true-decode-phase50.md \ + backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +git commit -m "docs(paged): record dense true decode profile" -m "Assisted-by: Codex:gpt-5" +``` + +Expected: commit succeeds and `.claude/` remains the only unrelated untracked path. + +## Self-Review + +- Spec coverage: covers inference safety via pre/post md5 and op checks, true steady decode via graph-node nsys difference method, and docs/plan phase tracking. +- Placeholder scan: no `TBD`, `TODO`, or unspecified test commands. +- Type consistency: the artifact path placeholder is consistently `REPLACE_WITH_TIMESTAMP`; replace it with the actual timestamp before running each command. diff --git a/docs/superpowers/plans/2026-07-01-gate-fusion-feasibility-phase39.md b/docs/superpowers/plans/2026-07-01-gate-fusion-feasibility-phase39.md new file mode 100644 index 000000000000..a9ec72dbf057 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-gate-fusion-feasibility-phase39.md @@ -0,0 +1,158 @@ +# Gate Fusion Feasibility Phase39 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** decide whether to implement a quick fused F32 router/shared-expert gate projection after Phase38. + +**Architecture:** Phase39 is evidence-first and source-conservative. It compares the Phase37 tensor-name trace, the Phase27 graph-node serving profile, and llama.cpp graph/model-loader capabilities. 
It rejects graph-time weight concatenation because it would add layout-copy work in a bucket that is already measurable, and scopes the only acceptable follow-up as a persistent/load-time combined-weight design with md5/op/KL gates. + +**Tech Stack:** LocalAI paged llama.cpp backend, llama.cpp CUDA fork, DGX GB10, Nsight Systems, vLLM Qwen3-Next fused-MoE source comparison. + +--- + +### Task 1: Inspect current graph/model support + +**Files:** +- Read: `/home/mudler/_git/llama.cpp/ggml/include/ggml.h` +- Read: `/home/mudler/_git/llama.cpp/src/models/qwen35moe.cpp` +- Read: `/home/mudler/_git/llama.cpp/src/llama-graph.cpp` +- Read: `/home/mudler/_git/vllm/vllm/model_executor/layers/fused_moe/runner/moe_runner.py` + +- [x] **Step 1: Confirm llama.cpp gate tensors** + +`qwen35moe.cpp` creates: + +```cpp +layer.ffn_gate_inp = create_tensor(..., { n_embd, n_expert }, flags); +layer.ffn_gate_inp_shexp = create_tensor(..., { n_embd }, flags); +``` + +and computes: + +```cpp +build_moe_ffn(cur, model.layers[il].ffn_gate_inp, ...) +build_lora_mm(model.layers[il].ffn_gate_inp_shexp, cur) +``` + +- [x] **Step 2: Confirm ggml graph-time concat is available but not free** + +`ggml.h` exposes `ggml_concat()` and `ggml_view_*()`, so a graph-time fused +gate is syntactically possible. It would require building a temporary combined +weight in the compute graph unless the model loader creates a persistent +combined tensor. + +- [x] **Step 3: Confirm vLLM's relevant idea** + +vLLM's fused-MoE runner concatenates router and shared-expert gate weights into +`_combined_gate_weight`. The useful design pattern is persistent F32 combined +gate weight, not BF16/NVFP4 routing. 
+ +### Task 2: Reuse existing serving evidence + +**Files:** +- Artifact: `dgx.casa:/home/mudler/bench/phase37_cublas_name_trace/20260701_083227` +- Artifact: `dgx.casa:/home/mudler/bench/phase27_graph_node_serving/20260701_055519` +- Artifact: `dgx.casa:/home/mudler/bench/phase39_gate_sgemm_profile/phase27_reanalysis` + +- [x] **Step 1: Read Phase37 route-name evidence** + +Observed: + +```text +2884 route=bf16_tc +1212 route=sgemm +16 route=sgemm type=0 src0=blk.N.ffn_gate_inp.weight src1=attn_post_norm-N dst=ffn_moe_logits-N +16 route=sgemm type=0 src0=blk.N.ffn_gate_inp_shexp.weight src1=attn_post_norm-N dst=shared_expert_gate-N +``` + +- [x] **Step 2: Re-analyze Phase27 graph-node serving profile** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail; ART=/home/mudler/bench/phase39_gate_sgemm_profile/phase27_reanalysis; SRC=/home/mudler/bench/phase27_graph_node_serving/20260701_055519/llama_graph_node.nsys-rep; mkdir -p "$ART"; nsys stats --report cuda_gpu_kern_sum,cuda_api_sum --format csv --output "$ART/phase27" "$SRC"' +``` + +Observed serving kernel buckets: + +```text +TOTAL kernel time: 20.0372 s +cublas_bf16_gemm 1892.81ms 9.45% +cutlass_bf16_gemm 684.01ms 3.41% +concat_layout 459.84ms 2.29% +``` + +Top raw kernel evidence includes: + +```text +concat_non_cont 459.84ms 2.3% 2250 instances +``` + +### Task 3: Decision + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` + +- [x] **Step 1: Reject graph-time fused gate via `ggml_concat`** + +Do not implement a quick graph-time combined gate that concatenates +`ffn_gate_inp` and `ffn_gate_inp_shexp` inside the compute graph. It risks +adding work to the existing `concat_layout` bucket (`459.84ms`, `2.29%`) before +removing enough SGEMM overhead, and it would be a high-conflict graph/model edit +without clear upside. 
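The bucket percentages quoted above are shares of total kernel time. A small check, using only the recorded values, also gives the average cost per `concat_non_cont` launch. Treating that average as representative of what a graph-time gate-weight concat would cost is an assumption; the profile does not break the bucket down by tensor.

```python
total_kernel_s = 20.0372
buckets_ms = {
    "cublas_bf16_gemm": 1892.81,
    "cutlass_bf16_gemm": 684.01,
    "concat_layout": 459.84,
}

for name, ms in buckets_ms.items():
    share = ms / (total_kernel_s * 1000.0) * 100.0
    print(f"{name:18s} {ms:8.2f} ms  {share:5.2f}%")

concat_instances = 2250
avg_concat_us = buckets_ms["concat_layout"] / concat_instances * 1000.0
print(f"average concat_non_cont launch: {avg_concat_us:.1f} us")
```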
+ +- [x] **Step 2: Preserve the only acceptable follow-up shape** + +The only follow-up worth scoping is a persistent/load-time F32 combined gate +weight: + +```text +combined_gate_weight = concat_rows(ffn_gate_inp.weight, + ffn_gate_inp_shexp.weight) +``` + +Requirements: + +- default-off until gates pass; +- no BF16/NVFP4 conversion for gate weights; +- no graph-time weight concat; +- split combined output into `ffn_moe_logits` and `shared_expert_gate` views; +- MoE/dense md5 must match before serving benchmarks; +- `MUL_MAT` and `MUL_MAT_ID` must pass; +- if md5 changes, run KL first and reject on KL regression. + +### Task 4: Verify and commit docs + +**Files:** +- Modify: `docs/superpowers/plans/2026-07-01-gate-fusion-feasibility-phase39.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` + +- [x] **Step 1: Check docs diff** + +Run: + +```bash +git diff -- docs/superpowers/plans/2026-07-01-gate-fusion-feasibility-phase39.md \ + backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +``` + +Expected: only Phase39 documentation changes. 
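A toy sketch of the combined-gate shape described above, in plain Python rather than ggml. It stacks a hypothetical router weight (`n_expert` rows) and the shared-expert gate row into one matrix at "load time", runs a single projection, and splits the result into the two views. All names and sizes are illustrative and nothing here is llama.cpp API. The exact equality holds only because each output row is the same dot product in the same order, which a real CUDA GEMM would not guarantee; that is why the md5 and KL gates remain mandatory.

```python
def matvec(weight_rows, x):
    """One dot product per weight row, accumulated left to right."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight_rows]


# Illustrative sizes: n_embd = 3, n_expert = 4.
ffn_gate_inp = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.5], [0.0, 1.0, 1.0], [-2.0, 0.5, 0.75]]
ffn_gate_inp_shexp = [0.25, 0.5, -1.5]          # single shared-expert gate row
x = [1.0, 2.0, -0.5]                            # one token's hidden state

# Load time: build the persistent combined weight once (no per-graph concat).
combined_gate_weight = ffn_gate_inp + [ffn_gate_inp_shexp]

# Run time: one projection, then split into two views.
combined_out = matvec(combined_gate_weight, x)
n_expert = len(ffn_gate_inp)
ffn_moe_logits = combined_out[:n_expert]
shared_expert_gate = combined_out[n_expert]

# Reference: the two separate projections used today.
assert ffn_moe_logits == matvec(ffn_gate_inp, x)
assert shared_expert_gate == matvec([ffn_gate_inp_shexp], x)[0]
print(ffn_moe_logits, shared_expert_gate)
```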
+ +- [x] **Step 2: Commit** + +Run: + +```bash +git add -f docs/superpowers/plans/2026-07-01-gate-fusion-feasibility-phase39.md +git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +git commit -m "docs(paged): reject graph-time gate fusion shortcut" \ + -m "Assisted-by: Codex:gpt-5" +``` diff --git a/docs/superpowers/plans/2026-07-01-gate-projection-policy-phase38.md b/docs/superpowers/plans/2026-07-01-gate-projection-policy-phase38.md new file mode 100644 index 000000000000..2b3d0a27576e --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-gate-projection-policy-phase38.md @@ -0,0 +1,164 @@ +# Gate Projection Policy Phase38 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** decide whether the Phase37 `ffn_gate_inp*` SGEMM bucket is a safe vLLM-parity lever without breaking inference. + +**Architecture:** Treat router logits and shared-expert gate projections as inference-critical F32 policy until proven otherwise. Phase38 is analysis-first: record the source/vLLM comparison, strengthen the default inference gate, and only allow later route changes behind md5/op gates plus KL if byte output changes. + +**Tech Stack:** LocalAI paged llama.cpp backend, llama.cpp CUDA fork, DGX GB10, vLLM Qwen3-Next fused-MoE code, `paged-inference-gates.sh`. 
+ +--- + +### Task 1: Establish a fresh inference baseline + +**Files:** +- Read: `backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh` +- Artifact: `dgx.casa:/home/mudler/bench/phase38_gate_baseline/20260701_084410` + +- [x] **Step 1: Verify DGX is idle** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail; echo owner=$(cat ~/gpu_bench_lock/owner 2>/dev/null || true); echo docker=$(docker ps -q | wc -l); echo local_ai_worker=$(docker ps --format "{{.Names}}" | grep -c local-ai-worker || true); echo compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l); nvidia-smi --query-gpu=name,driver_version --format=csv,noheader' +``` + +Observed: + +```text +owner=FREE phase33-small-m-tile-policy-done 1782883234 +docker=0 +local_ai_worker=0 +compute=0 +NVIDIA GB10, 580.159.03 +``` + +- [x] **Step 2: Run canonical md5 and op gates** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase38_gate_baseline/$(date +%Y%m%d_%H%M%S); mkdir -p "$ART"; BIN=$HOME/llama-phase6-source/build-phase36/bin ART="$ART" OPS=MUL_MAT,MUL_MAT_ID $HOME/paged-inference-gates.sh | tee "$ART/gate.log"' +``` + +Observed: + +```text +moe md5 OK: 8cb0ce23777bf55f92f63d0292c756b0 +dense md5 OK: 5951a5b4d624ce891e22ab5fca9bc439 +1146/1146 tests passed +Backend CUDA0: OK +806/806 tests passed +Backend CUDA0: OK +paged inference gates OK +artifacts: /home/mudler/bench/phase38_gate_baseline/20260701_084410 +``` + +### Task 2: Strengthen the reusable inference gate + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh` + +- [x] **Step 1: Make both matmul op gates default** + +Change: + +```bash +OPS=${OPS:-MUL_MAT_ID} +``` + +to: + +```bash +OPS=${OPS:-MUL_MAT,MUL_MAT_ID} +``` + +Also update `--help` text so the default is visible. 
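For readers less familiar with the shell idiom, `${OPS:-MUL_MAT,MUL_MAT_ID}` falls back to the default when `OPS` is unset or empty. The sketch below mirrors that behaviour in Python. The helper name is hypothetical, and splitting the value on commas is an assumption about how the script consumes it.

```python
def resolve_ops(env):
    """Mimic ${OPS:-MUL_MAT,MUL_MAT_ID}: unset or empty falls back to the default."""
    value = env.get("OPS") or "MUL_MAT,MUL_MAT_ID"
    return value.split(",")


print(resolve_ops({}))                      # unset
print(resolve_ops({"OPS": ""}))             # empty also uses the default
print(resolve_ops({"OPS": "MUL_MAT_ID"}))   # explicit override wins
```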
+ +- [x] **Step 2: Verify shell syntax and help output** + +Run: + +```bash +bash -n backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh +backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh --help | grep 'default: MUL_MAT,MUL_MAT_ID' +``` + +Expected: exit 0 and the updated default line is printed. + +### Task 3: Record the Phase37 to Phase38 policy decision + +**Files:** +- Read: `/home/mudler/_git/llama.cpp/src/models/qwen35moe.cpp` +- Read: `/home/mudler/_git/llama.cpp/src/llama-graph.cpp` +- Read: `/home/mudler/_git/vllm/vllm/model_executor/models/qwen3_next.py` +- Read: `/home/mudler/_git/vllm/vllm/model_executor/layers/fused_moe/runner/moe_runner.py` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` + +- [x] **Step 1: Source inspection result** + +`qwen35moe.cpp` creates `ffn_gate_inp.weight` as `[n_embd, n_expert]` and `ffn_gate_inp_shexp.weight` as `[n_embd]`. The graph uses: + +```cpp +build_moe_ffn(cur, model.layers[il].ffn_gate_inp, ...) +build_lora_mm(model.layers[il].ffn_gate_inp_shexp, cur) +``` + +`llama-graph.cpp` computes router logits through `build_lora_mm(gate_inp, cur)` and labels the result `ffn_moe_logits`. + +- [x] **Step 2: vLLM comparison result** + +`qwen3_next.py` constructs both gates as `ReplicatedLinear(..., quant_config=None)`. `moe_runner.py` can concatenate `gate.weight` and `shared_expert_gate.weight` into `_combined_gate_weight` for fused shared-expert routing. + +- [x] **Step 3: Decision** + +The SGEMM bucket is not an accidental slow path. It is router/shared-expert gate math kept unquantized by both llama.cpp and vLLM. Do not force BF16 or NVFP4 for `ffn_gate_inp*`. 
The safe follow-up lever is a default-off fused gate projection experiment that preserves F32 math and split semantics, or a diagnostic proof that the two current SGEMMs are too small to matter. + +- [ ] **Step 4: Gate any later fused-gate experiment** + +Before benchmarking any code change: + +```bash +BIN=$HOME/llama-phase6-source/build-phase36/bin \ +ART=$HOME/bench/phase38_gate_fused_candidate \ +OPS=MUL_MAT,MUL_MAT_ID \ +$HOME/paged-inference-gates.sh +``` + +If either md5 differs, stop and run the KL gate before serving benchmarks. If either op gate fails, reject the candidate. + +### Task 4: Commit the docs and gate-script update + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh` +- Modify: `docs/superpowers/plans/2026-07-01-gate-projection-policy-phase38.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` + +- [x] **Step 1: Run local syntax checks** + +Run: + +```bash +bash -n backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh +``` + +Expected: exit 0. 
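The gating rule in Task 3 Step 4 amounts to a small decision procedure. A sketch with hypothetical function and argument names: an op-gate failure rejects outright, an md5 change diverts to the KL gate, and only a clean pass proceeds to serving benchmarks. The plan does not say which check wins when both fail; rejecting first is this sketch's choice.

```python
CANONICAL_MD5 = {
    "moe": "8cb0ce23777bf55f92f63d0292c756b0",
    "dense": "5951a5b4d624ce891e22ab5fca9bc439",
}


def gate_decision(md5s, op_gates_passed):
    """Return the next action for a candidate build (names are illustrative)."""
    if not op_gates_passed:
        return "reject"                    # MUL_MAT or MUL_MAT_ID failed
    if any(md5s.get(k) != v for k, v in CANONICAL_MD5.items()):
        return "run-kl-gate"               # bytes changed: KL before any serving run
    return "benchmark"                     # bit-exact and ops clean


print(gate_decision(dict(CANONICAL_MD5), True))
print(gate_decision({**CANONICAL_MD5, "dense": "0" * 32}, True))
print(gate_decision(dict(CANONICAL_MD5), False))
```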
+ +- [x] **Step 2: Commit** + +Run: + +```bash +git add backend/cpp/llama-cpp-localai-paged/paged-inference-gates.sh \ + docs/superpowers/plans/2026-07-01-gate-projection-policy-phase38.md \ + backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +git commit -m "docs(paged): scope gate projection policy" \ + -m "Assisted-by: Codex:gpt-5" +``` diff --git a/docs/superpowers/plans/2026-07-01-gdn-c32-slab-phase10.md b/docs/superpowers/plans/2026-07-01-gdn-c32-slab-phase10.md new file mode 100644 index 000000000000..7adf3c85ecc0 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-gdn-c32-slab-phase10.md @@ -0,0 +1,310 @@ +# GDN C32 Slab Phase 10 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Test whether a C=32, `dv_tile=64` slabbed M5-style tensor-core GDN prefill path beats the current C=16 M5 kernel without changing decode. + +**Architecture:** Phase 10 is a fork-first, profile-gated GDN prefill experiment. It does not revisit the rejected decode `GDN_NW/GDN_CPW` env grid. The candidate keeps the current tensor-core M5 math shape, slabs the value dimension into two `dv=64` blocks to fit dynamic shared memory, and initially recomputes A/T per slab to prove or reject the geometry before optimizing shared work. + +**Tech Stack:** llama.cpp CUDA GDN kernel, GB10 sm_121 CUDA build, Qwen3.6 NVFP4 MoE/dense GGUF, canonical md5/KL gates. + +--- + +## Guardrails + +- Keep `GDN_CHUNK_MIN > 1`; decode must never route into the chunked prefill prototype. +- Compare against current M5 (`GDN_TC=5`, `GDN_CHUNK_MIN=64`), not against old sequential GDN. +- Build-vs-build A/B only; do not accept a standalone PoC win. 
+- Keep the candidate default-off behind an explicit env selector until it clears correctness and performance gates. +- Run canonical md5 gates after any source change: + - MoE: `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense: `5951a5b4d624ce891e22ab5fca9bc439`. + +## Task 1: Baseline Current M5 + +**Files:** +- Read-only: `/home/mudler/llama-phase6-source/ggml/src/ggml-cuda/gated_delta_net.cu` +- Artifact: `/home/mudler/bench/phase10_gdn_c32_slab/` + +- [x] **Step 1: Check DGX is free** + +Run the standard DGX preflight: + +```bash +ssh dgx.casa 'set -e +echo docker=$(docker ps -q | wc -l) +echo local_ai_worker=$(docker ps --format "{{.Names}}" | grep -c local-ai-worker || true) +echo compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l) +if [ -f ~/gpu_bench_lock/owner ]; then cat ~/gpu_bench_lock/owner; else echo FREE-no-lock-file; fi' +``` + +Expected: + +```text +docker=0 +local_ai_worker=0 +compute=0 +FREE... +``` + +- [x] **Step 2: Record current source provenance** + +Run: + +```bash +ssh dgx.casa 'cd /home/mudler/llama-phase6-source && git status --short && git rev-parse HEAD' +``` + +Expected: clean or only the current phase commit. 
+ +- [x] **Step 3: Run current M5 prefill baseline** + +Run MoE and dense prefill at `npp=512` and `npp=2048` with: + +```bash +LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GGML_NO_BACKTRACE=1 +``` + +Record S_PP, kernel bucket summaries, and artifacts under: + +```text +/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/ +``` + +Result: + +| Model | PP | TG | B | S_PP t/s | S_TG t/s | S t/s | +|-------|----|----|---|----------|----------|-------| +| MoE | 512 | 4 | 32 | 2314.18 | 359.16 | 2220.48 | +| MoE | 2048 | 4 | 32 | 2439.95 | 389.43 | 2415.16 | +| Dense | 512 | 4 | 32 | 978.97 | 143.56 | 936.71 | +| Dense | 2048 | 4 | 32 | 1023.61 | 184.09 | 1014.59 | + +Artifacts: + +- `/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/paged_moe_prefill.txt` +- `/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/paged_dense_prefill.txt` +- `/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/summary_rows.txt` +- `/home/mudler/bench/phase10_gdn_c32_slab/m5_baseline/provenance.txt` + +## Task 2: Add Default-Off C32 Slab Candidate + +**Files:** +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` +- Mirror: `/home/mudler/llama-phase6-source/ggml/src/ggml-cuda/gated_delta_net.cu` + +### Source Inspection Result + +- [x] **Step 0: Check whether C32 can reuse the current M5 body** + +Result: no safe launcher-only shortcut exists for C32 M5. + +The current M5 code path is structurally specialized to `C<=16` in the form-T +solve/apply stage: + +- `gated_delta_net_chunked_cuda` stores the full `U=T*RHS` + output in registers before overwriting `Ud`, avoiding read/write aliasing. +- For `C=16`, one `m16` row tile covers all chunk rows. +- For `C=32`, there are two row tiles. Writing the first tile to `Ud` before + computing the second would corrupt the RHS reads for the second tile. 
+- The current code also calls the apply helper with rowbase `0` only in the M5 + solve path, so a naive `launch_gdn_chunked<128, 32, TC=4>` would be + incomplete even if dynamic shared memory fit. + +Implication: + +- Do not add `GDN_C32_SLAB=1` by only changing launch dimensions. +- A correct C32 slab patch must first add a separate `U=T*RHS` staging strategy: + either a slab-local temporary buffer for all `C*DV_TILE` U values, or a + two-pass apply that preserves the original RHS until all row tiles are + computed. +- Because the candidate changes the solve/apply mechanics, it requires a + focused `GATED_DELTA_NET` op gate before any prefill A/B. + +- [x] **Step 1: Add an explicit env selector** + +Use an env var such as: + +```text +GDN_C32_SLAB=1 +``` + +The default path must stay current M5. + +- [x] **Step 2: Introduce a C=32, dv_tile=64 launch** + +Target shape: + +```cpp +launch_gdn_chunked_slab<128, 32, 64, TC_>(...) +``` + +Initial prototype rules: + +- one slab block per `(head, seq, dv_tile)`, +- two slab blocks cover `dv=128`, +- recompute `A/T` per slab for simplicity, +- no decode routing, +- no D2H synchronization. + +- [x] **Step 3: Build on DGX** + +Run: + +```bash +ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda && cmake --build . --target llama-completion test-backend-ops -j 8' +``` + +Expected: build succeeds. + +Result: + +- Candidate implemented in the llama.cpp fork as a default-off + `GDN_C32_SLAB=1` path. +- Kernel generalized to `DV_TILE=64`, with two value slabs for `S_v=128`. +- C32 `U=T*RHS` writes were staged through a slab-local `Utmp` buffer to avoid + read/write aliasing against the RHS in `Ud`. +- Initial md5 failed on dense because tail chunks copied uninitialized staged + rows back into `Ud`; the root-cause fix zeroed `t >= Cc` rows during the + staged copy-back. +- DGX build succeeded after the tail fix: + `cmake --build . --target test-backend-ops llama-completion -j 8`. 
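The aliasing hazard and the staged fix can be reproduced on a toy problem. The sketch below is plain Python, not the CUDA kernel: it uses a 4-row chunk with 2-row tiles in place of `C=32` with `m16` tiles, a small lower-triangular `T`, and a list standing in for `Ud`. The names `Utmp` and `Ud` follow the text; the tile loop and the zeroing of rows at or past the valid chunk length are a simplified illustration of the described fix, not the real implementation.

```python
TILE = 2  # stand-in for the m16 row tile


def apply_tile(T, rhs, row0):
    """Compute rows [row0, row0+TILE) of U = T * rhs for a single value column."""
    return [sum(T[r][k] * rhs[k] for k in range(len(rhs)))
            for r in range(row0, row0 + TILE)]


def in_place(T, Ud):
    """Naive: write each tile back into Ud before the next tile reads it."""
    Ud = list(Ud)
    for row0 in range(0, len(Ud), TILE):
        Ud[row0:row0 + TILE] = apply_tile(T, Ud, row0)
    return Ud


def staged(T, Ud, valid_rows):
    """Staged: compute every tile into Utmp first, then copy back, zeroing tail rows."""
    Utmp = [0] * len(Ud)
    for row0 in range(0, len(Ud), TILE):
        Utmp[row0:row0 + TILE] = apply_tile(T, Ud, row0)
    return [Utmp[t] if t < valid_rows else 0 for t in range(len(Ud))]


T = [[1, 0, 0, 0],
     [2, 1, 0, 0],
     [3, 2, 1, 0],
     [4, 3, 2, 1]]
rhs = [1, 1, 1, 1]

reference = [sum(T[r][k] * rhs[k] for k in range(4)) for r in range(4)]
print("reference:", reference)
print("in-place :", in_place(T, rhs))
print("staged   :", staged(T, rhs, valid_rows=4))
print("tail chunk (3 valid rows):", staged(T, [1, 1, 1, 0], valid_rows=3))
```

With a single tile covering the whole chunk (the `C=16` case), the in-place version is safe because every output row is computed before any write. The corruption appears only once a second tile reads rows the first tile has already overwritten.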
+ +## Task 3: Correctness Gates + +**Files:** +- Artifact: `/home/mudler/bench/phase10_gdn_c32_slab/gates/` + +- [x] **Step 1: Run `GATED_DELTA_NET` op gate** + +Run default and forced C32 slab modes: + +```bash +./test-backend-ops test -b CUDA0 -o GATED_DELTA_NET -j 1 +GDN_C32_SLAB=1 ./test-backend-ops test -b CUDA0 -o GATED_DELTA_NET -j 1 +``` + +Required coverage to inspect in logs: + +- multi-chunk, +- tail chunk, +- multi-seq, +- GQA, +- permuted layout, +- adversarial decay. + +- [x] **Step 2: Run canonical md5 gates** + +Run MoE and dense greedy gates with and without `GDN_C32_SLAB=1`. + +Expected: + +```text +MoE 8cb0ce23777bf55f92f63d0292c756b0 +Dense 5951a5b4d624ce891e22ab5fca9bc439 +``` + +- [x] **Step 3: Run KL gate if md5 changes** + +If the C32 slab path changes reduction order and therefore md5, stop and run the +existing KL procedure from `PAGED_BITEXACT_NOTE.md`. Keep the patch only if the +new path is KL-benign and no worse than current M5. + +Result: + +- Default op gate: + `/home/mudler/bench/phase10_gdn_c32_slab/gates/gated_delta_net_default_after_tailfix.txt` +- Forced C32 op gate: + `/home/mudler/bench/phase10_gdn_c32_slab/gates/gated_delta_net_c32_slab_after_tailfix.txt` +- Both `GATED_DELTA_NET` CUDA0 gates passed. +- Canonical default md5 after tail fix: + - MoE: `8cb0ce23777bf55f92f63d0292c756b0` + - Dense: `5951a5b4d624ce891e22ab5fca9bc439` +- Forced C32 md5 after tail fix: + - MoE: `8cb0ce23777bf55f92f63d0292c756b0` + - Dense: `5951a5b4d624ce891e22ab5fca9bc439` +- KL gate was not needed because the md5 gates matched the canonical outputs + exactly after the tail-row fix. + +## Task 4: Performance A/B + +**Files:** +- Artifact: `/home/mudler/bench/phase10_gdn_c32_slab/ab/` + +- [x] **Step 1: Run C32 slab prefill at `npp=512`** + +Compare: + +```text +baseline: GDN_TC=5 GDN_CHUNK_MIN=64 +candidate: GDN_TC=5 GDN_CHUNK_MIN=64 GDN_C32_SLAB=1 +``` + +Pass: candidate beats current M5 S_PP outside noise. 
+ +- [x] **Step 2: Run C32 slab prefill at `npp=2048`** + +Use the same A/B. Pass requires a net S_PP improvement or a clear GDN bucket +reduction without a larger regression elsewhere. + +- [x] **Step 3: Reject if duplicated A/T work cancels the state-traffic win** + +If the candidate only shifts time between A/T recomputation and state traffic +without a net win, save the diff as a rejected artifact and update this plan. + +Result: + +Artifacts: + +- `/home/mudler/bench/phase10_gdn_c32_slab/ab/moe_base.txt` +- `/home/mudler/bench/phase10_gdn_c32_slab/ab/moe_c32.txt` +- `/home/mudler/bench/phase10_gdn_c32_slab/ab/dense_base.txt` +- `/home/mudler/bench/phase10_gdn_c32_slab/ab/dense_c32.txt` + +| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s | +|-------|------|----|----|---|----------|----------|-------| +| MoE | M5 base | 512 | 4 | 32 | 2323.48 | 397.57 | 2239.39 | +| MoE | C32 slab | 512 | 4 | 32 | 2069.12 | 357.43 | 1995.06 | +| MoE | M5 base | 2048 | 4 | 32 | 2430.32 | 388.29 | 2405.66 | +| MoE | C32 slab | 2048 | 4 | 32 | 2054.86 | 388.01 | 2037.79 | +| Dense | M5 base | 512 | 4 | 32 | 975.10 | 140.53 | 932.19 | +| Dense | C32 slab | 512 | 4 | 32 | 866.29 | 144.03 | 833.87 | +| Dense | M5 base | 2048 | 4 | 32 | 1019.25 | 183.25 | 1010.26 | +| Dense | C32 slab | 2048 | 4 | 32 | 903.73 | 183.47 | 896.86 | + +Decision: + +- Reject the C32 slab source patch. +- The candidate is correctness-clean after tail-row zeroing, but it regresses + S_PP in both model families. +- The likely mechanism is that recomputing `A/T` once per value slab cancels + the intended state-traffic win; optimizing this would require a broader + shared-work design rather than a small, low-conflict shortcut patch. +- Rejected diff saved at: + `/home/mudler/bench/phase10_gdn_c32_slab/rejected/c32_slab_tailfix_rejected.diff`. 
+ +## Task 5: Mirror or Reject + +**Files:** +- Create if accepted: `backend/cpp/llama-cpp-localai-paged/patches/paged/0055-...patch` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` + +- [x] **Step 1: Commit accepted fork patch** + +Commit only after correctness and performance gates pass. + +- [x] **Step 2: Generate LocalAI patch** + +Use `git format-patch`; do not hand-edit the generated patch. + +- [x] **Step 3: Update docs** + +Record exact artifacts, md5/KL results, and performance decision. + +Result: + +- No fork commit and no LocalAI patch were generated because Phase 10 was + rejected by the performance gate. +- The llama.cpp fork and DGX mirror were restored to the prior accepted state. +- This plan and the parity docs record the rejected source candidate so it is + not repeated as an accidental "obvious" follow-up. diff --git a/docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md b/docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md new file mode 100644 index 000000000000..448cae3bc606 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md @@ -0,0 +1,462 @@ +# GDN Global-Ai Prototype Phase 13 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Implement and test a default-off C32 GDN prefill prototype that computes f32 Ai once per chunk/head and reuses it across two value slabs. + +**Architecture:** The prototype adds one Ai precompute kernel plus one Ai-consuming chunked kernel in `gated_delta_net.cu`. Scratch is allocated from the existing ggml CUDA pool in `ggml_cuda_op_gated_delta_net`, scoped to the op, and only used when `GDN_GLOBAL_AI32=1`. 
+ +**Tech Stack:** llama.cpp CUDA, ggml CUDA pool allocator, GB10 DGX benchmark harness, Qwen3.6 NVFP4 GGUF gates. + +--- + +## Guardrails + +- Default path remains current C16 M5. +- Candidate engages only with `GDN_GLOBAL_AI32=1`. +- Prototype only supports `S_v=128`, `C=32`, `DV_TILE=64`, f32 Ai. +- Keep `GDN_CHUNK_MIN > 1`; decode must never use this path. +- Do not add f16/BF16 Ai until f32 Ai wins. +- Do not generate a LocalAI patch unless the fork implementation passes gates + and improves S_PP. + +## Task 1: Preflight + +**Files:** +- Read: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` +- Artifact: `/home/mudler/bench/phase13_gdn_global_ai32/` + +- [x] **Step 1: Check DGX is free** + +Run: + +```bash +ssh dgx.casa 'set -e +echo docker=$(docker ps -q | wc -l) +echo local_ai_worker=$(docker ps --format "{{.Names}}" | grep -c local-ai-worker || true) +echo compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l) +if [ -f ~/gpu_bench_lock/owner ]; then cat ~/gpu_bench_lock/owner; else echo FREE-no-lock-file; fi' +``` + +Expected: + +```text +docker=0 +local_ai_worker=0 +compute=0 +FREE... +``` + +- [x] **Step 2: Record provenance** + +Run: + +```bash +git -C /home/mudler/_git/llama.cpp status --short +git -C /home/mudler/_git/llama.cpp rev-parse HEAD +ssh dgx.casa 'cd /home/mudler/llama-phase6-source && git status --short && git rev-parse HEAD' +``` + +Expected: both llama.cpp trees are clean. + +- [x] **Step 3: Create artifacts** + +Run: + +```bash +ssh dgx.casa 'mkdir -p /home/mudler/bench/phase13_gdn_global_ai32/{gates,ab,rejected}' +``` + +Expected: command exits 0. 
+ +## Task 2: Add Ai Scratch Plumbing + +**Files:** +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` + +- [x] **Step 1: Add env selector in `ggml_cuda_op_gated_delta_net`** + +Add after `keep_rs` is computed: + +```cpp +static const bool gdn_global_ai32 = []{ + const char * e = getenv("GDN_GLOBAL_AI32"); + return e && atoi(e) != 0; +}(); +``` + +- [x] **Step 2: Allocate Ai scratch only for supported calls** + +Add: + +```cpp +float * ai32_d = nullptr; +int64_t ai32_chunks = 0; +ggml_cuda_pool_alloc<float> ai32_scratch(ctx.pool()); +if (gdn_global_ai32 && !kda && !keep_rs && S_v == 128 && n_tokens > 1) { + ai32_chunks = (n_tokens + 31) / 32; + ai32_d = ai32_scratch.alloc((size_t) n_seqs * H * ai32_chunks * 32 * 32); +} +``` + +Pass `ai32_d` and `ai32_chunks` into the non-KDA/non-keep launch call only. +Other launch calls pass `nullptr, 0`. + +- [x] **Step 3: Extend `launch_gated_delta_net` signature** + +Change the signature to include: + +```cpp +float * ai32_d, int64_t ai32_chunks, +``` + +before `float scale`. Thread these through all four call sites. + +## Task 3: Add Ai Precompute Kernel + +**Files:** +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` + +- [x] **Step 1: Add `gdn_ai32_cuda`** + +Add a kernel near `gated_delta_net_chunked_cuda`: + +```cpp +template <int S_v, int C> +__global__ void gdn_ai32_cuda( + const float * __restrict__ k, + const float * __restrict__ g, + const float * __restrict__ beta, + float * __restrict__ ai, + int64_t H, int64_t n_tokens, int64_t n_seqs, + int64_t sq1, int64_t sq2, int64_t sq3, + int64_t sb1, int64_t sb2, int64_t sb3, + uint3 neqk1_magic, uint3 rq3_magic) { + // CTA: blockIdx.x=head, blockIdx.y=seq, blockIdx.z=chunk. + // Shared: Kc[C*S_v], A[C*C], csh[C], gam[C], bet[C], KKsh[C*C]. + // Compute Kc, prefix csh/gam, KK, A, then exact f32 inverse into ai.
+} +``` + +The inverse algorithm must match the existing M5 f32 inverse: + +```cpp +if (j < C) { + if (j < Cc) { + float x[C]; + for (int r = 0; r < C; r++) x[r] = 0.0f; + x[j] = 1.0f; + for (int r = j + 1; r < Cc; r++) { + float acc = 0.0f; + for (int m = j; m < r; m++) acc += A[r * C + m] * x[m]; + x[r] = -acc; + } + for (int r = 0; r < C; r++) ai[ai_base + r * C + j] = x[r]; + } else { + for (int r = 0; r < C; r++) ai[ai_base + r * C + j] = 0.0f; + } +} +``` + +Use fixed stride `C` in scratch, zeroing out-of-range tail rows/columns. + +- [x] **Step 2: Add launcher** + +Add: + +```cpp +template <int S_v, int C> +static void launch_gdn_ai32(..., float * ai32_d, int64_t ai32_chunks, cudaStream_t stream) +``` + +Launch grid: + +```cpp +dim3 grid_dims(H, n_seqs, ai32_chunks); +dim3 block_dims(S_v, 1, 1); +``` + +Dynamic smem: + +```cpp +((size_t) C * S_v + (size_t) C * C + (size_t) 3 * C + (size_t) C * C) * sizeof(float) +``` + +## Task 4: Add Ai-Consuming C32 Slab Kernel + +**Files:** +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` + +- [x] **Step 1: Add `gated_delta_net_chunked_ai32_cuda`** + +Add a separate kernel rather than overloading the shipped M5 body: + +```cpp +template <int S_v, int C, int DV_TILE> +__global__ void gated_delta_net_chunked_ai32_cuda( + const float * __restrict__ q, + const float * __restrict__ k, + const float * __restrict__ v, + const float * __restrict__ g, + const float * __restrict__ beta, + const float * __restrict__ curr_state, + float * __restrict__ dst, + const float * __restrict__ ai, + int64_t H, int64_t n_tokens, int64_t n_seqs, + int64_t sq1, int64_t sq2, int64_t sq3, + int64_t sv1, int64_t sv2, int64_t sv3, + int64_t sb1, int64_t sb2, int64_t sb3, + uint3 neqk1_magic, uint3 rq3_magic, + float scale, float * __restrict__ state_dst, + const int32_t * __restrict__ ids, int rs_head) { + // CTA: blockIdx.x=head, blockIdx.y=seq, blockIdx.z=value slab. + // C=32, DV_TILE=64.
+ // Load the full source state stride S_v*S_v but own only columns [slab*DV_TILE, +DV_TILE). + // For every chunk, load Kc/Qc/csh/gam/bet, build RHS, load Ai, apply U = Ai*RHS, + // build P from QK, compute O, update owned state columns, write owned state columns. +} +``` + +Result: + +- Implemented as a template extension of `gated_delta_net_chunked_cuda` instead + of a separately named kernel, to keep the patch smaller. +- Candidate was selected with `GDN_GLOBAL_AI32=1`. +- The implementation used C32, two `DV_TILE=64` slabs, f32 Ai scratch, and the + Phase 10 tail-row zeroing fix. + +Use the Phase 10 tail-row fix: + +```cpp +Ud[j * C + t] = (t < Cc) ? staged_value : 0.0f; +``` + +and use full state stride for reads/writes: + +```cpp +(int64_t) seq * H * S_v * S_v + (int64_t) h_idx * S_v * S_v +``` + +- [x] **Step 2: Add launcher** + +Add: + +```cpp +template <int S_v, int C, int DV_TILE> +static void launch_gdn_chunked_ai32(..., const float * ai32_d, int64_t ai32_chunks, ...) +``` + +Launch grid: + +```cpp +dim3 grid_dims(H, n_seqs, S_v / DV_TILE); +dim3 block_dims(DV_TILE, 1, 1); +``` + +The smem formula must stay under the C32 slab Phase 10 budget: + +```cpp +((size_t) S_v * DV_TILE + (size_t) 2 * C * S_v + (size_t) DV_TILE * C + + (size_t) C * C + (size_t) 3 * C + (size_t) C * C ++ (size_t) DV_TILE * C) * sizeof(float) +``` + +Result: + +- DGX build confirmed the smem shape compiled. + +## Task 5: Route Candidate + +**Files:** +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` + +- [x] **Step 1: Add route in `launch_gated_delta_net`** + +Before the existing `GDN_CHUNKED_LAUNCH` switch: + +```cpp +if (ai32_d != nullptr && ai32_chunks > 0 && S_v == 128 && n_tokens >= gdn_chunk_min) { + launch_gdn_ai32<128, 32>(...); + launch_gdn_chunked_ai32<128, 32, 64>(...); + return; +} +``` + +The route must require `!KDA && !keep_rs_t` via the existing template branch and +must not trigger for decode-sized calls.
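The first of the two launches in this route fills the Ai scratch with the column-wise inverse quoted in Task 3. As a cross-check, that loop can be mirrored on the host. The sketch below is an assumption-laden reference, not the kernel: `unit_lower_inverse_ref` and `max_identity_residual` are hypothetical names, and it reads the loop as inverting `I + L`, where `L` is the strictly lower part of `A`, because only entries `A[r * C + m]` with `m < r` are ever touched.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Host reference for the column-wise forward substitution from Task 3.
// A is C x C row-major; only A[r * C + m] with m < r < Cc is read.
// Rows and columns at or beyond Cc stay zero in the output.
inline std::vector<float> unit_lower_inverse_ref(const std::vector<float> & A, int C, int Cc) {
    std::vector<float> ai((std::size_t) C * C, 0.0f);
    std::vector<float> x((std::size_t) C);
    for (int j = 0; j < Cc; ++j) {
        for (int r = 0; r < C; ++r) x[r] = 0.0f;
        x[j] = 1.0f;
        for (int r = j + 1; r < Cc; ++r) {
            float acc = 0.0f;
            for (int m = j; m < r; ++m) acc += A[r * C + m] * x[m];
            x[r] = -acc;
        }
        for (int r = 0; r < C; ++r) ai[r * C + j] = x[r];
    }
    return ai;
}

// Largest |((I + L) * Ai - I)[r][j]| over the valid Cc x Cc block.
inline float max_identity_residual(const std::vector<float> & A, const std::vector<float> & ai,
                                   int C, int Cc) {
    float worst = 0.0f;
    for (int r = 0; r < Cc; ++r) {
        for (int j = 0; j < Cc; ++j) {
            float s = ai[r * C + j];  // contribution of the unit diagonal
            for (int m = 0; m < r; ++m) s += A[r * C + m] * ai[m * C + j];
            const float err = std::fabs(s - (r == j ? 1.0f : 0.0f));
            if (err > worst) worst = err;
        }
    }
    return worst;
}
```

For the 2x2 case with a single sub-diagonal entry `0.5`, the reference returns `{1, 0, -0.5, 1}`. A GPU-side comparison against this reference would still need to copy the scratch back to the host, which the plan itself does not describe.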
+ +- [x] **Step 2: Keep default path unchanged** + +Run: + +```bash +git diff -- ggml/src/ggml-cuda/gated_delta_net.cu +``` + +Check that default `GDN_TC=5` still launches `launch_gdn_chunked<128, 16, 4>`. + +Result: + +- Default route stayed current M5. +- Candidate route required non-null Ai scratch from `GDN_GLOBAL_AI32=1`. + +## Task 6: Build and Correctness Gates + +**Files:** +- Artifact: `/home/mudler/bench/phase13_gdn_global_ai32/gates/` + +- [x] **Step 1: Mirror and build** + +Run: + +```bash +rsync -a /home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu \ + dgx.casa:/home/mudler/llama-phase6-source/ggml/src/ggml-cuda/gated_delta_net.cu +ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda && cmake --build . --target test-backend-ops llama-completion llama-batched-bench -j 8' +``` + +Expected: build exits 0. + +- [x] **Step 2: Run op gates** + +Run: + +```bash +ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda/bin +ART=$HOME/bench/phase13_gdn_global_ai32/gates +./test-backend-ops test -b CUDA0 -o GATED_DELTA_NET -j 1 > "$ART/gated_delta_net_default.txt" 2>&1 +GDN_GLOBAL_AI32=1 GDN_TC=5 GDN_CHUNK_MIN=2 ./test-backend-ops test -b CUDA0 -o GATED_DELTA_NET -j 1 > "$ART/gated_delta_net_global_ai32.txt" 2>&1' +``` + +Expected: both logs show CUDA0 OK for all cases. + +- [x] **Step 3: Run canonical md5 gates** + +Run default and candidate MoE/dense completion gates. Expected: + +```text +MoE 8cb0ce23777bf55f92f63d0292c756b0 +Dense 5951a5b4d624ce891e22ab5fca9bc439 +``` + +If candidate md5 differs, run the KL gate before benchmarking. + +Result: + +- Build passed for `test-backend-ops`, `llama-completion`, and + `llama-batched-bench`. +- Default and forced `GDN_GLOBAL_AI32=1` op gates both reported the same OK + count. +- Default md5: + - MoE `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense `5951a5b4d624ce891e22ab5fca9bc439`. +- Global-Ai32 md5: + - MoE `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense `5951a5b4d624ce891e22ab5fca9bc439`. 
+- KL was not needed. + +## Task 7: Performance A/B + +**Files:** +- Artifact: `/home/mudler/bench/phase13_gdn_global_ai32/ab/` + +- [x] **Step 1: Run same-session A/B** + +Run MoE and dense: + +```bash +LBASE="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GGML_NO_BACKTRACE=1" +LCAND="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GDN_GLOBAL_AI32=1 GGML_NO_BACKTRACE=1" +``` + +Use: + +```bash +./llama-batched-bench -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32 +``` + +Expected: candidate improves S_PP without dense regression. + +- [x] **Step 2: Decide** + +Accept only if: + +- op gate passes, +- md5 is canonical or KL-benign, +- MoE S_PP improves, +- dense S_PP does not regress outside noise. + +Reject if flat or slower. + +Result: + +Artifacts: + +- `/home/mudler/bench/phase13_gdn_global_ai32/ab/moe_base.txt` +- `/home/mudler/bench/phase13_gdn_global_ai32/ab/moe_global_ai32.txt` +- `/home/mudler/bench/phase13_gdn_global_ai32/ab/dense_base.txt` +- `/home/mudler/bench/phase13_gdn_global_ai32/ab/dense_global_ai32.txt` + +| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s | +|-------|------|----|----|---|----------|----------|-------| +| MoE | M5 base | 512 | 4 | 32 | 2325.86 | 396.05 | 2241.21 | +| MoE | Global Ai32 | 512 | 4 | 32 | 2106.50 | 398.55 | 2038.78 | +| MoE | M5 base | 2048 | 4 | 32 | 2425.10 | 389.63 | 2400.66 | +| MoE | Global Ai32 | 2048 | 4 | 32 | 2097.76 | 388.40 | 2079.92 | +| Dense | M5 base | 512 | 4 | 32 | 970.62 | 149.89 | 931.10 | +| Dense | Global Ai32 | 512 | 4 | 32 | 876.51 | 149.29 | 844.62 | +| Dense | M5 base | 2048 | 4 | 32 | 1016.14 | 182.16 | 1007.15 | +| Dense | Global Ai32 | 2048 | 4 | 32 | 918.19 | 183.00 | 911.05 | + +Decision: + +- Reject the global-Ai32 source patch. +- The candidate is correctness-clean but slower in both model families. +- The global scratch/Ai split is not enough to beat the shipped C16 M5 on GB10. 
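For a sense of the memory footprint behind the rejected candidate, the formulas from Tasks 2 to 4 can be evaluated at compile time for the only supported shape (`S_v=128`, `C=32`, `DV_TILE=64`). This is a sketch with illustrative function names; it restates the plan's formulas and does not measure anything.

```cpp
#include <cstddef>
#include <cstdint>

// Task 3 formula: dynamic shared memory of the Ai precompute launch, in bytes.
constexpr std::size_t ai32_precompute_smem_bytes(std::size_t S_v, std::size_t C) {
    return (C * S_v + C * C + 3 * C + C * C) * sizeof(float);
}

// Task 4 formula: dynamic shared memory of the Ai-consuming slab launch, in bytes.
constexpr std::size_t ai32_chunked_smem_bytes(std::size_t S_v, std::size_t C, std::size_t DV_TILE) {
    return (S_v * DV_TILE + 2 * C * S_v + DV_TILE * C
            + C * C + 3 * C + C * C
            + DV_TILE * C) * sizeof(float);
}

// Task 2 allocation: one 32x32 f32 tile per (sequence, head, 32-token chunk).
constexpr std::size_t ai32_scratch_bytes(std::int64_t n_seqs, std::int64_t H, std::int64_t n_tokens) {
    return (std::size_t) n_seqs * (std::size_t) H * (std::size_t) ((n_tokens + 31) / 32)
           * 32 * 32 * sizeof(float);
}

static_assert(sizeof(float) == 4, "the byte counts below assume 4-byte floats");
static_assert(ai32_precompute_smem_bytes(128, 32) == 24960, "precompute smem");
static_assert(ai32_chunked_smem_bytes(128, 32, 64) == 90496, "slab kernel smem");
static_assert(ai32_scratch_bytes(1, 1, 32) == 4096, "one chunk is a 4 KiB tile");
static_assert(ai32_scratch_bytes(1, 1, 33) == 2 * 4096, "a partial chunk rounds up");
```

The slab kernel's 90496 bytes is above the 48 KiB of dynamic shared memory CUDA grants without an explicit opt-in, so the launcher presumably raises the limit through `cudaFuncSetAttribute`; the plan does not show that call.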
+ +## Task 8: Mirror or Reject + +**Files:** +- Create if accepted: `backend/cpp/llama-cpp-localai-paged/patches/paged/0055-...patch` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` + +- [x] **Step 1: If accepted, commit fork patch and generate LocalAI patch** + +Run: + +```bash +git -C /home/mudler/_git/llama.cpp add ggml/src/ggml-cuda/gated_delta_net.cu +git -C /home/mudler/_git/llama.cpp commit -m "feat(cuda): add GDN global Ai32 prefill prototype" +git -C /home/mudler/_git/llama.cpp format-patch -1 HEAD --stdout \ + > backend/cpp/llama-cpp-localai-paged/patches/paged/0055-feat-cuda-add-GDN-global-Ai32-prefill-prototype.patch +``` + +- [x] **Step 2: If rejected, save diff and restore** + +Run: + +```bash +git -C /home/mudler/_git/llama.cpp diff -- ggml/src/ggml-cuda/gated_delta_net.cu \ + > /home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff +git -C /home/mudler/_git/llama.cpp checkout -- ggml/src/ggml-cuda/gated_delta_net.cu +ssh dgx.casa 'cd /home/mudler/llama-phase6-source && git checkout -- ggml/src/ggml-cuda/gated_delta_net.cu' +``` + +- [x] **Step 3: Commit LocalAI docs** + +Commit accepted patch/docs or rejected docs with: + +```bash +git commit -m "docs(paged): record GDN global Ai32 result" \ + -m "Assisted-by: Codex:gpt-5" +``` + +Result: + +- No fork commit and no `0055` LocalAI patch were generated. +- Rejected diff saved at: + `/home/mudler/bench/phase13_gdn_global_ai32/rejected/global_ai32_rejected.diff`. +- llama.cpp fork and DGX mirror were restored to the prior accepted state. 
diff --git a/docs/superpowers/plans/2026-07-01-gdn-m5-state-boundary-phase11.md b/docs/superpowers/plans/2026-07-01-gdn-m5-state-boundary-phase11.md new file mode 100644 index 000000000000..6380e6a258f5 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-gdn-m5-state-boundary-phase11.md @@ -0,0 +1,385 @@ +# GDN M5 State-Boundary Phase 11 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Test a low-conflict C=16 M5 GDN prefill variant that moves/reuses the QS state-boundary product earlier without changing chunk size or decode routing. + +**Architecture:** Phase 11 is fork-first and default-off. It keeps the shipped C=16 M5 path as the baseline, adds one explicit env-selected candidate in `gated_delta_net.cu`, and accepts it only if focused op gates, canonical md5 gates, and same-session GB10 A/B benchmarks all pass. + +**Tech Stack:** llama.cpp CUDA GDN kernel, DGX GB10 CUDA build, Qwen3.6 NVFP4 MoE/dense GGUF, LocalAI paged patch stack. + +--- + +## Guardrails + +- Do not reintroduce C32 slab code. +- Do not change the default `GDN_TC=5` path until the candidate wins. +- Keep `GDN_CHUNK_MIN > 1`; decode must stay on the sequential recurrence. +- Prefer one env-gated shortcut over new helper files or global scratch. +- Gate every change with `GATED_DELTA_NET` and canonical MoE/dense md5 before + any performance claim. 
+ +## Task 1: Preflight and Baseline + +**Files:** +- Read-only: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` +- Artifact: `/home/mudler/bench/phase11_gdn_m5_state_boundary/` + +- [x] **Step 1: Check DGX is free** + +Run: + +```bash +ssh dgx.casa 'set -e +echo docker=$(docker ps -q | wc -l) +echo local_ai_worker=$(docker ps --format "{{.Names}}" | grep -c local-ai-worker || true) +echo compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l) +if [ -f ~/gpu_bench_lock/owner ]; then cat ~/gpu_bench_lock/owner; else echo FREE-no-lock-file; fi' +``` + +Expected: + +```text +docker=0 +local_ai_worker=0 +compute=0 +FREE... +``` + +- [x] **Step 2: Record source provenance** + +Run: + +```bash +ssh dgx.casa 'cd /home/mudler/llama-phase6-source && git status --short && git rev-parse HEAD' +git -C /home/mudler/_git/llama.cpp status --short +git -C /home/mudler/_git/llama.cpp rev-parse HEAD +``` + +Expected: clean llama.cpp fork and DGX mirror before source edits. + +- [x] **Step 3: Create artifact directory** + +Run: + +```bash +ssh dgx.casa 'mkdir -p /home/mudler/bench/phase11_gdn_m5_state_boundary/{gates,ab,rejected}' +``` + +Expected: command exits 0. + +## Task 2: Add Default-Off Candidate + +**Files:** +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` +- Mirror: `/home/mudler/llama-phase6-source/ggml/src/ggml-cuda/gated_delta_net.cu` + +- [x] **Step 1: Add an env selector** + +Add a static env flag near the existing `gdn_tc` selector: + +```cpp +static const bool gdn_m5_qs_early = []{ + const char * e = getenv("GDN_M5_QS_EARLY"); + return e && atoi(e) != 0; +}(); +``` + +Route it only for `S_v == 128 && n_tokens >= gdn_chunk_min && gdn_tc >= 4`. 
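The selector and routing rule above are easy to get subtly wrong, so a host-side sketch of both may help. `env_flag_enabled` and `route_qs_early` are hypothetical helpers that only mirror the two expressions quoted in this step; they are not functions from the fork.

```cpp
#include <cstdint>
#include <cstdlib>

// Same truthiness rule as the selector: the variable is set and atoi(value) != 0.
inline bool env_flag_enabled(const char * value) {
    return value != nullptr && std::atoi(value) != 0;
}

// Hypothetical mirror of the routing condition for the QS-early candidate.
inline bool route_qs_early(bool flag, std::int64_t S_v, std::int64_t n_tokens,
                           std::int64_t gdn_chunk_min, int gdn_tc) {
    return flag && S_v == 128 && n_tokens >= gdn_chunk_min && gdn_tc >= 4;
}
```

Two consequences of this rule: a non-numeric value such as `GDN_M5_QS_EARLY=true` leaves the candidate disabled because `atoi` returns 0 for it, and a single-token decode call never routes to the candidate as long as `GDN_CHUNK_MIN` stays above 1.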
+ +- [x] **Step 2: Add a template boolean for the candidate** + +Extend the chunked launch templates with a defaulted boolean, keeping existing +call sites source-compatible: + +```cpp +template <int S_v, int C, int TC, bool QS_EARLY = false> +__global__ void gated_delta_net_chunked_cuda(...) +``` + +and: + +```cpp +template <int S_v, int C, int TC, bool QS_EARLY = false> +static void launch_gdn_chunked(...) +``` + +Use the boolean only inside the M3/M5 code path. Existing launches must remain +`launch_gdn_chunked<128, 16, TC_>(...)`. + +- [x] **Step 3: Move QS deposition earlier for candidate only** + +In `gated_delta_net_chunked_cuda`, after the KS/RHS section and before the +`solve A U = RHS` section, add a candidate-only QS pass: + +```cpp +if constexpr (QS_EARLY && TC >= 2) { + const int w = threadIdx.x >> 5; + const int lane = threadIdx.x & 31; + const int lg = lane >> 2; + const int lt = lane & 3; + constexpr int NWARP = S_v / 32; + constexpr int NT = dv / 8; + constexpr int NTPW = (NT + NWARP - 1) / NWARP; + #pragma unroll + for (int mt = 0; mt < (C + 15) / 16; mt++) { + const int rowbase = mt * 16; + #pragma unroll + for (int nn = 0; nn < NTPW; nn++) { + const int nt = w * NTPW + nn; + if (nt >= NT) break; + const int colbase = nt * 8; + float cc[4]; + gdn_gram_tile_mma_3x(cc, Qc, Sd, rowbase, colbase, lg, lt); + const int tt[4] = {rowbase + lg, rowbase + lg, rowbase + lg + 8, rowbase + lg + 8}; + const int jj[4] = {colbase + 2*lt, colbase + 2*lt + 1, colbase + 2*lt, colbase + 2*lt + 1}; + #pragma unroll + for (int l = 0; l < 4; l++) { + const int t = tt[l], jc = jj[l]; + if (t < Cc && jc < dv) { + attn_base[(c0 + t) * S_v * H + jc] = gam[t] * cc[l]; + } + } + } + } + __syncthreads(); +} +``` + +Then change the existing QS deposition block to: + +```cpp +if constexpr (TC >= 2 && !QS_EARLY) { + ... +} +``` + +This is intentionally conservative. It should not change math order for the +deposited QS values, only their scheduling relative to the solve/P build.
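The `tt`/`jj` arithmetic in the candidate pass decides which output element each lane deposits. A quick host-side check, sketched below with hypothetical names (`TileCoord`, `mma_acc_coord`, `tile_write_counts`), confirms that one warp's 32 lanes times 4 accumulator values cover a 16x8 tile exactly once, so moving the deposition earlier cannot introduce overlapping writes within a tile.

```cpp
#include <array>

struct TileCoord {
    int row;
    int col;
};

// Position of a lane's l-th accumulator value inside a 16x8 tile, using the
// same lg/lt arithmetic as the tt[] and jj[] arrays (rowbase = colbase = 0).
inline TileCoord mma_acc_coord(int lane, int l) {
    const int lg  = lane >> 2;               // 0..7
    const int lt  = lane & 3;                // 0..3
    const int row = lg + ((l >= 2) ? 8 : 0); // tt: lg, lg, lg + 8, lg + 8
    const int col = 2 * lt + (l & 1);        // jj: 2lt, 2lt + 1, 2lt, 2lt + 1
    return {row, col};
}

// Counts how many times each element of the 16x8 tile is written by one warp.
inline std::array<int, 16 * 8> tile_write_counts() {
    std::array<int, 16 * 8> counts{};
    for (int lane = 0; lane < 32; ++lane) {
        for (int l = 0; l < 4; ++l) {
            const TileCoord c = mma_acc_coord(lane, l);
            counts[c.row * 8 + c.col] += 1;
        }
    }
    return counts;
}
```

Every entry of `tile_write_counts()` is 1. The check covers only the index mapping; it says nothing about the values that `gdn_gram_tile_mma_3x` produces.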
+ +- [x] **Step 4: Add a candidate launch arm** + +In `launch_gated_delta_net`, when `gdn_m5_qs_early && gdn_tc >= 4`, call: + +```cpp +launch_gdn_chunked<128, 16, 4, true>(...); +return; +``` + +Default M5 must continue to call `launch_gdn_chunked<128, 16, 4>(...)`. + +- [x] **Step 5: Mirror to DGX and build** + +Run: + +```bash +rsync -a /home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu \ + dgx.casa:/home/mudler/llama-phase6-source/ggml/src/ggml-cuda/gated_delta_net.cu +ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda && cmake --build . --target test-backend-ops llama-completion llama-batched-bench -j 8' +``` + +Expected: build exits 0. + +Result: + +- Candidate implemented as a default-off `GDN_M5_QS_EARLY=1` path in the + llama.cpp fork. +- The patch only touched `ggml/src/ggml-cuda/gated_delta_net.cu`. +- DGX build passed for `test-backend-ops`, `llama-completion`, and + `llama-batched-bench`. + +## Task 3: Correctness Gates + +**Files:** +- Artifact: `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/` + +- [x] **Step 1: Run focused op gates** + +Run: + +```bash +ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda/bin +ART=$HOME/bench/phase11_gdn_m5_state_boundary/gates +./test-backend-ops test -b CUDA0 -o GATED_DELTA_NET -j 1 > "$ART/gated_delta_net_default.txt" 2>&1 +GDN_M5_QS_EARLY=1 GDN_TC=5 GDN_CHUNK_MIN=2 ./test-backend-ops test -b CUDA0 -o GATED_DELTA_NET -j 1 > "$ART/gated_delta_net_qs_early.txt" 2>&1' +``` + +Expected: both logs show CUDA0 `OK` for all `GATED_DELTA_NET` cases. 
+ +- [x] **Step 2: Run canonical md5 gates** + +Run: + +```bash +ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda/bin +ART=$HOME/bench/phase11_gdn_m5_state_boundary/gates +LBASE="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GGML_NO_BACKTRACE=1" +LCAND="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=2 GDN_M5_QS_EARLY=1 GGML_NO_BACKTRACE=1" +env $LBASE ./llama-completion -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -ngl 99 -fa on -c 4096 --temp 0 --seed 1 -n 48 -p "The capital of France is" > "$ART/gate_moe_default.txt" 2> "$ART/gate_moe_default.err" +md5sum "$ART/gate_moe_default.txt" | tee "$ART/gate_moe_default.md5" +env $LBASE ./llama-completion -m /home/mudler/bench/q36-27b-nvfp4.gguf -ngl 99 -fa on -c 4096 --temp 0 --seed 1 -n 48 -p "The capital of France is" > "$ART/gate_dense_default.txt" 2> "$ART/gate_dense_default.err" +md5sum "$ART/gate_dense_default.txt" | tee "$ART/gate_dense_default.md5" +env $LCAND ./llama-completion -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -ngl 99 -fa on -c 4096 --temp 0 --seed 1 -n 48 -p "The capital of France is" > "$ART/gate_moe_qs_early.txt" 2> "$ART/gate_moe_qs_early.err" +md5sum "$ART/gate_moe_qs_early.txt" | tee "$ART/gate_moe_qs_early.md5" +env $LCAND ./llama-completion -m /home/mudler/bench/q36-27b-nvfp4.gguf -ngl 99 -fa on -c 4096 --temp 0 --seed 1 -n 48 -p "The capital of France is" > "$ART/gate_dense_qs_early.txt" 2> "$ART/gate_dense_qs_early.err" +md5sum "$ART/gate_dense_qs_early.txt" | tee "$ART/gate_dense_qs_early.md5"' +``` + +Expected: + +```text +MoE 8cb0ce23777bf55f92f63d0292c756b0 +Dense 5951a5b4d624ce891e22ab5fca9bc439 +``` + +- [x] **Step 3: Stop if md5 changes** + +If either candidate md5 differs, do not benchmark yet. Run the KL gate from +`backend/cpp/llama-cpp-localai-paged/docs/PAGED_BITEXACT_NOTE.md` and accept +only if KL is benign and the transcript is sane.
+ +Result: + +- Default op gate: `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gated_delta_net_default.txt`. +- QS-early op gate: `/home/mudler/bench/phase11_gdn_m5_state_boundary/gates/gated_delta_net_qs_early.txt`. +- Both focused op logs reported the same OK count. +- Default md5: + - MoE `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense `5951a5b4d624ce891e22ab5fca9bc439`. +- QS-early md5: + - MoE `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense `5951a5b4d624ce891e22ab5fca9bc439`. +- KL was not needed because md5 matched canonical exactly. + +## Task 4: Performance A/B + +**Files:** +- Artifact: `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/` + +- [x] **Step 1: Run same-session MoE and dense A/B** + +Run: + +```bash +ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda/bin +ART=$HOME/bench/phase11_gdn_m5_state_boundary/ab +mkdir -p "$ART" +LBASE="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GGML_NO_BACKTRACE=1" +LCAND="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GDN_M5_QS_EARLY=1 GGML_NO_BACKTRACE=1" +env $LBASE ./llama-batched-bench -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32 > "$ART/moe_base.txt" 2>&1 +env $LCAND ./llama-batched-bench -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32 > "$ART/moe_qs_early.txt" 2>&1 +env $LBASE ./llama-batched-bench -m /home/mudler/bench/q36-27b-nvfp4.gguf -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32 > "$ART/dense_base.txt" 2>&1 +env $LCAND ./llama-batched-bench -m /home/mudler/bench/q36-27b-nvfp4.gguf -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512,2048 -ntg 4 -npl 32 > "$ART/dense_qs_early.txt" 2>&1' +``` + +Expected: candidate improves S_PP for at least the target MoE prefill cases and +does not regress dense outside noise. 
+ +- [x] **Step 2: Decide accept/reject** + +Accept only if: + +- op gates pass, +- md5 is canonical or KL-benign, +- MoE S_PP improves, +- dense S_PP does not regress, +- decode routing remains untouched by `GDN_CHUNK_MIN > 1`. + +Reject if the candidate is flat/slower. Save: + +```bash +git -C /home/mudler/_git/llama.cpp diff -- ggml/src/ggml-cuda/gated_delta_net.cu \ + > /home/mudler/bench/phase11_gdn_m5_state_boundary/rejected/qs_early_rejected.diff +``` + +Then restore fork and DGX mirror. + +Result: + +Artifacts: + +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/moe_base.txt` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/moe_qs_early.txt` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/dense_base.txt` +- `/home/mudler/bench/phase11_gdn_m5_state_boundary/ab/dense_qs_early.txt` + +| Model | Mode | PP | TG | B | S_PP t/s | S_TG t/s | S t/s | +|-------|------|----|----|---|----------|----------|-------| +| MoE | M5 base | 512 | 4 | 32 | 2325.67 | 355.60 | 2229.90 | +| MoE | QS-early | 512 | 4 | 32 | 2315.77 | 353.27 | 2220.16 | +| MoE | M5 base | 2048 | 4 | 32 | 2441.54 | 390.53 | 2416.80 | +| MoE | QS-early | 2048 | 4 | 32 | 2420.26 | 389.89 | 2395.94 | +| Dense | M5 base | 512 | 4 | 32 | 975.15 | 142.71 | 932.97 | +| Dense | QS-early | 512 | 4 | 32 | 968.23 | 144.24 | 927.17 | +| Dense | M5 base | 2048 | 4 | 32 | 1021.06 | 183.34 | 1012.04 | +| Dense | QS-early | 2048 | 4 | 32 | 1015.77 | 183.73 | 1006.88 | + +Decision: + +- Reject the QS-early source patch. +- The candidate is correctness-clean but does not improve S_PP and slightly + regresses both model families. +- Rejected diff saved at: + `/home/mudler/bench/phase11_gdn_m5_state_boundary/rejected/qs_early_rejected.diff`. +- The llama.cpp fork and DGX mirror were restored to the prior accepted state. 
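The accept rule from Step 2 can be written down as a small predicate, which makes the Phase 11 outcome mechanical. This is a sketch: `GateResult` and `accept_candidate` are hypothetical names, and `noise_frac` is an assumed tolerance, since the plan says "outside noise" without giving a number.

```cpp
// Inputs to the accept/reject decision for one prefill size.
struct GateResult {
    bool   op_gates_pass;
    bool   md5_canonical_or_kl_benign;
    bool   decode_routing_untouched;  // GDN_CHUNK_MIN > 1 kept decode off the candidate
    double moe_base_spp;              // t/s
    double moe_cand_spp;              // t/s
    double dense_base_spp;            // t/s
    double dense_cand_spp;            // t/s
};

// Accept only if every correctness gate holds, MoE S_PP improves by more than
// the assumed noise band, and dense S_PP does not fall below that band.
inline bool accept_candidate(const GateResult & g, double noise_frac) {
    if (!g.op_gates_pass || !g.md5_canonical_or_kl_benign || !g.decode_routing_untouched) {
        return false;
    }
    const bool moe_improves = g.moe_cand_spp > g.moe_base_spp * (1.0 + noise_frac);
    const bool dense_holds  = g.dense_cand_spp >= g.dense_base_spp * (1.0 - noise_frac);
    return moe_improves && dense_holds;
}
```

Fed the `PP=2048` row pair from the table (MoE 2441.54 to 2420.26, dense 1021.06 to 1015.77) with an illustrative 1% band, the predicate rejects, because the MoE candidate does not exceed the baseline.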
+ +## Task 5: Mirror Accepted Patch or Record Rejection + +**Files:** +- Create if accepted: `backend/cpp/llama-cpp-localai-paged/patches/paged/0055-...patch` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md` + +- [x] **Step 1: If accepted, commit fork patch** + +Commit in `/home/mudler/_git/llama.cpp` only after gates pass: + +```bash +git add ggml/src/ggml-cuda/gated_delta_net.cu +git commit -m "feat(cuda): add gated delta net M5 QS-early path" +``` + +- [x] **Step 2: Generate LocalAI patch** + +Run: + +```bash +git -C /home/mudler/_git/llama.cpp format-patch -1 HEAD \ + --stdout > backend/cpp/llama-cpp-localai-paged/patches/paged/0055-feat-cuda-add-gated-delta-net-M5-QS-early-path.patch +``` + +Do not hand-edit the generated patch. + +- [x] **Step 3: Update docs and commit LocalAI** + +Record artifacts, md5/KL results, A/B numbers, and the decision. 
Commit with: + +```bash +git add backend/cpp/llama-cpp-localai-paged/patches/paged/0055-feat-cuda-add-gated-delta-net-M5-QS-early-path.patch \ + backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md +git commit -m "feat(paged): add GDN M5 QS-early path" \ + -m "Assisted-by: Codex:gpt-5" +``` + +If rejected, commit docs only: + +```bash +git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md \ + docs/superpowers/plans/2026-07-01-gdn-m5-state-boundary-phase11.md +git commit -m "docs(paged): record GDN M5 QS-early result" \ + -m "Assisted-by: Codex:gpt-5" +``` + +Result: + +- No fork commit and no LocalAI `0055` patch were generated because the + candidate failed the performance gate. +- Phase 11 is a docs-only rejection record. diff --git a/docs/superpowers/plans/2026-07-01-gdn-shared-ai-cost-model-phase12.md b/docs/superpowers/plans/2026-07-01-gdn-shared-ai-cost-model-phase12.md new file mode 100644 index 000000000000..a32f6dc67018 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-gdn-shared-ai-cost-model-phase12.md @@ -0,0 +1,332 @@ +# GDN Shared-A/Ai Cost Model Phase 12 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Decide whether a shared-A/Ai C32 GDN design is worth implementing on GB10 before touching llama.cpp source. 
+ +**Architecture:** Phase 12 is analysis-first and docs-only unless the cost model proves a credible win. It extracts model dimensions, computes dynamic-smem and global-scratch pressure, estimates traffic saved versus traffic added, and writes a go/no-go decision for a possible Phase 13 global-scratch prototype. + +**Tech Stack:** llama.cpp CUDA GDN kernel geometry, vLLM/FLA chunked GDN references, DGX GB10 benchmark artifacts, LocalAI parity docs. + +--- + +## Guardrails + +- Do not edit llama.cpp source in this phase. +- Do not generate a LocalAI patch file in this phase. +- Treat Phase 10 and Phase 11 as rejected; do not reopen C32 slab or QS-early. +- Use actual model metadata where available; if a dimension is inferred, mark it + as inferred. +- The output is a go/no-go decision, not an implementation patch. + +## Task 1: Gather Current Evidence + +**Files:** +- Read: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu` +- Read: `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/chunk.py` +- Read: `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/solve_tril.py` +- Read: `/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/wy_fast.py` +- Read: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Artifact: `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/` + +- [x] **Step 1: Check tree state** + +Run: + +```bash +git -C /home/mudler/_git/llama.cpp status --short +git -C /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention status --short +``` + +Expected: + +- llama.cpp fork is clean. +- LocalAI worktree only has this Phase 12 docs work and untracked `.claude/`. + +- [x] **Step 2: Create artifact directory** + +Run: + +```bash +ssh dgx.casa 'mkdir -p /home/mudler/bench/phase12_gdn_shared_ai_cost_model' +``` + +Expected: command exits 0. 
+ +- [x] **Step 3: Record reference function map** + +Record these llama.cpp insertion points in the result doc: + +```text +/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/gated_delta_net.cu + gated_delta_net_chunked_cuda + launch_gdn_chunked + launch_gated_delta_net + ggml_cuda_op_gated_delta_net +``` + +Record these vLLM reference functions: + +```text +/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/chunk.py + chunk_gated_delta_rule_fwd +/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/solve_tril.py + solve_tril + solve_tril_16x16_kernel + merge_16x16_to_32x32_inverse_kernel + merge_16x16_to_64x64_inverse_kernel +/home/mudler/_git/vllm/vllm/model_executor/layers/fla/ops/wy_fast.py + recompute_w_u_fwd +``` + +Result: recorded in +`backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`. + +## Task 2: Extract Model Dimensions + +**Files:** +- Artifact: `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt` + +- [x] **Step 1: Extract GGUF metadata** + +Run on DGX: + +```bash +ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda/bin +{ + echo "=== MoE ===" + ./llama-show-info -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf 2>/dev/null || ./llama-cli --show-info -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -n 0 2>/dev/null || true + echo "=== Dense ===" + ./llama-show-info -m /home/mudler/bench/q36-27b-nvfp4.gguf 2>/dev/null || ./llama-cli --show-info -m /home/mudler/bench/q36-27b-nvfp4.gguf -n 0 2>/dev/null || true +} > /home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt' +``` + +Expected: metadata file contains head count, layer count, and head dimension +or enough tensor metadata to infer them. + +Result: + +- Metadata artifact: + `/home/mudler/bench/phase12_gdn_shared_ai_cost_model/model_metadata.txt`. +- `llama-show-info` was not present in the DGX build, so a minimal read-only + GGUF metadata parser was used. 
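The parser that was actually used is not included in this plan. As a minimal sketch of the same read-only idea, assuming the GGUF v2+ fixed header layout (4-byte magic `GGUF`, little-endian `uint32` version, `uint64` tensor count, `uint64` metadata key/value count), a little-endian host, and GNU `od`, the header can be inspected as follows. `gguf_header` is a hypothetical helper name, and walking the key/value section that holds the `ssm.*` fields is not shown.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the fixed GGUF header fields of a model file.
# Assumes GGUF v2+ (64-bit counts), a little-endian host, and GNU od.
gguf_header() {  # args: path to a .gguf file
  f=$1
  magic=$(head -c 4 "$f")
  if [ "$magic" != "GGUF" ]; then
    echo "not a GGUF file: $f" >&2
    return 1
  fi
  # od -An drops the offset column; -j skips bytes; -N limits bytes read.
  echo "version=$(od -An -tu4 -j4 -N4 "$f" | tr -d ' ')"
  echo "tensor_count=$(od -An -tu8 -j8 -N8 "$f" | tr -d ' ')"
  echo "metadata_kv_count=$(od -An -tu8 -j16 -N8 "$f" | tr -d ' ')"
}
```

The helper only opens the file for reading, which matches the read-only constraint of this phase.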
+ +- [x] **Step 2: Summarize GDN dimensions** + +Write a short table in the result doc: + +```text +Model | GDN layers | H | S_v | benchmark npl | npp | chunks at BT=32 | chunks at BT=64 +``` + +Use benchmark shapes: + +- `npl=32` +- `npp=512,2048` +- `S_v=128` + +If H cannot be read directly from metadata, infer it from source/model docs and +mark the row as inferred. + +Result: + +| Model | GDN layers | H | S_v | benchmark npl | npp | chunks at BT=32 | chunks at BT=64 | +|-------|------------|---|-----|---------------|-----|-----------------|-----------------| +| MoE | 30 inferred | 32 inferred | 128 | 32 | 512 | 16 | 8 | +| MoE | 30 inferred | 32 inferred | 128 | 32 | 2048 | 64 | 32 | +| Dense | 48 inferred | 48 inferred | 128 | 32 | 512 | 16 | 8 | +| Dense | 48 inferred | 48 inferred | 128 | 32 | 2048 | 64 | 32 | + +`H = ssm.inner_size / ssm.state_size`. + +## Task 3: Compute Smem and Scratch Costs + +**Files:** +- Create: `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md` + +- [x] **Step 1: Record dynamic-smem formulas** + +Use: + +```text +C16 full-width current M5: + floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C + +C32 full-width: + floats = S_v*S_v + 2*C*S_v + S_v*C + C*C + 3*C + 2*C*C + +C32 slab64 with U staging: + floats = S_v*64 + 2*C*S_v + 64*C + C*C + 3*C + 2*C*C + 64*C +``` + +Expected values for `S_v=128`: + +```text +C16 full-width: 93,376 B / 91.19 KiB +C32 full-width: 127,360 B / 124.38 KiB +C32 slab64: 94,592 B / 92.38 KiB +``` + +- [x] **Step 2: Record Ai scratch formulas** + +Use: + +```text +Ai scratch bytes = npl * H * ceil(npp / BT) * BT * BT * sizeof(dtype) +``` + +Compute for: + +- `BT=32`, f32 and f16/bf16 Ai. +- `BT=64`, f32 and f16/bf16 Ai. +- `npp=512` and `npp=2048`. 
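The smem and Ai scratch formulas from Steps 1 and 2 can be checked with plain shell arithmetic. This is a sketch under the stated benchmark shapes; the function names are illustrative and do not come from the cost-model doc, and `sizeof` is passed as `4` for f32 or `2` for f16/bf16.

```shell
#!/usr/bin/env bash
# Dynamic-smem floats for the full-width layout (C16 and C32 share one formula).
smem_full_floats() {  # args: C S_v
  local C=$1 S=$2
  echo $(( S*S + 2*C*S + S*C + C*C + 3*C + 2*C*C ))
}

# Dynamic-smem floats for the C32 slab64 layout with U staging.
smem_slab64_floats() {  # args: C S_v
  local C=$1 S=$2
  echo $(( S*64 + 2*C*S + 64*C + C*C + 3*C + 2*C*C + 64*C ))
}

# Ai scratch bytes = npl * H * ceil(npp / BT) * BT * BT * sizeof(dtype).
ai_scratch_bytes() {  # args: npl H npp BT dtype_bytes
  local npl=$1 H=$2 npp=$3 BT=$4 sz=$5
  local nchunks=$(( (npp + BT - 1) / BT ))
  echo $(( npl * H * nchunks * BT * BT * sz ))
}

echo "C16 full-width: $(( $(smem_full_floats 16 128) * 4 )) B"
echo "C32 full-width: $(( $(smem_full_floats 32 128) * 4 )) B"
echo "C32 slab64:     $(( $(smem_slab64_floats 32 128) * 4 )) B"

# Sweep the benchmark shapes: H=32 (MoE) and H=48 (dense), both inferred.
for H in 32 48; do
  for npp in 512 2048; do
    for BT in 32 64; do
      for sz in 4 2; do
        bytes=$(ai_scratch_bytes 32 "$H" "$npp" "$BT" "$sz")
        echo "H=$H npp=$npp BT=$BT sizeof=$sz: $(( bytes / 1048576 )) MiB"
      done
    done
  done
done
```

Every shape in the sweep is an exact multiple of 1 MiB, so the integer division loses nothing.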
+ +- [x] **Step 3: Estimate extra global traffic** + +For a two-slab C32 design, estimate: + +```text +Ai write once = npl * H * nchunks * BT * BT * sizeof(Ai) +Ai read, both slabs = 2 * Ai write once (one read per slab) +total Ai traffic = 3 * Ai write once +``` + +Record the estimate in MiB for every benchmark shape. + +- [x] **Step 4: Estimate work saved** + +Record that shared Ai saves the A/T construction that the second slab currently duplicates: + +```text +saved per chunk/head = one KK/QK-derived A/T solve/apply setup currently duplicated by C32 slab +not saved = KS, QS, U, P*U, state update, state traffic +``` + +Do not claim a speedup from this estimate alone. The result doc must say whether +the saved work is large enough to justify the scratch traffic and kernel +boundary risk. + +Result: recorded in +`backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md`. +The f32 `BT=32` scratch path costs 256 MiB (MoE) and 384 MiB (dense) at +`npp=2048,npl=32`, with 768 MiB and 1.125 GiB of Ai traffic, respectively. + +## Task 4: Go/No-Go Decision + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` + +- [x] **Step 1: Write the decision** + +Use one of these exact decisions: + +```text +GO: Phase 13 should implement a default-off global-Ai scratch prototype. +``` + +or: + +```text +NO-GO: shared-A/Ai scratch is not credible on GB10; stop GDN kernel work here. +``` + +The decision must cite the scratch size and Ai traffic estimates. + +Decision: + +```text +GO: Phase 13 should implement a default-off global-Ai scratch prototype. +``` + +Rationale: the scratch/traffic cost is high enough to require strict gates, but +not high enough to reject without a default-off prototype.
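The Ai traffic figures the decision has to cite follow from the Step 3 estimate: one write plus one read per slab, so three times the scratch size for a two-slab design. A small sketch, with `ai_traffic_mib` as an illustrative name and the inferred head counts (`H=32` MoE, `H=48` dense) as inputs:

```shell
#!/usr/bin/env bash
# Total Ai traffic in MiB for a two-slab design:
# one write of the scratch plus one read per slab = 3 * scratch bytes.
ai_traffic_mib() {  # args: npl H npp BT dtype_bytes
  local npl=$1 H=$2 npp=$3 BT=$4 sz=$5
  local nchunks=$(( (npp + BT - 1) / BT ))
  local write=$(( npl * H * nchunks * BT * BT * sz ))
  echo $(( 3 * write / 1048576 ))
}

# f32 Ai at BT=32, npp=2048, npl=32.
echo "MoE   (H=32): $(ai_traffic_mib 32 32 2048 32 4) MiB"
echo "dense (H=48): $(ai_traffic_mib 32 48 2048 32 4) MiB"
```

The dense figure of 1152 MiB is the 1.125 GiB quoted in the Task 3 result.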
+ +- [x] **Step 2: If GO, write Phase 13 scope** + +If GO, create: + +```text +docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md +docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md +``` + +The Phase 13 plan must include: + +- default-off env selector, +- scratch allocation strategy, +- op gate, +- canonical MoE/dense md5 gates, +- same-session A/B, +- rejection path. + +Result: + +- `docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md`. +- `docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md`. + +- [x] **Step 3: If NO-GO, update final records** + +If NO-GO, update: + +- `VLLM_PARITY_FINAL.md` +- `PARITY_HANDOFF.md` + +Record that GDN kernel work on GB10 is exhausted by evidence, not assumption. + +Result: not applicable because Phase 12 is GO. The final/handoff records are +not changed to close GDN work. + +## Task 5: Verification and Commit + +**Files:** +- Modify/create the files from Task 4. + +- [x] **Step 1: Verify docs** + +Run: + +```bash +git diff --check +git status --short +``` + +Expected: + +- no whitespace errors, +- only intended docs are modified plus untracked `.claude/`. + +Result: + +- `git diff --check` exited 0. +- `/home/mudler/_git/llama.cpp` was clean. +- DGX metadata artifact existed and contained MoE/dense GGUF metadata. 
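The "only intended docs are modified" expectation is checked by eye in this plan. One way to make it mechanical, as a sketch only, is to compare `git status --short` against an allow-list of path prefixes. `only_intended_changes` is a hypothetical helper and not part of the repository; it assumes paths without spaces and does not handle rename entries (`old -> new`).

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed only if every path reported by
# `git status --short` starts with one of the allowed prefixes.
only_intended_changes() {  # args: allowed path prefixes
  git status --short | while IFS= read -r line; do
    path=${line#???}   # drop the two status columns and the separating space
    ok=1
    for prefix in "$@"; do
      case "$path" in "$prefix"*) ok=0 ;; esac
    done
    # The loop runs in a pipeline subshell, so exit ends only the loop
    # and its status becomes the function's return value.
    [ "$ok" -eq 0 ] || { echo "unexpected: $line" >&2; exit 1; }
  done
}
```

For this phase a call might look like `only_intended_changes backend/cpp/llama-cpp-localai-paged/docs/ docs/superpowers/ .claude/`.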
+ +- [x] **Step 2: Commit docs** + +For GO: + +```bash +git add backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \ + backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md +git add -f docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md \ + docs/superpowers/plans/2026-07-01-gdn-global-ai-prototype-phase13.md \ + docs/superpowers/plans/2026-07-01-gdn-shared-ai-cost-model-phase12.md +git commit -m "docs(paged): scope GDN shared-Ai prototype" \ + -m "Assisted-by: Codex:gpt-5" +``` + +For NO-GO: + +```bash +git add backend/cpp/llama-cpp-localai-paged/docs/GDN_SHARED_AI_COST_MODEL.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \ + backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +git add -f docs/superpowers/plans/2026-07-01-gdn-shared-ai-cost-model-phase12.md +git commit -m "docs(paged): close GDN shared-Ai cost model" \ + -m "Assisted-by: Codex:gpt-5" +``` diff --git a/docs/superpowers/plans/2026-07-01-graph-node-serving-profile-phase27.md b/docs/superpowers/plans/2026-07-01-graph-node-serving-profile-phase27.md new file mode 100644 index 000000000000..066803fc5141 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-graph-node-serving-profile-phase27.md @@ -0,0 +1,89 @@ +# Graph Node Serving Profile Phase 27 Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Re-profile the current clean llama.cpp serving path with CUDA graph +node tracing so source decisions are based on the required decode profiling +method. + +**Architecture:** This is a profile-only phase. 
It does not edit llama.cpp +source. It runs md5/op gates before and after a graph-node-traced n128 serving +profile, then records whether the bucket decomposition changes the Phase 8 +helper-dispatch decision. + +**Tech Stack:** DGX GB10, llama.cpp CUDA backend, Nsight Systems +`--cuda-graph-trace=node`, `paged-inference-gates.sh`, LocalAI parity docs. + +--- + +## Checklist + +- [x] **Step 1: Confirm the profiling gap** + - Phase 8 used an ordinary Nsight serving profile. + - Current handoff requires `--cuda-graph-trace=node` for decode/serving + profiles because CUDA graph replay can collapse kernel attribution. + +- [x] **Step 2: Check DGX preflight before gates** + - `docker=0` + - `local_ai_worker=0` + - `compute=0` + - GPU owner file was `FREE`. + +- [x] **Step 3: Run pre-profile inference gates** + - Artifact: `/home/mudler/bench/phase27_graph_node_serving/20260701_055519/gate_pre` + - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0` + - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439` + - `MUL_MAT_ID`: `806/806` + +- [x] **Step 4: Fix Nsight session-control syntax** + - A first attempt failed because `nsys launch` on Nsight Systems + `2025.3.2.474-253236389321v0` rejects `--cpuctxsw`. + - A smoke test showed the correct split: + `nsys launch --trace=cuda --cuda-graph-trace=node ...` and + `nsys start --sample=none --cpuctxsw=none -o OUT`. + - Do not repeat `--trace`, `--cuda-graph-trace`, or `--cpuctxsw` on both + commands for this Nsight version: the tracing flags go on `nsys launch`, + and `--sample`/`--cpuctxsw` go on `nsys start`.
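The split from Step 4 can be captured in a small dry-run wrapper so the flag placement is not retyped by hand. Only the flags quoted in Step 4 come from the smoke test; `profile_cmds` and its arguments are hypothetical, ending the collection with `nsys stop` is an assumption about the rest of the session, and the sketch prints the commands instead of running them.

```shell
#!/usr/bin/env bash
# Hypothetical dry-run helper: print the session-control sequence.
profile_cmds() {  # args: output name, then the server command line
  out=$1
  shift
  # Tracing flags belong on `nsys launch`.
  echo "nsys launch --trace=cuda --cuda-graph-trace=node $*"
  # Sampling and context-switch flags belong on `nsys start`.
  echo "nsys start --sample=none --cpuctxsw=none -o $out"
  # Assumption: the collection is ended with `nsys stop` after the client run.
  echo "nsys stop"
}

profile_cmds llama_graph_node ./llama-server
```

Printing first makes it easy to confirm that `--cpuctxsw` never reaches `nsys launch` before a real run is attempted.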
+ +- [x] **Step 5: Run graph-node-traced n128 serving profile** + - Artifact: `/home/mudler/bench/phase27_graph_node_serving/20260701_055519` + - Source: `f2521ab12 feat(server): trace speculative batch shapes` + - Hardware: `GPU 0: NVIDIA GB10`, driver `580.159.03`, compute `12.1` + - Serving shape: `n=128`, `PTOK=128`, `GEN=64` + - Client result: `decode_agg_tps=675.5`, `agg_tps=319.9`, + `prefill_tps=1671.1`, `TTFT mean=8363.4 ms` + +- [x] **Step 6: Run post-profile inference gates** + - The immediate post-gate raced with Nsight teardown and reported one compute + process even though `nvidia-smi` printed no running processes. + - Retried after idle preflight: + `/home/mudler/bench/phase27_graph_node_serving/20260701_055519/gate_post_retry` + - Retry MoE md5: `8cb0ce23777bf55f92f63d0292c756b0` + - Retry dense md5: `5951a5b4d624ce891e22ab5fca9bc439` + - Retry `MUL_MAT_ID`: `806/806` + +- [x] **Step 7: Bucket the graph-node trace** + - `buckets.txt` was generated from + `llama_graph_node.nsys-rep`. + - Macro buckets: + - GDN: `6706.33 ms` (`33.47%`) + - MoE/FFN-GEMM: `5871.92 ms` (`29.31%`) + - bf16-proj: `2725.07 ms` (`13.60%`) + - layout-copy: `1309.99 ms` (`6.54%`) + - act-quant: `697.75 ms` (`3.48%`) + - MoE-dispatch: `275.99 ms` (`1.38%`) + - FA: `271.03 ms` (`1.35%`) + +- [x] **Step 8: Record decision** + - Fine rows confirm the Phase 8 source shortcut rejection: + `gdn_core=29.59%`, `mmq_nvfp4=28.44%`, `mm_ids=0.61%`, + `gather_mmq=0.37%`, `argsort_topk=0.40%`. + - Do not reopen metadata/helper-only MoE dispatch work on GB10. + - A credible patch must directly reduce GDN, grouped-MMQ, or projection work + while preserving md5/op gates. + +## Result + +Phase 27 strengthens the profile basis for the current GB10 conclusion. It does +not find a new low-conflict source shortcut. The profile is representative of +Phase 26 n128 serving throughput and keeps the inference gates green after a +post-teardown retry. 
diff --git a/docs/superpowers/plans/2026-07-01-handoff-current-state-cleanup-phase23.md b/docs/superpowers/plans/2026-07-01-handoff-current-state-cleanup-phase23.md new file mode 100644 index 000000000000..32ef4a390940 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-handoff-current-state-cleanup-phase23.md @@ -0,0 +1,84 @@ +# Handoff Current-State Cleanup Phase 23 Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use +> superpowers:verification-before-completion before recording the phase result. +> Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** remove stale handoff coordinates that predate patches `0051-0055` and +the current clean DGX mirror. + +**Architecture:** documentation-only cleanup. Keep the measured benchmark +verdict unchanged, but make the handoff point to current source, patch count, +artifact, harness, and contribution-policy facts. + +**Tech Stack:** LocalAI paged docs, llama.cpp fork metadata, git status output. + +--- + +## Task 1: Identify Stale Coordinates + +- [x] **Step 1: Scan for old fork/patch references** + + Found stale references in `PARITY_HANDOFF.md`: + + - fork HEAD `d9b9be0be` and patch `0050`; + - `41` patch files spanning `0001-0050`; + - old `combined_definitive.sh` as the current reference harness; + - stale ahead/behind count; + - old AI attribution and sign-off wording. + +## Task 2: Update Handoff + +- [x] **Step 1: Update canonical source and mirror invariant** + + Current values: + + - local fork HEAD `fb9402661291e0488a3e2bf2f3948ebcd18e18c9`; + - DGX clean mirror HEAD `f2521ab12`; + - `46` patch files through `0055`; + - verified tree `5bdbf8ea3d750fe6fa1f85175fd6357d36222edb`. + +- [x] **Step 2: Update harness guidance** + + Current harness: + + - `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + + Historical harness: + + - `dgx:~/bench/combined_definitive.sh` should not be reused without porting. 
+ +- [x] **Step 3: Update contribution policy text** + + Current policy: + + - no AI `Signed-off-by`; + - no AI `Co-Authored-By`; + - use `Assisted-by: Codex:gpt-5`. + +## Task 3: Verification + +- [x] **Step 1: Re-scan for targeted stale strings** + + Command searched for: + + - `d9b9be0be` + - `41.*patch` + - `0001-0050` + - `199 ahead` + - `25 behind` + - `llama-paged-fork` + - stale `combined_definitive.sh` reference-harness wording + - old Claude attribution + + Result: + + - no targeted stale strings remain in `PARITY_HANDOFF.md`. + +## Self-Review + +- No source or benchmark behavior changed. +- The cleanup aligns the handoff with Phases 20-22. +- The parity verdict remains unchanged: current GB10 stack is still below vLLM + serving parity; the next credible path is new hardware or a larger fused-kernel + project. diff --git a/docs/superpowers/plans/2026-07-01-hardware-pivot-harness-phase44.md b/docs/superpowers/plans/2026-07-01-hardware-pivot-harness-phase44.md new file mode 100644 index 000000000000..ba3a23ec3beb --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-hardware-pivot-harness-phase44.md @@ -0,0 +1,154 @@ +# Phase44 Hardware Pivot Harness Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Make the current-stack serving snapshot harness configurable enough to run the same audited paged-vs-vLLM methodology on hardware beyond the current GB10 defaults. + +**Architecture:** Keep this as a harness-only change: add environment overrides for vLLM serving limits and print them in `DRY_RUN=1` output. Do not touch llama.cpp inference code, patch-series source, md5 gates, or op gates. + +**Tech Stack:** Bash harness, DGX preflight over ssh, LocalAI parity documentation. 
+ +--- + +### Task 1: Prove the vLLM config knobs are absent + +**Files:** +- Test: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + +- [x] **Step 1: Run help-text red check** + +```bash +backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help | grep -F 'VLLM_MAX_NUM_SEQS' +``` + +Expected: exit `1`, because the harness does not document the override yet. + +- [x] **Step 2: Run DGX dry-run red check** + +```bash +ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase44_hardware_pivot_harness_dryrun_red/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin ART=$ART NPL="1" PARALLEL=1 CTX=4096 PTOK=16 GEN=4 DRY_RUN=1 VLLM_GPU_MEMORY_UTILIZATION=0.90 VLLM_MAX_MODEL_LEN=8192 VLLM_MAX_NUM_SEQS=512 VLLM_TENSOR_PARALLEL_SIZE=2 VLLM_EXTRA_ARGS="--disable-log-requests" OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh | grep -F 'VLLM_MAX_NUM_SEQS=512' +``` + +Expected: exit `1`, because `DRY_RUN=1` validates inputs but does not print the vLLM serving config yet. 
+ +### Task 2: Add vLLM serving overrides + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + +- [x] **Step 1: Document the environment variables in `usage()`** + +Add these lines under `VLLM_BIN`: + +```bash + VLLM_GPU_MEMORY_UTILIZATION vLLM --gpu-memory-utilization (default: 0.85) + VLLM_MAX_MODEL_LEN vLLM --max-model-len (default: 4096) + VLLM_MAX_NUM_SEQS vLLM --max-num-seqs (default: 256) + VLLM_TENSOR_PARALLEL_SIZE vLLM --tensor-parallel-size (default: 1) + VLLM_EXTRA_ARGS whitespace-split extra args appended to vLLM serve (default: empty) +``` + +- [x] **Step 2: Add conservative defaults beside `VLLM_BIN`** + +```bash +VLLM_GPU_MEMORY_UTILIZATION=${VLLM_GPU_MEMORY_UTILIZATION:-0.85} +VLLM_MAX_MODEL_LEN=${VLLM_MAX_MODEL_LEN:-4096} +VLLM_MAX_NUM_SEQS=${VLLM_MAX_NUM_SEQS:-256} +VLLM_TENSOR_PARALLEL_SIZE=${VLLM_TENSOR_PARALLEL_SIZE:-1} +VLLM_EXTRA_ARGS=${VLLM_EXTRA_ARGS:-} +``` + +- [x] **Step 3: Use the variables in `run_vllm()`** + +Use an array for `VLLM_EXTRA_ARGS`: + +```bash + local extra_args=() + if [[ -n "$VLLM_EXTRA_ARGS" ]]; then + read -r -a extra_args <<< "$VLLM_EXTRA_ARGS" + fi +``` + +Then replace the hardcoded vLLM flags with: + +```bash + --served-model-name q36 --gpu-memory-utilization "$VLLM_GPU_MEMORY_UTILIZATION" --max-model-len "$VLLM_MAX_MODEL_LEN" \ + --max-num-seqs "$VLLM_MAX_NUM_SEQS" --host 127.0.0.1 --port "$VLLM_PORT" --tensor-parallel-size "$VLLM_TENSOR_PARALLEL_SIZE" \ + "${extra_args[@]}" \ +``` + +- [x] **Step 4: Print the vLLM config during `DRY_RUN=1`** + +```bash + log "vLLM config: VLLM_GPU_MEMORY_UTILIZATION=$VLLM_GPU_MEMORY_UTILIZATION VLLM_MAX_MODEL_LEN=$VLLM_MAX_MODEL_LEN VLLM_MAX_NUM_SEQS=$VLLM_MAX_NUM_SEQS VLLM_TENSOR_PARALLEL_SIZE=$VLLM_TENSOR_PARALLEL_SIZE VLLM_EXTRA_ARGS=[$VLLM_EXTRA_ARGS]" +``` + +### Task 3: Verify the harness + +**Files:** +- Test: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + +- [x] **Step 1: Shell syntax check** + 
+```bash +bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Expected: exit `0`. + +- [x] **Step 2: Help-text green check** + +```bash +backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help | grep -F 'VLLM_MAX_NUM_SEQS' +``` + +Expected: exit `0`. + +- [x] **Step 3: DGX dry-run green check** + +```bash +ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase44_hardware_pivot_harness_dryrun/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin ART=$ART NPL="1" PARALLEL=1 CTX=4096 PTOK=16 GEN=4 DRY_RUN=1 VLLM_GPU_MEMORY_UTILIZATION=0.90 VLLM_MAX_MODEL_LEN=8192 VLLM_MAX_NUM_SEQS=512 VLLM_TENSOR_PARALLEL_SIZE=2 VLLM_EXTRA_ARGS="--disable-log-requests" OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Expected: exit `0`, preflight shows docker/local-ai-worker/GPU compute idle, and output includes `VLLM_MAX_NUM_SEQS=512`. + +### Task 4: Record Phase44 in docs + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` +- Modify: `docs/superpowers/plans/2026-07-01-hardware-pivot-harness-phase44.md` + +- [x] **Step 1: Append the Phase44 result** + +Record that Phase44 is a harness-readiness change only. It does not claim a new performance result, does not run inference, and does not modify md5/op gate behavior. + +- [x] **Step 2: Mark all plan tasks complete** + +Change each remaining `- [ ]` entry in this file to `- [x]` only after the corresponding verification has been run. + +### Task 5: Commit + +**Files:** +- Commit all Phase44 script, docs, and plan changes. 
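The default-and-override pattern that Task 2 adds can be exercised in isolation. The sketch below reuses the variable names from Task 2, while `vllm_flags` is a hypothetical stand-in for the relevant part of `run_vllm()`. It highlights two properties worth keeping in mind: `read -r -a` splits on whitespace only, so quoting inside `VLLM_EXTRA_ARGS` is not honored, and if the harness runs under `set -u` on a bash release older than 4.4, expanding an empty array needs a guard.

```shell
#!/usr/bin/env bash
# Defaults as in Task 2; each can be overridden from the environment.
VLLM_MAX_NUM_SEQS=${VLLM_MAX_NUM_SEQS:-256}
VLLM_EXTRA_ARGS=${VLLM_EXTRA_ARGS:-}

# Hypothetical helper: print the vLLM flags that would be passed, one per line.
vllm_flags() {
  local extra_args=()
  if [[ -n "$VLLM_EXTRA_ARGS" ]]; then
    # Plain whitespace split: quotes inside VLLM_EXTRA_ARGS are not honored.
    read -r -a extra_args <<< "$VLLM_EXTRA_ARGS"
  fi
  # The ${arr[@]+...} guard keeps an empty array safe under `set -u`
  # on bash releases older than 4.4.
  printf '%s\n' --max-num-seqs "$VLLM_MAX_NUM_SEQS" \
    ${extra_args[@]+"${extra_args[@]}"}
}
```

With `VLLM_EXTRA_ARGS="--disable-log-requests"` the helper prints three lines; with the variable unset it prints only the `--max-num-seqs` pair.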
+ +- [x] **Step 1: Run final diff checks** + +```bash +git diff --check +git status --short +``` + +Expected: no whitespace errors; only intended files changed plus the pre-existing untracked `.claude/`. + +- [x] **Step 2: Commit** + +```bash +git add backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh \ + backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +git add -f docs/superpowers/plans/2026-07-01-hardware-pivot-harness-phase44.md +git commit -m "feat(paged): parameterize vllm serving snapshot" -m "Assisted-by: Codex:gpt-5" +``` diff --git a/docs/superpowers/plans/2026-07-01-inference-gate-guard-phase45.md b/docs/superpowers/plans/2026-07-01-inference-gate-guard-phase45.md new file mode 100644 index 000000000000..3c6acb9883f7 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-inference-gate-guard-phase45.md @@ -0,0 +1,87 @@ +# Phase45 Inference Gate Guard Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Prove the current DGX build still passes the canonical paged inference md5 and backend-op gates after the harness-only Phase44 change. + +**Architecture:** Run the existing DGX `~/paged-inference-gates.sh` script against `~/llama-phase6-source/build-phase36/bin` with both `MUL_MAT` and `MUL_MAT_ID` op filters. Record the artifact in the parity docs; do not change llama.cpp inference source. + +**Tech Stack:** DGX ssh, Bash gate harness, LocalAI parity documentation. + +--- + +### Task 1: Confirm DGX gate preflight + +**Files:** +- Test only: DGX runtime state. 
+ +- [x] **Step 1: Check docker, LocalAI worker, GPU compute, and lock owner** + +```bash +ssh dgx.casa 'set -euo pipefail; docker_count=$(docker ps -q | wc -l); local_ai=$(docker ps --format "{{.Names}}" | grep -c local-ai-worker || true); compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l); owner=FREE-no-lock-file; if [ -f "$HOME/gpu_bench_lock/owner" ]; then owner=$(cat "$HOME/gpu_bench_lock/owner"); fi; printf "docker=%s\nlocal_ai_worker=%s\ncompute=%s\nowner=%s\n" "$docker_count" "$local_ai" "$compute" "$owner"' +``` + +Expected: `docker=0`, `local_ai_worker=0`, `compute=0`, and owner starts with `FREE`. + +### Task 2: Run canonical inference gates + +**Files:** +- Test only: `~/paged-inference-gates.sh` on DGX. + +- [x] **Step 1: Run md5 and backend-op gates** + +```bash +ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase45_inference_gate_guard/$(date +%Y%m%d_%H%M%S); BIN=$HOME/llama-phase6-source/build-phase36/bin ART=$ART OPS=MUL_MAT,MUL_MAT_ID ~/paged-inference-gates.sh' +``` + +Expected: + +```text +moe md5 OK: 8cb0ce23777bf55f92f63d0292c756b0 +dense md5 OK: 5951a5b4d624ce891e22ab5fca9bc439 +1146/1146 tests passed +Backend CUDA0: OK +806/806 tests passed +Backend CUDA0: OK +paged inference gates OK +``` + +### Task 3: Record Phase45 + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` +- Modify: `docs/superpowers/plans/2026-07-01-inference-gate-guard-phase45.md` + +- [x] **Step 1: Append gate artifact and verdict** + +Record the exact artifact directory and the md5/op results. + +- [x] **Step 2: Mark this plan complete** + +Only mark the remaining steps complete after the gate and docs update are done. + +### Task 4: Commit + +**Files:** +- Commit the Phase45 docs and plan. 
+ +- [x] **Step 1: Run final checks** + +```bash +git diff --check +git status --short +``` + +Expected: no whitespace errors; only intended docs/plan changes plus the pre-existing untracked `.claude/`. + +- [x] **Step 2: Commit** + +```bash +git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +git add -f docs/superpowers/plans/2026-07-01-inference-gate-guard-phase45.md +git commit -m "docs(paged): record inference gate guard" -m "Assisted-by: Codex:gpt-5" +``` diff --git a/docs/superpowers/plans/2026-07-01-layout-trace-phase64.md b/docs/superpowers/plans/2026-07-01-layout-trace-phase64.md new file mode 100644 index 000000000000..61bc0b82ec80 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-layout-trace-phase64.md @@ -0,0 +1,201 @@ +# Layout Trace Phase64 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Attribute the remaining llama.cpp `layout-copy` prefill bucket to concrete graph tensors without changing inference behavior. + +**Architecture:** Add default-off CUDA layout tracing for `GET_ROWS`, `CPY`, `CONT`, `DUP`, and `CONCAT`, gated by `LLAMA_LAYOUT_TRACE=`. Use the same md5/op gates before accepting the instrumentation, then run a bounded MoE prefill trace to decide whether the layout bucket exposes a low-conflict Phase65 source patch. + +**Tech Stack:** llama.cpp CUDA backend, LocalAI paged parity docs, DGX `dgx.casa`, `llama-batched-bench`, canonical md5/op gates. + +--- + +## Guardrails + +- Trace must be silent when `LLAMA_LAYOUT_TRACE` is unset. +- Trace must not alter tensor data or route decisions. +- Do not regenerate LocalAI patch series in this phase. 
+- Canonical gates: + - MoE md5 `8cb0ce23777bf55f92f63d0292c756b0` + - dense md5 `5951a5b4d624ce891e22ab5fca9bc439` + - `MUL_MAT` `1146/1146` + - `MUL_MAT_ID` `806/806` + +## Files + +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu` +- Create: `docs/superpowers/plans/2026-07-01-layout-trace-phase64.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` + +--- + +### Task 1: Add Default-Off Layout Trace + +- [x] **Step 1: Inspect measured layout rows** + +Phase63 kernel names at `npp=2048`: + +```text +convert_unary: 721.23ms +convert_unary: 634.91ms +concat_non_cont: 566.04ms +k_get_rows_float: 307.52ms +cpy_scalar: 107.05ms +``` + +- [x] **Step 2: Add trace helper in `ggml-cuda.cu`** + +Implemented `LLAMA_LAYOUT_TRACE=` with route, op, dst/src names, types, +shapes, and contiguity flags. + +- [x] **Step 3: Wire trace calls to runtime dispatch** + +Runtime cases traced: + +- `GGML_OP_GET_ROWS` +- `GGML_OP_DUP` +- `GGML_OP_CPY` +- `GGML_OP_CONT` +- `GGML_OP_CONCAT` + +- [x] **Step 4: Verify local diff** + +Run: + +```bash +git -C /home/mudler/_git/llama.cpp diff --check +``` + +Expected: no output. + +Result: no output. + +--- + +### Task 2: Build and Gate on DGX + +- [x] **Step 1: Acquire DGX lock** + +Result: + +```text +docker=0 local_ai_worker=0 compute=0 lock=FREE released-by-codex-phase63-prefill-bucket 1782908317 +codex-phase64-layout-trace 1782908645 +``` + +- [x] **Step 2: Apply the patch to DGX clean build tree** + +Result: applied to `/home/mudler/llama-phase6-source`; remote diff was +`ggml/src/ggml-cuda/ggml-cuda.cu | 51 +++++++++++++++++++++++++++++++++++++++++`. 
+
+- [x] **Step 3: Build CUDA targets**
+
+Run:
+
+```bash
+ssh dgx.casa 'cd /home/mudler/llama-phase6-source && cmake --build build-cuda --target llama-completion llama-batched-bench test-backend-ops -j $(nproc)'
+```
+
+Result: build passed.
+
+- [x] **Step 4: Run patched md5/op gates**
+
+Artifact: `/home/mudler/bench/phase64_layout_trace/20260701_142519`
+
+```text
+patched moe_md5 8cb0ce23777bf55f92f63d0292c756b0 8cb0ce23777bf55f92f63d0292c756b0 ok
+patched dense_md5 5951a5b4d624ce891e22ab5fca9bc439 5951a5b4d624ce891e22ab5fca9bc439 ok
+patched MUL_MAT 1146/1146 1146/1146 ok
+patched MUL_MAT_ID 806/806 806/806 ok
+```
+
+---
+
+### Task 3: Run Bounded Layout Trace
+
+- [x] **Step 1: Run MoE prefill trace**
+
+Run:
+
+```bash
+LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 GGML_CUDA_DISABLE_GRAPHS=1 LLAMA_LAYOUT_TRACE=12000 \
+  ./llama-batched-bench -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf \
+  -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512 -ntg 4 -npl 32
+```
+
+Result files:
+
+- `/home/mudler/bench/phase64_layout_trace/20260701_142519/layout_trace_npp512.trace`
+- `/home/mudler/bench/phase64_layout_trace/20260701_142519/layout_trace_summary2.txt`
+
+- [x] **Step 2: Reduce trace**
+
+Route distribution:
+
+| route | lines |
+|-------|------:|
+| `get_rows` | `7268` |
+| `cpy` | `2008` |
+| `cont` | `1734` |
+| `concat` | `990` |
+
+Top type pairs:
+
+| route/type | count |
+|------------|------:|
+| `get_rows f32 -> f32` | `6250` |
+| `get_rows f16 -> f32` | `1018` |
+| `concat f32 -> f32` | `990` |
+| `cpy f32 -> f32 noncontig -> contig` | `990` |
+| `cont f16 -> f16 noncontig -> contig` | `970` |
+| `cont f32 -> f32 noncontig -> contig` | `688` |
+| `cpy f32 -> f16 noncontig -> contig` | `660` |
+| `cpy f32 -> f16 contig -> contig` | `358` |
+
+Named sources:
+
+- `concat conv_states_reshaped-N + qkv_mixed_transposed-N -> conv_input-N`
+- `cpy conv_state_last-N -> conv_state_update-N`
+- `get_rows cache_r_lN -> conv_states-N`
+- `get_rows ffn_moe_probs-N -> ffn_moe_weights-N`
+- `get_rows node_* with ffn_moe_topk-N` for expert fan-in weights
+- attention mask/KV reshapes and f32-to-f16 copies for paged full-attention layers
+
+---
+
+### Task 4: Commit and Record
+
+- [x] **Step 1: Commit fork instrumentation**
+
+Result: `/home/mudler/_git/llama.cpp` commit
+`fa944bb5f feat(cuda): trace layout tensor names`.
+
+- [x] **Step 2: Record LocalAI docs**
+
+Result: this plan and parity docs updated.
+
+- [x] **Step 3: Commit LocalAI docs**
+
+Result: this commit records the Phase64 LocalAI docs.
+
+Command:
+
+```bash
+git add -f docs/superpowers/plans/2026-07-01-layout-trace-phase64.md
+git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
+        backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \
+        backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md
+git commit -m "docs(paged): record layout trace phase" \
+  -m "Assisted-by: Codex:gpt-5"
+```
+
+---
+
+## Decision
+
+Phase64 keeps the instrumentation patch because it is default-off, low-conflict,
+and md5/op gated. It does not yet fund a layout optimization: the trace points at
+GDN conv-state materialization, MoE top-k fan-in gathers, and paged-attention
+mask/KV reshapes, not a single clean projection/layout shortcut.
diff --git a/docs/superpowers/plans/2026-07-01-low-concurrency-phase41.md b/docs/superpowers/plans/2026-07-01-low-concurrency-phase41.md
new file mode 100644
index 000000000000..6ca840aec125
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-low-concurrency-phase41.md
@@ -0,0 +1,128 @@
+# Low Concurrency Phase41 Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Quantify the low-concurrency GB10 serving gap after Phase40 rejected the max-concurrency C1 shortcut.
+
+**Architecture:** Reuse the same current-stack serving harness and canonical pre/post inference gates, changing only the concurrency list and llama-server parallel/context sizing.
+
+**Tech Stack:** Bash harness, DGX GB10, llama.cpp `llama-server`, vLLM OpenAI-compatible server, h2h client, `paged-inference-gates.sh`.
+
+---
+
+### Task 1: Define Low-Concurrency Snapshot
+
+**Files:**
+- Read: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`
+
+- [x] **Step 1: Select the run shape**
+
+Use:
+
+```bash
+NPL="1 8 32"
+PARALLEL=32
+CTX=32768
+PTOK=128
+GEN=64
+OPS=MUL_MAT,MUL_MAT_ID
+```
+
+- [x] **Step 2: Validate on DGX dry-run**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase41_low_concurrency_dryrun/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin ART=$ART NPL="1 8 32" PARALLEL=32 CTX=32768 PTOK=128 GEN=64 DRY_RUN=1 OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+```
+
+Observed artifact: `/home/mudler/bench/phase41_low_concurrency_dryrun/20260701_091429`.
+
+Expected evidence:
+
+```text
+docker=0
+local_ai_worker=0
+compute=0
+would build: cmake --build /home/mudler/llama-phase6-source/build-phase36 --target llama-server llama-completion test-backend-ops -j8
+would run paged NPL=[1 8 32] PTOK=128 GEN=64
+would run vLLM NPL=[1 8 32] PTOK=128 GEN=64
+```
+
+### Task 2: Run Low-Concurrency Snapshot With Gates
+
+**Files:**
+- Artifact: `dgx:~/bench/phase41_low_concurrency/20260701_091437`
+
+- [x] **Step 1: Run the snapshot**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase41_low_concurrency/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin ART=$ART NPL="1 8 32" PARALLEL=32 CTX=32768 PTOK=128 GEN=64 OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+```
+
+- [x] **Step 2: Confirm pre/post gates**
+
+Observed:
+
+```text
+pre moe_md5 ok 8cb0ce23777bf55f92f63d0292c756b0
+pre dense_md5 ok 5951a5b4d624ce891e22ab5fca9bc439
+pre op_MUL_MAT ok 1146/1146
+pre op_MUL_MAT_ID ok 806/806
+post moe_md5 ok 8cb0ce23777bf55f92f63d0292c756b0
+post dense_md5 ok 5951a5b4d624ce891e22ab5fca9bc439
+post op_MUL_MAT ok 1146/1146
+post op_MUL_MAT_ID ok 806/806
+```
+
+- [x] **Step 3: Record serving result**
+
+Observed:
+
+```text
+arm n agg_tps decode_agg_tps decode_perseq_tps prefill_tps ttft_mean_ms
+paged 1 50.6 56.5 55.61 1221.5 131.8
+paged 8 159.5 222.9 26.72 1438.8 835.9
+paged 32 240.1 393.9 11.15 1615.7 2784.4
+vllm 1 67.5 75.4 74.14 1720.4 95.3
+vllm 8 251.8 296.5 36.12 4558.8 266.0
+vllm 32 454.6 592.4 17.43 5376.5 818.6
+```
+
+- [x] **Step 4: Record decision**
+
+Decision: Phase41 confirms a low-concurrency/latency gap, but it does not reopen D1/full-step graph capture. Patch `0043` already ships grouped-MMQ full-step graph capture default-on, and Phase34 found `host_sync=0/4096`. The measured current-stack gap is around `0.75x` vLLM at `n=1/8` and `0.665x` at `n=32`; TTFT remains the larger user-visible gap.
+
+### Task 3: Update Handoff Docs
+
+**Files:**
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+- Modify: `docs/superpowers/plans/2026-07-01-low-concurrency-phase41.md`
+
+- [x] **Step 1: Add Phase41 sections**
+
+Record artifact paths, preflight, gate evidence, serving table, and the D1/TTFT implication in all three handoff documents.
+
+- [x] **Step 2: Verify docs**
+
+Run:
+
+```bash
+git diff --check
+```
+
+- [x] **Step 3: Commit**
+
+Run:
+
+```bash
+git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
+        backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
+        backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md
+git add -f docs/superpowers/plans/2026-07-01-low-concurrency-phase41.md
+git commit -m "docs(paged): record low-concurrency serving check"
+```
diff --git a/docs/superpowers/plans/2026-07-01-max-concurrency-phase40.md b/docs/superpowers/plans/2026-07-01-max-concurrency-phase40.md
new file mode 100644
index 000000000000..8dbe486b4978
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-max-concurrency-phase40.md
@@ -0,0 +1,152 @@
+# Max Concurrency Phase40 Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Test whether the paged llama.cpp GB10 memory advantage produces a higher-concurrency serving operating point that closes or beats vLLM.
+
+**Architecture:** Use the existing same-session serving snapshot harness with pre/post inference gates. Add only a harness-level `BUILD_DIR` override so the benchmark builds and runs the same selected CMake tree.
+
+**Tech Stack:** Bash harness, DGX GB10, llama.cpp `llama-server`, vLLM OpenAI-compatible server, h2h client, `paged-inference-gates.sh`.
+
+---
+
+### Task 1: Make The Snapshot Harness Build The Selected Tree
+
+**Files:**
+- Modify: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`
+
+- [x] **Step 1: Write the failing check**
+
+Run:
+
+```bash
+backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help | grep -F 'BUILD_DIR'
+```
+
+Expected before the change: exit `1`.
+
+- [x] **Step 2: Add `BUILD_DIR`**
+
+Change the harness so:
+
+```bash
+BUILD_DIR=${BUILD_DIR:-"$SRC/build-cuda"}
+BIN=${BIN:-"$BUILD_DIR/bin"}
+```
+
+and build with:
+
+```bash
+cmake --build "$BUILD_DIR" --target llama-server llama-completion test-backend-ops -j 8
+```
+
+- [x] **Step 3: Verify locally**
+
+Run:
+
+```bash
+bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help | grep -F 'BUILD_DIR llama.cpp CMake build dir'
+```
+
+Expected: both exit `0`.
+
+- [x] **Step 4: Verify on DGX dry-run**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase40_max_concurrency_dryrun/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin ART=$ART NPL="128 192 256" PARALLEL=256 CTX=262144 PTOK=128 GEN=64 DRY_RUN=1 OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+```
+
+Observed artifact: `/home/mudler/bench/phase40_max_concurrency_dryrun/20260701_090002`.
+
+Expected evidence:
+
+```text
+docker=0
+local_ai_worker=0
+compute=0
+would build: cmake --build /home/mudler/llama-phase6-source/build-phase36 --target llama-server llama-completion test-backend-ops -j8
+```
+
+### Task 2: Run Max-Concurrency Snapshot With Correctness Gates
+
+**Files:**
+- Read: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh`
+- Artifact: `dgx:~/bench/phase40_max_concurrency/20260701_090012`
+
+- [x] **Step 1: Run the gated snapshot**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase40_max_concurrency/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin ART=$ART NPL="128 192 256" PARALLEL=256 CTX=262144 PTOK=128 GEN=64 OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+```
+
+- [x] **Step 2: Confirm pre/post inference gates**
+
+Observed:
+
+```text
+pre moe_md5 ok 8cb0ce23777bf55f92f63d0292c756b0
+pre dense_md5 ok 5951a5b4d624ce891e22ab5fca9bc439
+pre op_MUL_MAT ok 1146/1146
+pre op_MUL_MAT_ID ok 806/806
+post moe_md5 ok 8cb0ce23777bf55f92f63d0292c756b0
+post dense_md5 ok 5951a5b4d624ce891e22ab5fca9bc439
+post op_MUL_MAT ok 1146/1146
+post op_MUL_MAT_ID ok 806/806
+```
+
+- [x] **Step 3: Record serving result**
+
+Observed:
+
+```text
+arm n agg_tps decode_agg_tps decode_perseq_tps prefill_tps ttft_mean_ms
+paged 128 326.3 671.8 3.97 1695.2 8182.3
+paged 192 318.3 679.9 2.50 1605.2 11151.6
+paged 256 337.1 829.9 2.09 1525.7 15065.7
+vllm 128 654.4 1013.3 6.72 5206.0 2582.6
+vllm 192 697.7 1185.2 4.88 4787.1 3690.6
+vllm 256 714.1 1306.1 3.90 4471.0 5124.2
+```
+
+- [x] **Step 4: Record decision**
+
+Decision: C1 does not close GB10 parity for the tested `PTOK=128`, `GEN=64`, `NPL=128/192/256` workload. Paged runs safely at `n=256`, but vLLM also fits and remains faster (`paged_decode_over_vllm=0.6354`, `paged_agg_over_vllm=0.4721`).
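
The two ratios quoted in the Phase40 decision appear to come from the `n=256` rows of the serving table. A minimal sketch that recomputes them, assuming the table is available as whitespace-separated text in the column order printed in the plan; the `ratio_at` helper and the inline copy of the table are illustrative and not part of the harness:

```bash
#!/usr/bin/env bash
# Sketch: recompute paged/vLLM ratios from the Phase40 serving table.
# Assumption: columns are arm, n, agg_tps, decode_agg_tps, ... as in the plan.
set -euo pipefail

table='arm n agg_tps decode_agg_tps decode_perseq_tps prefill_tps ttft_mean_ms
paged 128 326.3 671.8 3.97 1695.2 8182.3
paged 192 318.3 679.9 2.50 1605.2 11151.6
paged 256 337.1 829.9 2.09 1525.7 15065.7
vllm 128 654.4 1013.3 6.72 5206.0 2582.6
vllm 192 697.7 1185.2 4.88 4787.1 3690.6
vllm 256 714.1 1306.1 3.90 4471.0 5124.2'

# ratio_at <n> <column-index>: paged value divided by the vLLM value
# for the same concurrency n (hypothetical helper, 1-based column index).
ratio_at() {
  printf '%s\n' "$table" | awk -v n="$1" -v col="$2" '
    $1 == "paged" && $2 == n { p = $col }
    $1 == "vllm"  && $2 == n { v = $col }
    END { printf "%.4f\n", p / v }'
}

echo "paged_decode_over_vllm=$(ratio_at 256 4)"   # decode_agg_tps column
echo "paged_agg_over_vllm=$(ratio_at 256 3)"      # agg_tps column
```

Calling `ratio_at 128 4` or `ratio_at 192 4` gives the same comparison at the other tested concurrency levels.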
+
+### Task 3: Update Handoff Docs
+
+**Files:**
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md`
+- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md`
+- Modify: `docs/superpowers/plans/2026-07-01-max-concurrency-phase40.md`
+
+- [x] **Step 1: Add Phase40 sections**
+
+Record artifact paths, gate evidence, throughput table, and C1 decision in all three handoff documents.
+
+- [x] **Step 2: Verify docs and script**
+
+Run:
+
+```bash
+bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh
+git diff --check
+```
+
+- [x] **Step 3: Commit**
+
+Run:
+
+```bash
+git add backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh \
+        backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \
+        backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \
+        backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \
+        docs/superpowers/plans/2026-07-01-max-concurrency-phase40.md
+git commit -m "docs(paged): record max-concurrency parity check"
+```
diff --git a/docs/superpowers/plans/2026-07-01-mmid-route-trace-phase34.md b/docs/superpowers/plans/2026-07-01-mmid-route-trace-phase34.md
new file mode 100644
index 000000000000..d3b7ffa02461
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-mmid-route-trace-phase34.md
@@ -0,0 +1,49 @@
+# Phase 34: MMID Route Trace
+
+**Goal:** Add a default-off `MUL_MAT_ID` route classifier so serving traces can prove whether current n128 MoE inference uses graph-safe MMVQ/grouped-MMQ paths or the host-sync fallback.
+
+**Scope:** llama.cpp fork first, then LocalAI patch `0060`. Instrumentation only; no route or numeric behavior change.
+
+## Plan
+
+- [x] Inspect the current `ggml_cuda_mul_mat_id` dispatch order.
+- [x] Add a failing host-only test for route classification and trace formatting.
+- [x] Implement `ggml_cuda_mmid_route_shape_make()` and formatter in the existing CUDA trace helper.
+- [x] Wire `LLAMA_MOE_MMID_ROUTE_TRACE=` in `ggml_cuda_mul_mat_id` using the same predicates as dispatch.
+- [x] Build and run `test-cuda-mmq-shape-trace` locally.
+- [x] Build `llama-server`, `llama-completion`, `test-backend-ops`, and `test-cuda-mmq-shape-trace` on DGX.
+- [x] Run default-off and trace-enabled md5/op gates.
+- [x] Run n128 serving trace and parse route counts.
+- [x] Run post-serving md5/op gates.
+- [x] Commit fork and DGX mirror, export LocalAI patch `0060`.
+- [x] Update README, parity docs, handoff, and patch maintenance.
+- [x] Re-run strict patch-series mirror invariant.
+
+## Results
+
+Artifact: `/home/mudler/bench/phase34_mmid_route_trace/20260701_072737`.
+
+Commits:
+
+- Fork: `6c332094c feat(cuda): trace moe mmid routes`
+- DGX mirror: `34a256d14 feat(cuda): trace moe mmid routes`
+- LocalAI patch: `backend/cpp/llama-cpp-localai-paged/patches/paged/0060-feat-cuda-trace-moe-mmid-routes.patch`
+
+Gates:
+
+- Default-off MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+- Default-off dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+- Trace-enabled MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+- Trace-enabled dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+- Post-serving MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+- Post-serving dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+- `MUL_MAT_ID`: `806/806` in default, trace-enabled, and post-serving gates
+
+n128 route trace:
+
+- `mmq`: 2776
+- `mmvq`: 1320
+- `host_sync=0`: 4096
+- Top shapes: `mmq ne2=12` 1096, `mmq ne2=18` 480, `mmvq ne2=8` 360
+
+Decision: host-sync fallback is not firing in the current n128 serving path. The next phase should not chase fallback avoidance; it should either target grouped-MMQ small-M internal partitioning or pivot to the next measured bottleneck.
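
The Phase34 decision rests on the reduced route counts adding up: every traced line is either `mmq` or `mmvq`, and all of them carry `host_sync=0`. A small sketch of that consistency check, working only from the counts recorded in the results; the variable names and the `share` helper are illustrative, and it assumes the `host_sync=0` count covers the same lines as the route counts:

```bash
#!/usr/bin/env bash
# Sketch: sanity-check the reduced Phase34 n128 route counts.
# Assumption: mmq, mmvq, and host_sync=0 were counted over the same trace lines.
set -euo pipefail

mmq=2776             # grouped-MMQ route lines
mmvq=1320            # MMVQ route lines
host_sync_zero=4096  # lines reporting host_sync=0

# share <count> <total>: percentage with one decimal place (hypothetical helper).
share() { awk -v c="$1" -v t="$2" 'BEGIN { printf "%.1f\n", 100 * c / t }'; }

total=$((mmq + mmvq))
fallback=$((total - host_sync_zero))

echo "routes_total=$total"
echo "mmq_share_pct=$(share "$mmq" "$total")"
echo "mmvq_share_pct=$(share "$mmvq" "$total")"
echo "host_sync_fallback_lines=$fallback"
```

A nonzero `host_sync_fallback_lines` would mean some traced calls took the fallback path, which is the case the decision rules out for this workload.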
diff --git a/docs/superpowers/plans/2026-07-01-mmq-launch-trace-phase31.md b/docs/superpowers/plans/2026-07-01-mmq-launch-trace-phase31.md
new file mode 100644
index 000000000000..511e1e915e77
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-mmq-launch-trace-phase31.md
@@ -0,0 +1,72 @@
+# MMQ Launch Trace Phase 31 Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Extend the default-off MoE MMQ trace so decode serving records launch-shape, stream-k block, efficiency, and fixup facts without changing inference behavior.
+
+**Architecture:** Keep patch `0056` selector tracing intact and add a second bounded trace line inside `launch_mul_mat_q`, where the actual stream-k block policy and `fixup_needed` are known. The new helper is host-only and tested without CUDA execution; DGX gates validate that default-off and trace-enabled inference md5/op outputs are unchanged.
+
+**Tech Stack:** llama.cpp CUDA backend, host-only C++ unit test, LocalAI paged patch series, DGX GB10 gate scripts.
+
+---
+
+## Files
+
+- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmq-shape-trace.h`
+  - Add `ggml_cuda_mmq_launch_shape`, make/format helpers for launch metrics.
+- Modify: `/home/mudler/_git/llama.cpp/tests/test-cuda-mmq-shape-trace.cpp`
+  - Add host-only assertions for launch trace formatting.
+- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmq.cuh`
+  - Emit `[LLAMA_MOE_MMQ_LAUNCH]` when `LLAMA_MOE_MMQ_SHAPE_TRACE` is enabled and grouped-MMQ uses stream-k.
+- Create: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0057-feat-cuda-trace-moe-mmq-launch-shapes.patch`
+  - Mirror the fork commit as the next incremental patch.
+- Modify docs in `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/`
+  - README patch table, GB10 results, lever map, handoff, patch maintenance.
+
+## Checklist
+
+- [x] **Step 1: Write RED host test**
+  - Add assertions in `tests/test-cuda-mmq-shape-trace.cpp` that call `ggml_cuda_mmq_launch_shape_make` and expect formatted fields: `ntiles_dst`, `stream_k_blocks`, `tiles_efficiency`, `fixup`, `nsm`, `ntx`, `nty`, `ntzw`.
+  - Run: `cmake --build build --target test-cuda-mmq-shape-trace -j 4`
+  - Expected: compile failure because the launch helper does not exist yet.
+
+- [x] **Step 2: Implement host launch trace helper**
+  - Add `ggml_cuda_mmq_launch_shape` plus make/format helpers in `mmq-shape-trace.h`.
+  - Re-run: `cmake --build build --target test-cuda-mmq-shape-trace -j 4 && ./build/bin/test-cuda-mmq-shape-trace`
+  - Expected: test passes.
+
+- [x] **Step 3: Wire bounded launch trace**
+  - In `launch_mul_mat_q`, after `fixup_needed` is known, emit `[LLAMA_MOE_MMQ_LAUNCH]` only when `args.expert_bounds != nullptr`, `args.use_stream_k`, and `LLAMA_MOE_MMQ_SHAPE_TRACE` limit allows it.
+  - Use a separate static counter from selector trace so the user can see up to N selector and N launch lines.
+
+- [x] **Step 4: Build and gate on DGX**
+  - Preflight: verify `docker=0`, `local_ai_worker=0`, `compute=0`, and take the owner lock.
+  - Build: `cmake --build build-cuda --target llama-completion test-backend-ops test-cuda-mmq-shape-trace -j $(nproc)`
+  - Default-off gate expected: MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID 806/806`.
+  - Trace gate expected: same md5/op values with bounded shape and launch trace lines.
+
+- [x] **Step 5: Run n128 serving launch trace**
+  - Run h2h n128 with `LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 LLAMA_MOE_MMQ_SHAPE_TRACE=4096`.
+  - Parse `[LLAMA_MOE_MMQ_LAUNCH]` lines into decode-like and prefill-like buckets.
+  - Decide whether a no-fixup/no-stream-k shortcut is justified from measured `stream_k_blocks`, `tiles_efficiency`, and `fixup`.
+
+- [x] **Step 6: Mirror patch and update docs**
+  - Commit llama.cpp fork.
+  - Generate LocalAI patch `0057` from the fork commit.
+  - Verify strict patch-series application reaches the fork tree.
+  - Mark this plan complete with artifact path and gate results.
+  - Commit LocalAI docs and patch with `Assisted-by: Codex:gpt-5`.
+
+## Result
+
+- Fork commit: `/home/mudler/_git/llama.cpp` `c78e537b5 feat(cuda): trace moe mmq launch shapes`.
+- DGX mirror commit: `dgx:~/llama-phase6-source` `8b75905e9 feat(cuda): trace moe mmq launch shapes`.
+- Artifact: `/home/mudler/bench/phase31_mmq_launch_trace/20260701_064424`.
+- RED verified: `cmake --build build --target test-cuda-mmq-shape-trace -j 4` failed on missing `ggml_cuda_mmq_launch_shape`.
+- GREEN verified locally: `cmake --build build --target test-cuda-mmq-shape-trace -j 4 && ./build/bin/test-cuda-mmq-shape-trace`.
+- DGX CUDA build verified: `llama-server`, `llama-completion`, `test-backend-ops`, and `test-cuda-mmq-shape-trace`.
+- Default-off, trace-enabled, and post-serving gates all matched MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`.
+- n128 traced serving: `decode_agg_tps=691.0`, `agg_tps=337.0`, `prefill_tps=1500.4`, `TTFT mean=7671.0 ms`.
+- Launch result: decode-like `4800/4800` and prefill-like `4920/4920` launch lines had `fixup=0` and `stream_k_blocks == ntiles_dst`.
+
+Decision: do not pursue a no-fixup/no-stream-k shortcut for the current n128 workload. The actual launch path is already taking conventional stream-k tiling with no fixup; the remaining grouped-MMQ gap is the small-M tile/kernel shape itself, not stream-k fixup overhead.
diff --git a/docs/superpowers/plans/2026-07-01-mmq-occupancy-phase28.md b/docs/superpowers/plans/2026-07-01-mmq-occupancy-phase28.md
new file mode 100644
index 000000000000..3b2606ca9883
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-mmq-occupancy-phase28.md
@@ -0,0 +1,85 @@
+# MMQ Occupancy Phase 28 Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use
+> superpowers:subagent-driven-development (recommended) or
+> superpowers:executing-plans to implement this plan task-by-task. Steps use
+> checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Test the remaining low-conflict NVFP4 grouped-MMQ occupancy knobs
+against the current GB10 serving gap, with md5/op gates before accepting any
+performance result.
+
+**Architecture:** Build-vs-build A/B only. The knobs are existing default-off
+compile-time macros in the llama.cpp fork, so this phase does not edit source
+unless a variant clears the serving gate.
+
+**Tech Stack:** DGX GB10, llama.cpp CUDA backend, `paged-inference-gates.sh`,
+h2h n128 serving client, LocalAI parity docs.
+
+---
+
+## Checklist
+
+- [x] **Step 1: Confirm candidate scope**
+  - Projection/FP8 follow-up was rejected by source/docs review: it is already
+    documented as too small or KL-failing.
+  - The remaining small candidate was NVFP4 MMQ occupancy:
+    `GGML_CUDA_FP4_MINBLOCKS=2` and `GGML_CUDA_FP4_MMQ_Y`.
+
+- [x] **Step 2: Check DGX preflight**
+  - `docker=0`
+  - `local_ai_worker=0`
+  - `compute=0`
+  - GPU owner file was `FREE`.
+
+- [x] **Step 3: Run baseline inference gates**
+  - Artifact:
+    `/home/mudler/bench/phase28_mmq_occupancy/20260701_040450/gate_baseline`
+  - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+  - `MUL_MAT_ID`: `806/806`
+
+- [x] **Step 4: Build and gate `GGML_CUDA_FP4_MINBLOCKS=2`**
+  - Build dir: `/home/mudler/llama-phase6-source/build-phase28-minblocks2`
+  - Artifact:
+    `/home/mudler/bench/phase28_mmq_occupancy/20260701_040450/gate_minblocks2`
+  - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+  - `MUL_MAT_ID`: `806/806`
+
+- [x] **Step 5: Try `GGML_CUDA_FP4_MMQ_Y=64`**
+  - Build dir: `/home/mudler/llama-phase6-source/build-phase28-mmqy64`
+  - Result: compile-time reject.
+  - Failure invariant: `static_assert(nwarps*tile_C::I == mmq_y)`.
+  - Decision: do not run combined `MMQ_Y=64+MINBLOCKS=2`; the row-tile
+    specialization is invalid before serving can be measured.
+
+- [x] **Step 6: Run same-session n128 serving A/B for the viable variant**
+  - Artifact:
+    `/home/mudler/bench/phase28_mmq_occupancy/20260701_040450/serving_ab`
+  - Baseline mean, two reps: `decode_agg_tps=705.1`,
+    `decode_perseq_tps=3.970`, `agg_tps=328.8`.
+  - `MINBLOCKS=2` mean, two reps: `decode_agg_tps=689.9`,
+    `decode_perseq_tps=3.905`, `agg_tps=326.4`.
+  - Ratio: `decode_agg_tps=0.9784`, `decode_perseq_tps=0.9836`,
+    `agg_tps=0.9927`.
+
+- [x] **Step 7: Run post-serving inference gates**
+  - Artifact:
+    `/home/mudler/bench/phase28_mmq_occupancy/20260701_040450/gate_minblocks2_post_serving`
+  - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+  - `MUL_MAT_ID`: `806/806`
+
+- [x] **Step 8: Record decision**
+  - `MINBLOCKS=2` is inference-safe but rejected on throughput.
+  - `MMQ_Y` is rejected as a low-conflict shortcut because the current NVFP4
+    writeback specialization only accepts the stock row tile.
+  - No llama.cpp source patch or LocalAI patch mirror is justified.
+
+## Result
+
+Phase 28 closes the small NVFP4 MMQ occupancy branch. The only buildable knob
+kept md5/op gates green but regressed n128 decode serving, and the row-tile knob
+does not compile against the current specialization. Future grouped-MMQ work
+must be structural kernel work, not a launch-bounds or row-tile build tweak.
diff --git a/docs/superpowers/plans/2026-07-01-mmq-shape-serving-phase30.md b/docs/superpowers/plans/2026-07-01-mmq-shape-serving-phase30.md
new file mode 100644
index 000000000000..6c59119a8f0a
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-mmq-shape-serving-phase30.md
@@ -0,0 +1,51 @@
+# MMQ Shape Serving Phase 30 Plan
+
+> **For agentic workers:** Use verification-before-completion before claiming
+> trace or gate results. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Use patch `0056` to collect live grouped-MMQ selector shapes under the
+n128 serving workload and derive the next structural-kernel target shape.
+
+**Architecture:** Measurement-only. Start `llama-server` with
+`LLAMA_MOE_MMQ_SHAPE_TRACE=4096`, run h2h n128, parse the server log, then run
+post-serving md5/op gates.
+
+**Tech Stack:** DGX GB10, llama.cpp CUDA backend, h2h client,
+`paged-inference-gates.sh`.
+
+---
+
+## Checklist
+
+- [x] **Step 1: Check DGX preflight and lock**
+  - `docker=0`
+  - `local_ai_worker=0`
+  - `compute=0`
+  - owner file set to `codex-phase30-mmq-shape-serving`
+
+- [x] **Step 2: Run traced n128 serving workload**
+  - Artifact: `/home/mudler/bench/phase30_mmq_shape_serving/20260701_043300`
+  - Source: `dgx:~/llama-phase6-source`, commit `826c97a05`
+  - Env: `LLAMA_MOE_MMQ_SHAPE_TRACE=4096`
+  - h2h result: `decode_agg_tps=645.8`, `agg_tps=313.3`,
+    `prefill_tps=1597.9`, `TTFT mean=8192.3 ms`
+
+- [x] **Step 3: Parse trace distribution**
+  - Total traced calls: `4096`
+  - Decode-like (`ncols_max <= 128`): `1200`
+  - Prefill-like (`ncols_max > 128`): `2896`
+  - Decode-like selected `mmq_x_best` only in `{32,40,48,64}` with density
+    `1-4`.
+  - Prefill-like was mostly density `16` with `mmq_x_best=128`.
+  - `stream_k=1` for all traced calls.
+
+- [x] **Step 4: Run post-serving inference gates**
+  - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+  - `MUL_MAT_ID`: `806/806`
+
+## Result
+
+The next grouped-MMQ structural experiment should target decode small-M shapes
+separately from prefill: `ncols_max` 26-111, density 1-4, selected tile <= 64,
+with stream-k/fixup behavior accounted for.
diff --git a/docs/superpowers/plans/2026-07-01-mmq-shape-trace-phase29.md b/docs/superpowers/plans/2026-07-01-mmq-shape-trace-phase29.md
new file mode 100644
index 000000000000..2815d55332dc
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-mmq-shape-trace-phase29.md
@@ -0,0 +1,64 @@
+# MMQ Shape Trace Phase 29 Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use
+> superpowers:test-driven-development for source changes and
+> superpowers:verification-before-completion before claiming the phase is green.
+> Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Add a default-off, md5-safe MoE grouped-MMQ shape trace so the next
+structural grouped-MMQ kernel can be sized from live serving evidence.
+
+**Architecture:** Host-side instrumentation only. The trace records selector
+inputs and estimated density at `mul_mat_q_case`, without reading device
+`expert_bounds` or adding synchronization.
+
+**Tech Stack:** llama.cpp CUDA backend, local host-only unit test, DGX CUDA
+build, `paged-inference-gates.sh`.
+
+---
+
+## Checklist
+
+- [x] **Step 1: Write the RED test**
+  - Added `tests/test-cuda-mmq-shape-trace.cpp`.
+  - First build failed on missing `ggml-cuda/mmq-shape-trace.h`, proving the
+    test covered the new API before implementation.
+
+- [x] **Step 2: Implement the minimal helper**
+  - Added `ggml/src/ggml-cuda/mmq-shape-trace.h`.
+  - Helper computes `n_active_est`, `density`, and formats a stable trace line.
+
+- [x] **Step 3: Wire default-off instrumentation**
+  - Added `LLAMA_MOE_MMQ_SHAPE_TRACE=` in `mmq.cuh`.
+  - Trace is capped by the env value; nonnumeric truthy values default to 256.
+  - Env unset or `0` stays silent.
+
+- [x] **Step 4: Verify local GREEN**
+  - `cmake --build build --target test-cuda-mmq-shape-trace -j 4`
+  - `./build/bin/test-cuda-mmq-shape-trace`
+
+- [x] **Step 5: Verify DGX CUDA build**
+  - Artifact: `/home/mudler/bench/phase29_mmq_shape_trace/20260701_042428`
+  - `cmake --build build-cuda --target llama-completion test-backend-ops test-cuda-mmq-shape-trace`
+
+- [x] **Step 6: Run default-off inference gates**
+  - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+  - `MUL_MAT_ID`: `806/806`
+
+- [x] **Step 7: Run trace-enabled inference gates**
+  - `EXTRA_ENV=LLAMA_MOE_MMQ_SHAPE_TRACE=4`
+  - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`
+  - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`
+  - `MUL_MAT_ID`: `806/806`
+  - Trace lines: `4`
+
+- [x] **Step 8: Mirror into LocalAI**
+  - Fork commit: `20a99518a feat(cuda): trace moe mmq batch shapes`
+  - LocalAI patch: `0056-feat-cuda-trace-moe-mmq-batch-shapes.patch`
+
+## Result
+
+Phase 29 is instrumentation-only. It does not claim a speed win, but it gives a
+bounded and gate-safe way to collect grouped-MMQ selector shape evidence for the
+next structural kernel phase.
diff --git a/docs/superpowers/plans/2026-07-01-moe-min32-repeat-vllm-phase59.md b/docs/superpowers/plans/2026-07-01-moe-min32-repeat-vllm-phase59.md
new file mode 100644
index 000000000000..b71bf840406d
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-moe-min32-repeat-vllm-phase59.md
@@ -0,0 +1,75 @@
+# Phase 59: MoE Min32 Repeat and vLLM H2H
+
+## Goal
+
+Repeat the Phase58 MoE `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` result in a
+fresh DGX window, then compare against a matching vLLM `n=128`, `ptok=128`,
+`gen=64` serving run.
+
+## Patch Under Test
+
+The temporary DGX patch stack was generated from the local llama.cpp fork
+through:
+
+- `8759213e3 feat(server): gate TTFT defer by prompt backlog`
+
+The patch was applied to the clean DGX mirror for llama.cpp runs, then reverted
+before the vLLM run.
+
+## Verification
+
+Pre and post llama gates stayed green:
+
+| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` |
+|-------|---------|-----------|-----------|--------------|
+| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+| post llama | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` |
+
+## Results
+
+Artifact:
+
+- `/home/mudler/bench/phase59_moe_min32_repeat_vllm/20260701_123147`
+
+MoE `n=128`, `ptok=128`, `gen=64`:
+
+| engine / variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
+|------------------|---------|-----------------|-------------|--------------|-------------|--------|----------|
+| llama default | `336.6` | `646.7` | `1525.1` | `7798.5` | `11666.8` | `24.334` | `0` |
+| llama min32 | `336.9` | `632.0` | `1567.1` | `7167.8` | `11353.4` | `24.316` | `279` |
+| vLLM | `601.3` | `938.8` | `3648.7` | `2968.1` | `4871.6` | `13.563` | n/a |
+
+Min32 repeat delta versus llama default:
+
+- Aggregate throughput: `+0.1%`
+- Mean TTFT: `-8.1%`
+- Max TTFT: `-2.7%`
+- Wall time: `-0.1%`
+- Prefill throughput: `+2.8%`
+- Decode aggregate throughput: `-2.3%`
+
+Llama min32 versus vLLM:
+
+- Aggregate throughput ratio: `0.560`
+- Mean TTFT: llama is `2.415x` slower
+- Wall time: llama is `1.793x` slower
+- Prefill throughput ratio: `0.430`
+- Decode aggregate throughput ratio: `0.673`
+
+## Decision
+
+`LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING=32` repeated as a real, inference-gated
+llama.cpp scheduler QoS improvement for MoE `n=128`: it cuts mean TTFT without
+moving aggregate throughput or wall time materially.
+
+It is not a vLLM parity lever by itself. vLLM remains far ahead on the same
+serving shape, especially prefill and TTFT. Keep the scheduler path opt-in and
+treat it as user-visible latency tuning while parity work returns to the larger
+prefill / MoE compute gap.
+
+## Status
+
+- Phase59 docs recorded.
+- DGX lock released as `FREE phase59-cleanup`.
+- No push performed.
+- LocalAI `patches/paged/` not regenerated.
diff --git a/docs/superpowers/plans/2026-07-01-mtp-draft-smoke-phase9.md b/docs/superpowers/plans/2026-07-01-mtp-draft-smoke-phase9.md
new file mode 100644
index 000000000000..0dc699b2e0b1
--- /dev/null
+++ b/docs/superpowers/plans/2026-07-01-mtp-draft-smoke-phase9.md
@@ -0,0 +1,243 @@
+# MTP Draft Smoke Phase 9 Implementation Plan
+
+> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking.
+
+**Goal:** Prove the current Qwen3.6 MoE GGUF can exercise llama.cpp `draft-mtp` without breaking normal inference, then keep only the smallest fix needed for the smoke path.
+
+**Architecture:** Phase 9 is an opt-in speculative-decoding gate, not a default serving feature. The production patch only disables backend draft sampling for MTP because the current backend sampler rejects verification batches with multiple output rows per sequence; target verification and normal greedy inference remain unchanged.
+
+**Tech Stack:** llama.cpp common speculative runtime, Qwen3.6 MoE NVFP4 GGUF, DGX GB10 CUDA build, canonical LocalAI md5 gates.
+
+---
+
+## Guardrails
+
+- Keep the patch incremental and additive in the llama.cpp fork.
+- Do not enable MTP by default in LocalAI or llama-server.
+- Do not enable backend draft sampling for MTP until it supports multi-output verification batches.
+- Treat canonical md5 gates as mandatory after any runtime change:
+  - MoE: `8cb0ce23777bf55f92f63d0292c756b0`.
+  - Dense: `5951a5b4d624ce891e22ab5fca9bc439`.
+- Record every DGX artifact under `/home/mudler/bench/phase9_mtp_smoke/`.
+
+## Task 1: Verify MTP Assets
+
+**Files:**
+- Read-only: `/home/mudler/bench/q36-35b-a3b-nvfp4.gguf`
+
+- [x] **Step 1: Check DGX is free**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -e
+echo docker=$(docker ps -q | wc -l)
+echo local_ai_worker=$(docker ps --format "{{.Names}}" | grep -c local-ai-worker || true)
+echo compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l)
+if [ -f ~/gpu_bench_lock/owner ]; then cat ~/gpu_bench_lock/owner; else echo FREE-no-lock-file; fi'
+```
+
+Result:
+
+```text
+docker=0
+local_ai_worker=0
+compute=0
+FREE released-by-codex-phase6-mmq-grid 1782860601
+```
+
+- [x] **Step 2: Confirm nextn tensors exist**
+
+Run:
+
+```bash
+ssh dgx.casa 'strings /home/mudler/bench/q36-35b-a3b-nvfp4.gguf | grep -i -E "nextn|mtp" | head -n 80 || true'
+```
+
+Result includes:
+
+```text
+qwen35moe.nextn_predict_layers
+blk.40.nextn.eh_proj.weight
+blk.40.nextn.shared_head_norm.weight
+blk.40.nextn.enorm.weight
+blk.40.nextn.hnorm.weight
+```
+
+## Task 2: Reproduce the Runtime Failure
+
+**Files:**
+- Read-only: `/home/mudler/llama-phase6-source/build-cuda/bin/llama-speculative-simple`
+- Artifact: `/home/mudler/bench/phase9_mtp_smoke/mtp_smoke.out`
+- Artifact: `/home/mudler/bench/phase9_mtp_smoke/mtp_smoke.err`
+
+- [x] **Step 1: Build the narrow speculative target**
+
+Run:
+
+```bash
+ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda && cmake --build . --target llama-speculative-simple -j 8'
+```
+
+Result: `Built target llama-speculative-simple`.
+
+- [x] **Step 2: Run default `draft-mtp` smoke before patch**
+
+Run:
+
+```bash
+ssh dgx.casa 'cd /home/mudler/llama-phase6-source
+ART=$HOME/bench/phase9_mtp_smoke
+mkdir -p "$ART"
+timeout 180s env LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_CHUNK_MIN=64 GDN_TC=5 GGML_NO_BACKTRACE=1 \
+  ./build-cuda/bin/llama-speculative-simple \
+  -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf \
+  --spec-type draft-mtp --spec-draft-model /home/mudler/bench/q36-35b-a3b-nvfp4.gguf --spec-draft-ngl 99 --spec-draft-n-max 1 \
+  -ngl 99 -fa on -c 4096 -n 8 --temp 0 --seed 1 -p "The capital of France is" \
+  > "$ART/mtp_smoke.out" 2> "$ART/mtp_smoke.err"'
+```
+
+Result: process exits `0` but stderr contains repeated backend sampler failures:
+
+```text
+decode: backend sampling requires at most one output token per sequence (seq_id 0 had 2)
+```
+
+- [x] **Step 3: Prove the expected behavior with backend sampling disabled**
+
+Run the same command with `--no-spec-draft-backend-sampling`.
+ +Artifact: + +- `/home/mudler/bench/phase9_mtp_smoke/mtp_smoke_no_backend_sampling.out` +- `/home/mudler/bench/phase9_mtp_smoke/mtp_smoke_no_backend_sampling.err` + +Result: + +```text +n_drafted = 5 +n_accept = 4 +accept = 80.000% +``` + +Output tail: + +```text +The capital of France is Paris, a city renowned for its rich history +``` + +## Task 3: Disable Backend Sampling for MTP + +**Files:** +- Modify: `/home/mudler/_git/llama.cpp/common/speculative.cpp` +- Mirror: `/home/mudler/llama-phase6-source/common/speculative.cpp` + +- [x] **Step 1: Add the guard in the fork** + +Patch: + +```cpp +if (this->params.backend_sampling) { + LOG_WRN("%s: backend draft sampling is disabled for MTP; verification batches can request multiple output rows per sequence\n", + __func__); + this->params.backend_sampling = false; +} +``` + +- [x] **Step 2: Mirror to DGX and rebuild** + +Run: + +```bash +rsync -a /home/mudler/_git/llama.cpp/common/speculative.cpp dgx.casa:/home/mudler/llama-phase6-source/common/speculative.cpp +ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda && cmake --build . --target llama-speculative-simple llama-completion -j 8' +``` + +Result: both targets built. + +- [x] **Step 3: Re-run default `draft-mtp` smoke** + +Artifact: + +- `/home/mudler/bench/phase9_mtp_smoke/mtp_smoke_default_after_patch.out` +- `/home/mudler/bench/phase9_mtp_smoke/mtp_smoke_default_after_patch.err` + +Result: + +```text +rc=0 +MTP_BACKEND_DISABLED_WARN +n_drafted = 5 +n_accept = 4 +accept = 80.000% +``` + +The backend sampler error is absent after the guard. 
+ +## Task 4: Normal Inference Gates + +**Files:** +- Artifact: `/home/mudler/bench/phase9_mtp_smoke/gate_moe_after_patch.txt` +- Artifact: `/home/mudler/bench/phase9_mtp_smoke/gate_dense_after_patch.txt` + +- [x] **Step 1: Run canonical MoE md5** + +Run: + +```bash +ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda/bin +ART=$HOME/bench/phase9_mtp_smoke +L="LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_CHUNK_MIN=64 GDN_TC=5 GGML_NO_BACKTRACE=1" +env $L ./llama-completion -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -ngl 99 -fa on -c 4096 --temp 0 --seed 1 -n 48 -p "The capital of France is" > "$ART/gate_moe_after_patch.txt" +md5sum "$ART/gate_moe_after_patch.txt"' +``` + +Result: + +```text +8cb0ce23777bf55f92f63d0292c756b0 +``` + +- [x] **Step 2: Run canonical dense md5** + +Run the same command with `/home/mudler/bench/q36-27b-nvfp4.gguf`. + +Result: + +```text +5951a5b4d624ce891e22ab5fca9bc439 +``` + +## Task 5: Mirror Patch Stack + +**Files:** +- Create: `backend/cpp/llama-cpp-localai-paged/patches/paged/0054-fix-speculative-disable-backend-sampling-for-MTP-drafts.patch` + +- [x] **Step 1: Commit in llama.cpp fork** + +Fork commit: + +```text +3eba64aff fix(speculative): disable backend sampling for MTP drafts +``` + +DGX mirror commit: + +```text +3a714c6f9 fix(speculative): disable backend sampling for MTP drafts +``` + +- [x] **Step 2: Generate LocalAI patch** + +Run: + +```bash +git -C /home/mudler/_git/llama.cpp format-patch -1 --stdout > \ + /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0054-fix-speculative-disable-backend-sampling-for-MTP-drafts.patch +``` + +## Follow-up Scope + +- MTP remains opt-in and smoke-gated only; do not promote it to default serving. +- A production MTP serving phase must add a server/API gate and a hybrid state rollback test before benchmark claims. 
+- The next GDN phase should be separate: C=32, `dv_tile=64`, M5-style chunked prefill slab prototype, compared against current M5 at `npp=512` and `npp=2048`. diff --git a/docs/superpowers/plans/2026-07-01-mtp-graph-profile-phase16.md b/docs/superpowers/plans/2026-07-01-mtp-graph-profile-phase16.md new file mode 100644 index 000000000000..fa78005a7a0c --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-mtp-graph-profile-phase16.md @@ -0,0 +1,154 @@ +# MTP Graph-Reuse Profile Phase 16 Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use +> superpowers:systematic-debugging before proposing source changes. Steps use +> checkbox (`- [ ]`) syntax for tracking. + +**Goal:** validate the Phase 15 hypothesis that current MTP serving regresses +because speculative verification disrupts paged CUDA graph reuse and increases +GPU work. + +**Architecture:** capture a small direct `llama-server` baseline/MTP pair under +`nsys --cuda-graph-trace=node`, using the same request shape except for +`--spec-type draft-mtp`. Do not patch code in this phase. + +**Tech Stack:** llama.cpp `llama-server`, Nsight Systems, DGX GB10, +`h2h_cli3.py`. 
+ +--- + +## Task 1: Preflight + +- [x] **Step 1: Confirm DGX is free** + + Result: + + ```text + docker=0 + local_ai_worker=0 + compute=0 + FREE released-by-codex-phase15-mtp-serving-bench 1782872749 + ``` + +- [x] **Step 2: Confirm profiler is available** + + Result: + + ```text + /usr/local/bin/nsys + ``` + +## Task 2: Capture Baseline and MTP Profiles + +- [x] **Step 1: Run baseline profile** + + Command shape: + + ```bash + nsys profile --force-overwrite=true --cuda-graph-trace=node \ + --trace=cuda,nvtx,osrt --output="$ART/baseline/profile" \ + ./llama-server -m "$MODEL" -ngl 99 -fa on -c 32768 -b 2048 -ub 512 \ + --parallel 32 --host 127.0.0.1 --port 8098 --no-webui + ``` + + Client: + + ```bash + python3 ~/bench/h2h_cli3.py --url http://127.0.0.1:8098/v1/completions \ + --model m -n 8 --ptok 64 --gen 64 --no-cache + ``` + +- [x] **Step 2: Run MTP profile** + + Same as baseline plus: + + ```text + --spec-type draft-mtp --spec-draft-n-max 3 --no-spec-draft-backend-sampling + ``` + +- [x] **Step 3: Save artifacts** + + Artifact root: + + - `/home/mudler/bench/phase16_mtp_graph_profile/20260701_043016` + + Files: + + - `baseline/profile.nsys-rep` + - `baseline/profile.sqlite` + - `baseline/nsys_stats.txt` + - `baseline/client.json` + - `baseline/key_lines.txt` + - `mtp/profile.nsys-rep` + - `mtp/profile.sqlite` + - `mtp/nsys_stats.txt` + - `mtp/client.json` + - `mtp/key_lines.txt` + +## Task 3: Compare Evidence + +- [x] **Step 1: Compare client throughput** + + Result: + + ```text + baseline n=8: decode_agg_tps=230.5, decode_perseq_tps=28.07, wall_s=3.523 + MTP n=8: decode_agg_tps= 97.7, decode_perseq_tps=12.83, wall_s=7.049 + ``` + +- [x] **Step 2: Compare graph reuse** + + Result: + + ```text + baseline: graphs reused = 62 + MTP: graphs reused = 1 + ``` + +- [x] **Step 3: Confirm MTP actually drafted** + + Result: + + ```text + common_speculative_impl_draft_mtp: - n_max=3, n_min=0, p_min=0.00 + draft acceptance = 0.81481 (44 accepted / 54 generated) + 
statistics draft-mtp: #gen tokens = 460, #acc tokens = 346 + ``` + +- [x] **Step 4: Compare GPU work** + + `nsys stats` kernel summaries show materially more GPU work for the MTP run: + + - baseline top kernel summary total is about `2.59 s` of GPU kernel time, + - MTP top kernel summary total is about `5.89 s` of GPU kernel time. + + This supports the graph/batch-shape hypothesis and rules out a purely + host-side or no-draft explanation. + +## Task 4: Disposition + +- [x] **Step 1: Record root-cause hypothesis as supported** + + Phase 16 supports the Phase 15 root cause: current MTP serving loses the + existing paged decode graph-reuse advantage and does substantially more GPU + work, so it is not a viable GB10 parity lever as implemented. + +- [x] **Step 2: Scope the only plausible code follow-up** + + Do not tune MTP draft parameters first. A source phase would need to inspect + `tools/server/server-context.cpp` speculative batch construction and + `llama-graph` reuse keys to answer: + + - whether verification batches can be bucketed/reused like pure decode, + - whether MTP draft/verify rows force graph rebuilds by changing output rows + per sequence, + - whether target verification can be separated from normal decode graph reuse + without breaking rollback or greedy equivalence. + + If those answers are negative, leave MTP default-off and closed for GB10. + +## Self-Review + +- No source patch was made. +- The profile used `--cuda-graph-trace=node`. +- The result narrows the next work to graph/batch-shape mechanics. 
diff --git a/docs/superpowers/plans/2026-07-01-mtp-graph-shape-feasibility-phase17.md b/docs/superpowers/plans/2026-07-01-mtp-graph-shape-feasibility-phase17.md new file mode 100644 index 000000000000..837e42794a71 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-mtp-graph-shape-feasibility-phase17.md @@ -0,0 +1,123 @@ +# MTP Graph-Shape Feasibility Phase 17 Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use +> superpowers:systematic-debugging before proposing source changes. Steps use +> checkbox (`- [ ]`) syntax for tracking. + +**Goal:** decide whether Phase 16's MTP graph-reuse loss has a small, +maintainable source fix. + +**Architecture:** use read-only code inspection first. Split the problem into +server speculative batch construction and graph-reuse keying. Do not patch until +the shape mechanics are clear. + +**Tech Stack:** llama.cpp `tools/server`, `src/llama-graph.*`, +`ggml-cuda` graph reuse, LocalAI paged docs. + +--- + +## Task 1: Parallel Read-Only Inspection + +- [x] **Step 1: Inspect server speculative batch construction** + + Finding: + + - Normal decode appends one `output=true` row per generating slot. + - Speculative/MTP verification appends `K + 1` `output=true` rows per slot, + where `K = spec_draft.size()`. + - `slot.spec_i_batch` stores the absolute logical row indices for those + verification rows. + - Total batch shape becomes: + + ```text + sum(non_spec_slots * 1) + sum(spec_slots * (1 + K_i)) + prompt rows + ``` + + Key source areas: + + - `/home/mudler/_git/llama.cpp/tools/server/server-context.cpp` + around `server_slot::handle_last_sampled_token()`. + - `/home/mudler/_git/llama.cpp/tools/server/server-context.cpp` + around the `slot.handle_last_sampled_token(batch)` call site. + - `/home/mudler/_git/llama.cpp/tools/server/server-context.cpp` + `post_decode()` speculative index validation. 
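The batch-shape arithmetic from Step 1 can be sketched in a few lines. This is an illustrative Python model only: `total_batch_rows` and its parameter names are hypothetical and are not llama.cpp symbols.

```python
def total_batch_rows(n_decode_slots, draft_lengths, n_prompt_rows=0):
    """Model the logical row count of one server batch.

    n_decode_slots: slots doing normal decode (1 output row each).
    draft_lengths:  one entry K per speculative slot; each such slot
                    contributes K + 1 verification rows.
    n_prompt_rows:  prompt (prefill) rows sharing the same batch.
    All names are assumptions made for this sketch.
    """
    if n_decode_slots < 0 or n_prompt_rows < 0:
        raise ValueError("row counts must be non-negative")
    if any(k < 0 for k in draft_lengths):
        raise ValueError("draft lengths must be non-negative")
    verify_rows = sum(1 + k for k in draft_lengths)
    return n_decode_slots + verify_rows + n_prompt_rows


# Eight pure-decode slots versus eight slots that each verify K=3 drafts.
print(total_batch_rows(8, []))       # 8
print(total_batch_rows(0, [3] * 8))  # 32
```

Under this model the row count depends on every slot's `K`, so any slot whose draft length changes also changes the batch dimensions that graph reuse keys on.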
+ +- [x] **Step 2: Inspect graph-reuse blockers** + + Finding: + + - MTP changes hard graph dimensions: + `n_tokens`, `n_seq_tokens`, `n_outputs`, KQ mask shape, position length, and + output-id count. + - `llm_graph_params::allow_reuse` rejects changes in these dimensions. + - Paged attention bucketing stabilizes block-table view dimensions only; it + does not stabilize verification token/output rows. + - CUDA graph reuse still requires copied node/source properties (`ne`, `nb`, + pointers, node count) to match. + +## Task 2: Feasibility Verdict + +- [x] **Step 1: Reject dummy-row padding as a shortcut** + + Padding fake verification rows is not low-risk: + + - rows are real target decode rows, + - rows have real output logits, + - rows feed MTP nextn embedding/state extraction, + - fake rows would mutate KV, positions, sampling indices, and rollback shape. + + This also resembles the previously rejected fixed-slot decode experiment, + where dummy compute cost exceeded graph-reuse recovery. + +- [x] **Step 2: Identify the only small safe hook** + + A read-only shape counter around `server_slot::handle_last_sampled_token()` is + low-conflict and can expose: + + - normal vs speculative rows, + - draft length `K`, + - output rows per sequence, + - `slot.spec_i_batch` range. + + This is useful instrumentation, not a performance fix. + +- [x] **Step 3: Identify the only plausible behavior experiment** + + The least invasive performance experiment is server-side scheduling, not graph + padding: + + - group or defer speculative verification slots by `1 + spec_draft.size()`, + - try to make verification windows repeat shape buckets, + - keep it opt-in and default-off, + - gate with Phase 14 rollback, Phase 15 serving A/B, and pre/post inference + md5/op checks. + + This changes serving scheduling and may regress TTFT or reduce concurrency, so + it needs an explicit kill gate. 
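As a rough illustration of the grouping key in Step 3, the sketch below buckets speculative slots by their verification row count `1 + K`. It shows only the bucketing, not any defer policy. `bucket_by_verify_rows` and the slot-id mapping are hypothetical names for this sketch, not server code.

```python
from collections import defaultdict


def bucket_by_verify_rows(slot_drafts):
    """Group slot ids by verification rows per slot (1 + draft length).

    slot_drafts: mapping of slot id -> draft length K (an assumed shape,
    chosen for this sketch only).
    Returns a dict mapping rows-per-slot -> sorted list of slot ids.
    """
    buckets = defaultdict(list)
    for slot_id, k in sorted(slot_drafts.items()):
        if k < 0:
            raise ValueError("draft length must be non-negative")
        buckets[1 + k].append(slot_id)
    return dict(buckets)


# Slots 0 and 1 drafted three tokens, slot 2 drafted one.
print(bucket_by_verify_rows({0: 3, 1: 3, 2: 1}))  # {4: [0, 1], 2: [2]}
```

A scheduler experiment would still have to decide how long to hold a small bucket, which is where the TTFT and concurrency risk noted above comes from.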
+ +## Task 3: Phase 18 Scope If Pursued + +- [x] **Step 1: Write the source-scope boundary** + + Phase 18 should be split into two incremental patches if it is attempted: + + 1. instrumentation-only: log or count verification shape buckets under a + disabled-by-default env var, no scheduling change, + 2. opt-in scheduler experiment: group/defer MTP verification by draft length. + +- [x] **Step 2: Define stop criteria** + + Stop and reject the source path if: + + - shape counters show high entropy across draft lengths and active slots, + - grouping reduces graph churn but loses more throughput/TTFT than it recovers, + - pre/post md5 or `MUL_MAT_ID` gates drift, + - MTP rollback or normalized greedy-prefix gates fail. + +## Self-Review + +- No source patch was made in this phase. +- The feasibility conclusion is narrower than "optimize MTP": instrument first, + then only consider an opt-in scheduler experiment. +- No default behavior changes are proposed without a separate implementation + phase and gates. diff --git a/docs/superpowers/plans/2026-07-01-mtp-rollback-serving-gates-phase14.md b/docs/superpowers/plans/2026-07-01-mtp-rollback-serving-gates-phase14.md new file mode 100644 index 000000000000..3b4c900ffdc2 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-mtp-rollback-serving-gates-phase14.md @@ -0,0 +1,195 @@ +# MTP Rollback and Serving Gates Phase 14 Plan + +> **For agentic workers:** keep checkboxes current while executing. This phase +> is safety-gated and must not claim an MTP parity win. + +**Goal:** prove that MTP speculative decode can reject drafts without corrupting +Qwen3.6 paged KV or recurrent GDN state. + +**Design:** `docs/superpowers/specs/2026-07-01-mtp-rollback-serving-gates-design.md` + +## Required Safety Gates + +- DGX must have no running docker containers, no `local-ai-worker`, no GPU + compute PIDs, and a free or absent `~/gpu_bench_lock/owner`. 
+- Use `/home/mudler/llama-phase6-source` on DGX and keep it clean unless a + source patch is explicitly required. +- Do not benchmark MTP as a parity win in this phase. +- Do not enable MTP by default in LocalAI or llama-server. + +## Task 1: Preflight and Existing Rollback Gate + +- [x] **Step 1: Confirm DGX is free** + + Result: + + ```text + docker=0 + local_ai_worker=0 + compute=0 + FREE released-by-codex-phase6-mmq-grid 1782860601 + ``` + +- [x] **Step 2: Run recurrent rollback test on actual MoE GGUF** + + Command: + + ```bash + ssh dgx.casa 'cd /home/mudler/llama-phase6-source/build-cuda && + cmake --build . --target test-recurrent-state-rollback -j 8 && + ./bin/test-recurrent-state-rollback \ + -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf \ + -ngl 99 -fa on -c 4096 -b 64 -ub 64 \ + > /home/mudler/bench/phase14_mtp_rollback/recurrent_rollback.out \ + 2> /home/mudler/bench/phase14_mtp_rollback/recurrent_rollback.err' + ``` + + Current evidence from the same command family: + + - Artifact: + `/home/mudler/bench/phase14_mtp_rollback/recurrent_rollback.err`. + - Result: + `main : recurrent rollback checkpoint restored successfully`. + +## Task 2: MTP Greedy-Equivalence Gate + +- [x] **Step 1: Build required binaries** + + Build `llama-completion`, `llama-speculative-simple`, and + `test-recurrent-state-rollback`. + +- [x] **Step 2: Run baseline greedy completion** + + Save stdout/stderr and md5 under + `/home/mudler/bench/phase14_mtp_rollback/greedy_baseline.*`. + + Additional raw text-generation baselines were saved under + `/home/mudler/bench/phase14_mtp_rollback/completion_nocnv_n{8,16,24,32,48}.*` + because `llama-completion` defaults to conversation mode for this model unless + `-no-cnv` is passed. 
+ +- [x] **Step 3: Run MTP speculative completion with the same prompt/seed** + + Use: + + - `--spec-type draft-mtp` + - `--spec-draft-model /home/mudler/bench/q36-35b-a3b-nvfp4.gguf` + - `--spec-draft-ngl 99` + - `--spec-draft-n-max 3` + - `--temp 0 --seed 1` + + Save stdout/stderr and md5 under + `/home/mudler/bench/phase14_mtp_rollback/mtp_greedy_equiv.*`. + +- [x] **Step 4: Compare outputs** + + Exact transcript md5 is not a valid cross-frontend comparator here: + + - `llama-speculative-simple --spec-type none` is not a working no-draft + baseline; it still tries to load an empty draft model and exits with + `failed to load draft model, ''`. + - `--spec-draft-n-max 0` is not a no-draft baseline either; the recorded run + still drafted and accepted tokens (`n_drafted=17`, `n_accept=17`). + - `llama-speculative-simple` counts/emits accepted token groups, so the same + `-n` can produce a longer raw completion than `llama-completion -no-cnv`. + + Normalized raw-output prefix gate passed for `n=8,16,24,32,48`; no run showed + a first differing token. The MTP output had the `llama-completion -no-cnv` + output as a prefix in each case. The `n=32` MTP artifact was + `/home/mudler/bench/phase14_mtp_rollback/mtp_greedy_equiv.out`. + +## Task 3: MTP Partial-Rejection Gate + +- [x] **Step 1: Confirm rejection occurred** + + Parse MTP stderr and require: + + - `n_drafted > 0` + - `n_accept >= 0` + - `n_drafted > n_accept` + + Result from `/home/mudler/bench/phase14_mtp_rollback/mtp_greedy_equiv.err`: + + ```text + n_drafted = 39 + n_accept = 20 + accept = 51.282% + ``` + +- [x] **Step 2: Confirm no backend sampler error** + + Fail if stderr contains: + + ```text + backend sampling requires at most one output token per sequence + ``` + + Result: absent from the MTP stderr. The expected warning was present instead: + `backend draft sampling is disabled for MTP`. 
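The two checks above (rejection occurred, no backend sampler error) could be automated along these lines. This is a minimal sketch, assuming the stderr lines look like the `n_drafted = ...` and `n_accept = ...` lines quoted in Step 1; `check_partial_rejection` is a hypothetical helper, not part of any existing script.

```python
import re

# Error text quoted in Step 2; its presence fails the gate.
SAMPLER_ERROR = "backend sampling requires at most one output token per sequence"


def check_partial_rejection(stderr_text):
    """Return (n_drafted, n_accept) if the partial-rejection gate passes.

    Raises ValueError when the sampler error is present, when the counters
    are missing, or when no draft token was rejected.
    """
    if SAMPLER_ERROR in stderr_text:
        raise ValueError("backend sampler error present in stderr")
    drafted = re.search(r"n_drafted\s*=\s*(\d+)", stderr_text)
    accepted = re.search(r"n_accept\s*=\s*(\d+)", stderr_text)
    if drafted is None or accepted is None:
        raise ValueError("n_drafted / n_accept not found in stderr")
    n_drafted, n_accept = int(drafted.group(1)), int(accepted.group(1))
    # Gate conditions from Step 1: drafts happened and at least one was rejected.
    if not (n_drafted > 0 and n_accept >= 0 and n_drafted > n_accept):
        raise ValueError("no draft rejection observed")
    return n_drafted, n_accept


sample = "n_drafted = 39\nn_accept = 20\naccept = 51.282%\n"
print(check_partial_rejection(sample))  # (39, 20)
```

A run in which every draft was accepted fails this gate by design, because it would not exercise the rollback path.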
+ +- [x] **Step 3: Record whether bounded recurrent rollback is active** + + Record `n_rs_seq` or the log line showing bounded partial sequence removal. + + Result from `/home/mudler/bench/phase14_mtp_rollback/mtp_greedy_equiv.err`: + + ```text + common_context_can_seq_rm: the context supports bounded partial sequence removal + ``` + +## Task 4: Standard Inference Gates + +- [x] **Step 1: Run paged inference gate helper** + + Run: + + ```bash + /tmp/paged-inference-gates.sh + ``` + + Expected: + + - MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense md5 `5951a5b4d624ce891e22ab5fca9bc439`. + - `MUL_MAT_ID` `806/806`. + + Result: + + ```text + moe md5 OK: 8cb0ce23777bf55f92f63d0292c756b0 + dense md5 OK: 5951a5b4d624ce891e22ab5fca9bc439 + 806/806 tests passed + Backend CUDA0: OK + paged inference gates OK + artifacts: /home/mudler/bench/paged_inference_gates/20260701_041117 + ``` + +## Task 5: Disposition + +- [x] **Step 1: If all gates pass** + + Update: + + - `GB10_PARITY_PHASE0_RESULTS.md` + - `VLLM_PARITY_LEVER_MAP.md` + - `PARITY_HANDOFF.md` + + Record that MTP rollback safety is green and Phase 15 can be a serving/API + benchmark, still default-off. + +- [x] **Step 2: If any gate fails** + + Stop before performance benchmarking, save artifacts, and either implement a + narrow fork-first fix or record the failed gate as a blocker for MTP parity. + + Reviewed and not taken. The original exact-md5 wording was too strict for + this example harness, but there was no token divergence after raw-output + normalization. Do not add a production source patch in Phase 14. Carry the + frontend/token accounting finding into Phase 15 and benchmark serving only + behind the same canonical inference gates. + +## Self-Review + +- No placeholders remain. +- Scope is limited to rollback and greedy-equivalence safety. +- Phase 14 does not claim or benchmark speed parity. 
diff --git a/docs/superpowers/plans/2026-07-01-mtp-serving-shape-entropy-phase19.md b/docs/superpowers/plans/2026-07-01-mtp-serving-shape-entropy-phase19.md new file mode 100644 index 000000000000..2bd72509c3cb --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-mtp-serving-shape-entropy-phase19.md @@ -0,0 +1,139 @@ +# MTP Serving Shape Entropy Phase 19 Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use +> superpowers:verification-before-completion before recording the phase result. +> Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** use Phase 18's `LLAMA_SPEC_SHAPE_TRACE=1` instrumentation under real +serving load to decide whether a group/defer-by-draft-length scheduler +experiment is justified. + +**Architecture:** trace-only benchmark. Do not change llama.cpp source or +scheduling policy. Run the existing MTP serving A/B with pre/post canonical +inference gates. + +**Tech Stack:** `paged-mtp-serving-bench.sh`, llama.cpp `llama-server`, DGX +GB10, LocalAI paged patch stack. 
+ +--- + +## Task 1: Run Trace-Only Serving A/B + +- [x] **Step 1: Confirm DGX is free** + + Preflight passed: + + - `docker=0` + - `local_ai_worker=0` + - `compute=0` + +- [x] **Step 2: Run serving harness with shape trace** + + Command shape: + + ```bash + LLAMA_SPEC_SHAPE_TRACE=1 \ + ART=~/bench/phase19_mtp_shape_entropy/20260701_045534 \ + NPL="8 32 128" GEN=64 PTOK=128 \ + /tmp/paged-mtp-serving-bench.sh + ``` + + Artifact: + + - `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534` + +## Task 2: Verify Inference Gates + +- [x] **Step 1: Pre-gate passed** + + Artifact: + + - `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534/gate_pre` + + Result: + + - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0` + - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439` + - `MUL_MAT_ID`: `806/806` + +- [x] **Step 2: Post-gate passed** + + Artifact: + + - `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534/gate_post` + + Result: + + - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0` + - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439` + - `MUL_MAT_ID`: `806/806` + +## Task 3: Analyze Serving Result + +- [x] **Step 1: Compare baseline vs MTP serving throughput** + + | n | baseline decode_agg | MTP decode_agg | MTP / baseline | baseline TTFT ms | MTP TTFT ms | + |---|---------------------|----------------|----------------|------------------|-------------| + | 8 | 245.0 | 95.7 | 39.1% | 1147.2 | 1633.4 | + | 32 | 409.2 | 110.0 | 26.9% | 2710.0 | 4471.5 | + | 128 | 697.2 | 154.0 | 22.1% | 7601.5 | 20310.4 | + + MTP remained materially slower at every concurrency. 
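The `MTP / baseline` column can be reproduced from the two `decode_agg` columns. The sketch below is a small Python check using the table's own numbers; `mtp_share_of_baseline` is a hypothetical helper name.

```python
def mtp_share_of_baseline(baseline_tps, mtp_tps):
    """MTP decode throughput as a percentage of baseline, one decimal place."""
    if baseline_tps <= 0:
        raise ValueError("baseline throughput must be positive")
    return round(100.0 * mtp_tps / baseline_tps, 1)


# (baseline decode_agg, MTP decode_agg) per concurrency, copied from the table.
rows = {8: (245.0, 95.7), 32: (409.2, 110.0), 128: (697.2, 154.0)}
for n, (baseline, mtp) in rows.items():
    print(n, mtp_share_of_baseline(baseline, mtp))
# 8 39.1
# 32 26.9
# 128 22.1
```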
+ +- [x] **Step 2: Parse per-slot draft entropy** + + Artifact: + + - `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534/shape_entropy_summary.tsv` + + Result: + + | window | verify slots | draft counts | top draft share | unique `batch_before` | + |--------|--------------|--------------|-----------------|-----------------------| + | n8 | 162 | `{1: 4, 2: 2, 3: 156}` | 96.3% | 15 | + | n32 | 610 | `{1: 8, 2: 11, 3: 591}` | 96.9% | 96 | + | n128 | 2353 | `{1: 40, 2: 49, 3: 2264}` | 96.2% | 479 | + + Draft length is already overwhelmingly `3`. Grouping by draft length has + little to recover. + +- [x] **Step 3: Parse per-step aggregate shapes** + + Artifact: + + - `/home/mudler/bench/phase19_mtp_shape_entropy/20260701_045534/step_shape_summary.tsv` + + Result: + + | window | steps | unique total rows | top full-shape rows | + |--------|-------|-------------------|---------------------| + | n8 | 26 | 12 | `32` rows for 14 steps | + | n32 | 32 | 20 | `128` rows for 13 steps | + | n128 | 37 | 34 | `512` rows for 4 steps | + + Full in-flight steps already consist mostly of all-`draft=3` vectors. The + remaining shape churn is active-slot/tail churn plus the speculative `K + 1` + output-row expansion itself, not a draft-length scheduling problem. + +## Task 4: Decision + +- [x] **Step 1: Reject Phase 20 scheduler experiment for now** + + Do not build the group/defer-by-draft-length scheduler experiment on this + evidence: + + - draft length is already stable (`draft=3` >96% of verify slots), + - MTP still regresses decode throughput to 22-39% of baseline, + - TTFT gets worse at every concurrency, + - per-step shape variation is dominated by active-slot/tail churn and row + expansion, not mixed draft lengths. + + The next useful MTP work would need a deeper target-verify graph/state design, + not a small server scheduling shortcut. + +## Self-Review + +- No source behavior changed in this phase. +- Pre/post md5 and op gates passed. 
+- The phase result moves the plan by rejecting the scheduler follow-up rather + than leaving it as an attractive but unsupported idea. diff --git a/docs/superpowers/plans/2026-07-01-mtp-serving-throughput-phase15.md b/docs/superpowers/plans/2026-07-01-mtp-serving-throughput-phase15.md new file mode 100644 index 000000000000..0fa142b9925a --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-mtp-serving-throughput-phase15.md @@ -0,0 +1,191 @@ +# MTP Serving Throughput Phase 15 Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use +> superpowers:subagent-driven-development or superpowers:executing-plans to +> implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for +> tracking. + +**Goal:** measure whether Phase 14's safe MTP path improves real +`llama-server` serving throughput on GB10. + +**Architecture:** use direct `llama-server` first, not LocalAI, so the benchmark +isolates llama.cpp serving behavior. Compare two same-shape arms: baseline with +no speculative decoding and MTP with `--spec-type draft-mtp`. Run canonical +inference gates before and after the A/B. + +**Tech Stack:** llama.cpp `llama-server`, DGX GB10, `h2h_cli3.py`, +`paged-inference-gates.sh`. + +--- + +## Files + +- Create: `backend/cpp/llama-cpp-localai-paged/paged-mtp-serving-bench.sh` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_FINAL.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` + +## Task 1: Confirm Server MTP Wiring + +- [x] **Step 1: Dispatch independent codebase checks** + + Two explorer agents inspected: + + - llama.cpp server speculative/MTP wiring. + - existing serving benchmark harnesses and safety-gate discipline. + +- [x] **Step 2: Record startup-only control** + + Finding: + + - `llama-server` supports MTP when started with `--spec-type draft-mtp`. 
+ - HTTP request JSON cannot enable speculation per request because the + speculative request fields in `tools/server/server-schema.cpp` are under + `#if 0`. + +- [x] **Step 3: Run a one-request server smoke** + + Artifact: + + - `/home/mudler/bench/phase15_mtp_serving_smoke` + + Evidence: + + ```text + common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp' + common_context_can_seq_rm: the context supports bounded partial sequence removal + timings.draft_n = 33 + timings.draft_n_accepted = 19 + ``` + +## Task 2: Add Repeatable DGX Runner + +- [x] **Step 1: Create runner** + + Created: + + - `backend/cpp/llama-cpp-localai-paged/paged-mtp-serving-bench.sh` + + Responsibilities: + + - check docker, `local-ai-worker`, compute PIDs, and GPU lock owner, + - run pre/post `paged-inference-gates.sh`, + - run baseline and MTP `llama-server` arms, + - drive `/v1/completions` with `/home/mudler/bench/h2h_cli3.py`, + - capture server logs, client JSON, MTP acceptance lines, and a summary TSV. + +- [x] **Step 2: Fix lock ordering** + + First attempt stopped before benchmarking because the runner acquired the GPU + lock and then called `paged-inference-gates.sh`, whose own preflight correctly + rejects a non-free lock owner. + + Fix: run the pre-gate before acquiring the benchmark lock and the post-gate + after releasing it. 
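The ordering fix can be pictured with a small in-memory model. This is only a sketch of the control flow: the real runner is a shell script, and `gpu_lock`, `gate`, and `run_serving_ab` are hypothetical names introduced here.

```python
from contextlib import contextmanager


@contextmanager
def gpu_lock(state, events, owner):
    """Hold the benchmark lock for the duration of the with-block."""
    state["owner"] = owner
    events.append("lock")
    try:
        yield
    finally:
        state["owner"] = None
        events.append("unlock")


def gate(state, events, name):
    """Stand-in for the inference gate: its preflight rejects a held lock."""
    if state["owner"] is not None:
        raise RuntimeError(f"gate {name}: lock held by {state['owner']}")
    events.append(f"gate:{name}")


def run_serving_ab(state, events):
    gate(state, events, "pre")             # runs while the lock is still free
    with gpu_lock(state, events, "bench"):
        events.append("bench")             # baseline and MTP arms go here
    gate(state, events, "post")            # runs after the lock is released


state, events = {"owner": None}, []
run_serving_ab(state, events)
print(events)  # ['gate:pre', 'lock', 'bench', 'unlock', 'gate:post']
```

Calling `gate` inside the `with` block reproduces the first attempt's failure, since the gate's own preflight sees a non-free owner.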
+ +## Task 3: Run Serving A/B + +- [x] **Step 1: Run canonical pre-gate** + + Artifact: + + - `/home/mudler/bench/phase15_mtp_serving/20260701_042005/gate_pre` + + Result: + + ```text + moe md5 OK: 8cb0ce23777bf55f92f63d0292c756b0 + dense md5 OK: 5951a5b4d624ce891e22ab5fca9bc439 + 806/806 tests passed + Backend CUDA0: OK + paged inference gates OK + ``` + +- [x] **Step 2: Run baseline and MTP arms** + + Command shape: + + ```bash + NPL="8 32 128" PTOK=128 GEN=128 CTX=131072 PARALLEL=128 \ + ~/paged-mtp-serving-bench.sh + ``` + + Artifact: + + - `/home/mudler/bench/phase15_mtp_serving/20260701_042005` + + Summary: + + ```text + arm n agg_tps decode_agg_tps decode_perseq_tps ttft_mean_ms wall_s + baseline 8 192.5 247.8 30.70 1181.1 5.318 + mtp 8 92.9 109.8 14.26 1691.5 11.017 + baseline 32 305.4 406.0 12.02 2762.2 13.412 + mtp 32 95.8 111.7 3.61 4545.6 42.727 + baseline 128 429.5 662.4 4.31 7747.2 38.144 + mtp 128 100.3 138.5 0.97 20385.7 163.289 + ``` + +- [x] **Step 3: Confirm MTP actually drafted** + + MTP server log showed: + + ```text + common_speculative_impl_draft_mtp: - n_max=3, n_min=0, p_min=0.00 + statistics draft-mtp: #gen tokens = 17293, #acc tokens = 15493 + ``` + + Acceptance was high enough that this is not a no-draft false negative. + +- [x] **Step 4: Run canonical post-gate** + + Artifact: + + - `/home/mudler/bench/phase15_mtp_serving/20260701_042005/gate_post` + + Result: + + ```text + moe md5 OK: 8cb0ce23777bf55f92f63d0292c756b0 + dense md5 OK: 5951a5b4d624ce891e22ab5fca9bc439 + 806/806 tests passed + Backend CUDA0: OK + paged inference gates OK + ``` + +## Task 4: Disposition + +- [x] **Step 1: Reject current MTP serving as a parity lever** + + Current `llama-server` MTP is slower at every tested concurrency: + + - `n=8`: decode aggregate `247.8 -> 109.8` tok/s. + - `n=32`: decode aggregate `406.0 -> 111.7` tok/s. + - `n=128`: decode aggregate `662.4 -> 138.5` tok/s. 
+ +- [x] **Step 2: Record likely root cause** + + Baseline logs show heavy graph reuse in the serving run (`graphs reused = 361` + in the `n=128` tail). MTP logs show `graphs reused = 1` and per-slot eval + around `900-1200 ms/token` at high concurrency. The working hypothesis is that + MTP verification/draft batch shape churn defeats the paged decode graph-reuse + wins, and the extra target verification work dominates despite high acceptance. + +- [x] **Step 3: Scope follow-up** + + Do not continue by tuning `spec-draft-n-max` blindly. The next scoped phase, + if pursued, must first inspect MTP serving graph reuse and batch shapes: + + - confirm whether speculative verification batches bypass the reusable + pure-decode graph key, + - measure with `nsys --cuda-graph-trace=node`, + - test whether MTP can share the default decode graph path or must remain a + non-parity feature on GB10. + +## Self-Review + +- No placeholders remain. +- Phase 15 does not enable MTP by default. +- Phase 15 keeps pre/post md5 and `test-backend-ops` gates. +- Result is a rejected serving-throughput lever, not a parity win. diff --git a/docs/superpowers/plans/2026-07-01-mtp-shape-trace-phase18.md b/docs/superpowers/plans/2026-07-01-mtp-shape-trace-phase18.md new file mode 100644 index 000000000000..c4aaa0c44618 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-mtp-shape-trace-phase18.md @@ -0,0 +1,112 @@ +# MTP Shape Trace Phase 18 Plan + +> **For agentic workers:** REQUIRED SUB-SKILLS: Use +> superpowers:test-driven-development before source edits and +> superpowers:verification-before-completion before commit. Steps use checkbox +> (`- [ ]`) syntax for tracking. + +**Goal:** add a default-off, inference-safe trace for speculative/MTP server +batch shape entropy before considering any scheduler experiment. + +**Architecture:** keep this as a server-only instrumentation patch in +`server_slot::handle_last_sampled_token()`. 
Do not change speculative +acceptance, rollback, logits, KV writes, graph-reuse keys, or scheduling. + +**Tech Stack:** llama.cpp `tools/server/server-context.cpp`, LocalAI paged +patch stack, DGX GB10 validation. + +--- + +## Task 1: Red Check + +- [x] **Step 1: Prove the trace does not already exist** + + Ran a direct MTP `llama-server` request on DGX with + `LLAMA_SPEC_SHAPE_TRACE=1` before the source patch. + + Result: + + - no `spec shape:` lines were emitted, + - artifact: `/home/mudler/bench/phase18_mtp_shape_trace_red`. + +## Task 2: Instrumentation Patch + +- [x] **Step 1: Add an env-gated trace** + + Added `LLAMA_SPEC_SHAPE_TRACE=1` logging in + `server_slot::handle_last_sampled_token()`: + + - normal decode rows: `kind=decode`, `rows=1`, `outputs=1`, `draft=0`, + - speculative verification rows: `kind=verify`, `rows=K+1`, + `outputs=K+1`, `draft=K`, `spec_i_first`, `spec_i_last`. + + The env var is default-off and does not alter batch contents. + +- [x] **Step 2: Keep the patch incremental** + + Local fork commit: + + - `fb9402661 feat(server): trace speculative batch shapes` + + LocalAI patch: + + - `0055-feat-server-trace-speculative-batch-shapes.patch` + +## Task 3: Green Checks + +- [x] **Step 1: Build and validate trace behavior on DGX** + + DGX mirror commit: + + - `f2521ab12 feat(server): trace speculative batch shapes` + + Build: + + - `cmake --build build-cuda --target llama-server -j$(nproc)` + + Trace-enabled result: + + ```text + spec shape: kind=verify batch_before=0 rows=4 outputs=4 draft=3 spec_i_first=0 spec_i_last=3 pos0=5 slot_tokens=5 + spec shape: kind=verify batch_before=0 rows=4 outputs=4 draft=3 spec_i_first=0 spec_i_last=3 pos0=6 slot_tokens=6 + spec shape: kind=verify batch_before=0 rows=3 outputs=3 draft=2 spec_i_first=0 spec_i_last=2 pos0=9 slot_tokens=9 + ``` + + Trace-disabled result: + + ```text + trace disabled: no spec shape lines + ``` + + Artifact: + + - `/home/mudler/bench/phase18_mtp_shape_trace_green` + +- [x] **Step 
2: Run canonical inference gates** + + Artifact: + + - `/home/mudler/bench/phase18_mtp_shape_trace_green/gate_after` + + Result: + + - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0` + - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439` + - `MUL_MAT_ID`: `806/806` + +## Task 4: Follow-Up Boundary + +- [x] **Step 1: Scope Phase 19** + + Use the trace to measure shape entropy under real serving load before any + behavior change. A Phase 19 scheduler experiment is allowed only if the trace + shows repeatable draft-length buckets worth grouping. It must be opt-in, + default-off, and killed by TTFT/throughput regression, md5/op drift, or MTP + rollback/prefix failure. + +## Self-Review + +- No default behavior changed. +- The trace is read-only with respect to batch contents and slot state. +- The post-patch canonical md5/op gates passed, so this instrumentation did not + break inferencing on the gated paths. diff --git a/docs/superpowers/plans/2026-07-01-mtp-verify-cost-phase62.md b/docs/superpowers/plans/2026-07-01-mtp-verify-cost-phase62.md new file mode 100644 index 000000000000..0ca49dad836f --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-mtp-verify-cost-phase62.md @@ -0,0 +1,406 @@ +# MTP Verify Cost Phase 62 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** decide whether MTP can still be a GB10 parity lever by separating draft acceptance from target-verify graph cost, while proving default inference remains unchanged. + +**Architecture:** use the existing llama.cpp MTP implementation and existing server-side speculative telemetry first. Do not change inference behavior in this phase. 
Run default greedy-md5 and backend-op gates before and after each DGX serving sweep, then compare baseline, current MTP, and bounded MTP configurations on throughput, graph reuse, acceptance rate, mean acceptance length, per-position acceptance, and output-row expansion. + +**Tech Stack:** llama.cpp `llama-server`, LocalAI paged harnesses, DGX GB10, `h2h_cli3.py`, `paged-inference-gates.sh`, existing `LLAMA_SPEC_SHAPE_TRACE=1` instrumentation. + +--- + +## Files + +- Modify: `docs/superpowers/plans/2026-07-01-mtp-verify-cost-phase62.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Possible modify after this phase, only if Task 4 approves it: `/home/mudler/_git/llama.cpp/tools/server/server-context.cpp` +- Possible test after this phase, only if Task 4 approves source changes: `/home/mudler/_git/llama.cpp/tests/` + +## Existing Evidence To Respect + +- Phase 15 proved MTP drafts are accepted but serving throughput regresses: + + ```text + statistics draft-mtp: #gen tokens = 17293, #acc tokens = 15493 + n=128 baseline decode_agg_tps 662.4 + n=128 mtp decode_agg_tps 138.5 + ``` + +- Phase 19 proved draft length entropy is not the obvious culprit: + + ```text + n128 draft counts: {1: 40, 2: 49, 3: 2264} + top draft share: 96.2% + unique total rows: 34 + ``` + +- Current llama.cpp already prints: + + ```text + draft acceptance = ... mean acceptance length = ... acceptance rate per position = (...) + statistics draft-mtp: ... #gen tokens = ... #acc tokens = ... #mean acc len = ... #acc rate/pos = (...) + graphs reused = ... + ``` + + Therefore this phase must not waste time adding duplicate acceptance counters unless the sweep shows a missing attribution that blocks the decision. 
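
Because those counters already exist, the acceptance ratio can be derived from the server log instead of adding new instrumentation. The following is a hedged sketch: it assumes the `#gen tokens = N` and `#acc tokens = M` fields appear on one line exactly as in the Phase 15 excerpt, and `draft_acceptance` is a hypothetical helper name.

```python
import re

# Only the two fields quoted in the Phase 15 log excerpt are relied on.
_GEN = re.compile(r"#gen tokens\s*=\s*(\d+)")
_ACC = re.compile(r"#acc tokens\s*=\s*(\d+)")


def draft_acceptance(line):
    """Return (generated, accepted, ratio), or None if a field is missing."""
    gen = _GEN.search(line)
    acc = _ACC.search(line)
    if not gen or not acc:
        return None
    generated = int(gen.group(1))
    accepted = int(acc.group(1))
    ratio = accepted / generated if generated else 0.0
    return generated, accepted, ratio


if __name__ == "__main__":
    sample = "statistics draft-mtp: #gen tokens = 17293, #acc tokens = 15493"
    print(draft_acceptance(sample))
```

For the Phase 15 line this gives an acceptance ratio of about `0.896`, which is consistent with the "accepted but slower" reading.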
+ +## Task 1: Safety Preflight And Baseline Gates + +- [x] **Step 1: Confirm DGX is idle** + + Run on `dgx.casa`: + + ```bash + docker_count=$(docker ps -q | wc -l) + local_ai=$(docker ps --format '{{.Names}}' | grep -c local-ai-worker || true) + compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed '/^$/d' | wc -l) + owner=FREE-no-lock-file + test -f "$HOME/gpu_bench_lock/owner" && owner=$(cat "$HOME/gpu_bench_lock/owner") + printf 'docker=%s\nlocal_ai_worker=%s\ncompute=%s\nowner=%s\n' "$docker_count" "$local_ai" "$compute" "$owner" + ``` + + Expected: + + ```text + docker=0 + local_ai_worker=0 + compute=0 + owner=FREE... + ``` + + Observed on 2026-07-01: + + ```text + docker=0 + local_ai_worker=0 + compute=0 + owner=FREE released-by-codex-phase53-budget-sweep 1782897825 + ``` + +- [x] **Step 2: Run default inference gates before any MTP sweep** + + Run on `dgx.casa`: + + ```bash + ART_ROOT="$HOME/bench/phase62_mtp_verify_cost/$(date +%Y%m%d_%H%M%S)" + mkdir -p "$ART_ROOT" + ART="$ART_ROOT/gate_pre" "$HOME/paged-inference-gates.sh" > "$ART_ROOT/gate_pre.log" 2>&1 + cat "$ART_ROOT/gate_pre.log" + printf '%s\n' "$ART_ROOT" > "$HOME/bench/phase62_mtp_verify_cost/latest_artifact.txt" + ``` + + Expected: + + ```text + moe md5 OK: 8cb0ce23777bf55f92f63d0292c756b0 + dense md5 OK: 5951a5b4d624ce891e22ab5fca9bc439 + 806/806 tests passed + Backend CUDA0: OK + paged inference gates OK + ``` + + Artifact: + + ```text + /home/mudler/bench/phase62_mtp_verify_cost/20260701_134125/gate_pre + ``` + + Observed: + + ```text + moe md5 OK: 8cb0ce23777bf55f92f63d0292c756b0 + dense md5 OK: 5951a5b4d624ce891e22ab5fca9bc439 + 806/806 tests passed + Backend CUDA0: OK + paged inference gates OK + ``` + +- [x] **Step 3: Record the artifact path in this plan** + + Add the concrete artifact path and the gate output summary under `Task 1` after the run finishes. 
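
The gate summary recorded in Task 1 can be checked mechanically. This is a small sketch, assuming the gate log contains the five expected strings verbatim somewhere on a line (possibly with a prefix); `missing_gate_lines` is a hypothetical helper and not part of `paged-inference-gates.sh`.

```python
# Expected strings copied from the "Expected" block of Task 1 Step 2.
EXPECTED_GATE_LINES = (
    "moe md5 OK: 8cb0ce23777bf55f92f63d0292c756b0",
    "dense md5 OK: 5951a5b4d624ce891e22ab5fca9bc439",
    "806/806 tests passed",
    "Backend CUDA0: OK",
    "paged inference gates OK",
)


def missing_gate_lines(log_text):
    """Return the expected strings that never appear in the log."""
    lines = log_text.splitlines()
    return [
        expected
        for expected in EXPECTED_GATE_LINES
        if not any(expected in line for line in lines)
    ]


if __name__ == "__main__":
    import sys

    # Usage: python check_gate.py "$ART_ROOT/gate_pre.log"
    with open(sys.argv[1], errors="ignore") as fh:
        missing = missing_gate_lines(fh.read())
    print("gate OK" if not missing else f"gate FAILED, missing: {missing}")
```

An empty result means the log matches the canonical gate output; any md5 drift shows up as a missing line rather than a silent pass.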
+ +## Task 2: Run A Small MTP Verify-Cost Sweep + +- [x] **Step 1: Let the serving harness acquire the GPU lock** + + Do not write `~/gpu_bench_lock/owner` manually before calling + `paged-mtp-serving-bench.sh`. That harness runs its own preflight, takes the + owner-file lock after its pre-gate, and releases it before its post-gate. + Confirm the lock is still free: + + ```bash + owner=FREE-no-lock-file + test -f "$HOME/gpu_bench_lock/owner" && owner=$(cat "$HOME/gpu_bench_lock/owner") + printf 'owner=%s\n' "$owner" + ``` + + Observed before harness start: + + ```text + owner=FREE released-by-codex-phase53-budget-sweep 1782897825 + ``` + +- [x] **Step 2: Run baseline and MTP arms with shape trace** + + Use the existing harness first so Phase 62 is comparable to Phases 15 and 19: + + ```bash + ART="$(cat "$HOME/bench/phase62_mtp_verify_cost/latest_artifact.txt")" + LLAMA_SPEC_SHAPE_TRACE=1 \ + ART="$ART" \ + NPL="8 32 128" \ + GEN=64 \ + PTOK=128 \ + SRC="$HOME/llama-phase6-source" \ + "$HOME/paged-mtp-serving-bench.sh" + ``` + + Expected files: + + ```text + $ART/summary.tsv + $ART/baseline/server.log + $ART/mtp/server.log + $ART/mtp/spec_lines.txt + $ART/gate_pre.log + $ART/gate_post.log + ``` + + Artifact: + + ```text + /home/mudler/bench/phase62_mtp_verify_cost/20260701_134125 + ``` + + Summary: + + ```text + arm n decode_agg_tps ttft_mean_ms wall_s + baseline 8 248.5 1150.4 3.214 + mtp 8 104.4 1682.9 6.591 + baseline 32 411.8 2607.9 8.116 + mtp 32 112.8 4444.7 24.078 + baseline 128 696.5 7425.2 24.570 + mtp 128 148.1 20155.8 99.787 + ``` + +- [x] **Step 3: Confirm the GPU lock was released** + + Run after the harness exits: + + ```bash + pkill -f 'llama-server.*8097' || true + owner=FREE-no-lock-file + test -f "$HOME/gpu_bench_lock/owner" && owner=$(cat "$HOME/gpu_bench_lock/owner") + printf 'owner=%s\n' "$owner" + ``` + + Expected: `owner=FREE...`. 
+ + Observed: + + ```text + docker=0 + local_ai_worker=0 + compute=0 + owner=FREE released-by-codex-phase15-mtp-serving-bench 1782906420 + ``` + +- [x] **Step 4: Confirm the post-gate remained green** + + Inspect: + + ```bash + ART="$(cat "$HOME/bench/phase62_mtp_verify_cost/latest_artifact.txt")" + cat "$ART/gate_post.log" + ``` + + Expected: + + ```text + moe md5 OK: 8cb0ce23777bf55f92f63d0292c756b0 + dense md5 OK: 5951a5b4d624ce891e22ab5fca9bc439 + 806/806 tests passed + Backend CUDA0: OK + paged inference gates OK + ``` + + Observed: + + ```text + moe md5 OK: 8cb0ce23777bf55f92f63d0292c756b0 + dense md5 OK: 5951a5b4d624ce891e22ab5fca9bc439 + 806/806 tests passed + Backend CUDA0: OK + paged inference gates OK + ``` + +## Task 3: Parse Acceptance, Graph Reuse, And Output-Row Cost + +- [x] **Step 1: Extract throughput** + + Run on `dgx.casa`: + + ```bash + ART="$(cat "$HOME/bench/phase62_mtp_verify_cost/latest_artifact.txt")" + cat "$ART/summary.tsv" + ``` + + Record `decode_agg_tps`, `ttft_mean_ms`, and `wall_s` for each arm and concurrency. 
+ + MTP decode ratios: + + ```text + n=8 104.4 / 248.5 = 0.420 + n=32 112.8 / 411.8 = 0.274 + n=128 148.1 / 696.5 = 0.213 + ``` + +- [x] **Step 2: Extract MTP acceptance and graph reuse** + + Run on `dgx.casa`: + + ```bash + ART="$(cat "$HOME/bench/phase62_mtp_verify_cost/latest_artifact.txt")" + grep -E 'draft acceptance|statistics[[:space:]]+draft-mtp|graphs reused' "$ART/mtp/server.log" > "$ART/mtp/phase62_mtp_stats.txt" || true + cat "$ART/mtp/phase62_mtp_stats.txt" + ``` + + Record: + + ```text + draft_n generated + draft_n accepted + draft acceptance ratio + mean acceptance length + acceptance rate per position + graphs reused + ``` + + Final cumulative MTP statistics: + + ```text + #gen tokens = 9340 + #acc tokens = 7372 + acceptance = 0.78929 + #mean acc len = 3.33 + #acc rate/pos = (0.877, 0.767, 0.691) + graphs reused = 1 + ``` + +- [x] **Step 3: Parse shape-trace row expansion** + + Run on `dgx.casa`: + + ```bash + ART="$(cat "$HOME/bench/phase62_mtp_verify_cost/latest_artifact.txt")" + python3 - "$ART/mtp/server.log" > "$ART/phase62_shape_rows.tsv" <<'PY' + import re + import sys + from collections import Counter + path = sys.argv[1] + total_rows = Counter() + draft_rows = Counter() + for line in open(path, errors="ignore"): + if "spec shape:" not in line: + continue + m = re.search(r"batch_after=(\d+)", line) + if m: + total_rows[int(m.group(1))] += 1 + m = re.search(r"draft=(\d+)", line) + if m: + draft_rows[int(m.group(1))] += 1 + print("kind\tvalue\tcount") + for value, count in sorted(total_rows.items()): + print(f"batch_after\t{value}\t{count}") + for value, count in sorted(draft_rows.items()): + print(f"draft\t{value}\t{count}") + PY + cat "$ART/phase62_shape_rows.tsv" + ``` + + The initial parser used the wrong uppercase marker. 
The real marker is: + + ```bash + grep -m 20 'spec shape:' "$ART/mtp/server.log" + ``` + + Final shape summary: + + ```text + rows total 3212; rows=4: 3070 (95.6%) + draft total 3212; draft=3: 3070 (95.6%) + batch_after total 3212; unique values 495 + ``` + +## Task 4: Decide Whether A Source Patch Is Justified + +- [x] **Step 1: Apply the MTP keep/reject rule** + + Keep MTP as a candidate only if all of these are true: + + ```text + acceptance ratio >= 0.70 + mean acceptance length >= 2.0 + MTP decode_agg_tps >= 0.75 * baseline decode_agg_tps for at least n=8 or n=32 + post-gate md5 and MUL_MAT_ID remain green + ``` + + Reject another MTP implementation phase for now if acceptance is high but throughput remains below `0.75x` baseline, because that means verification/output-row cost still dominates. + + Decision: reject another MTP implementation phase for now. Acceptance is high + (`0.789`, mean acceptance length `3.33`) but throughput is only `0.420x`, + `0.274x`, and `0.213x` baseline decode at `n=8`, `n=32`, and `n=128`. + +- [x] **Step 2: If rejected, mark the reason in docs** + + Update: + + - `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` + - `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` + - `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` + + Required wording: + + ```text + Phase62 kept default inference green with md5/op gates, but MTP remains rejected unless a later design removes target-verify/output-row graph cost. Do not tune n_max blindly. + ``` + +- [x] **Step 3: If kept, scope Phase 63 as a TDD source patch** + + The keep rule did not pass, so no Phase 63 MTP source patch is scoped. + +## Task 5: Commit Documentation + +- [ ] **Step 1: Verify docs are clean** + + Run locally: + + ```bash + git diff --check + ``` + + Expected: no output. 
+ +- [ ] **Step 2: Commit the phase docs** + + Run locally: + + ```bash + git add docs/superpowers/plans/2026-07-01-mtp-verify-cost-phase62.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \ + backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md + git commit -m "docs(paged): scope MTP verify-cost phase" \ + -m "Assisted-by: Codex:gpt-5" + ``` + +## Self-Review + +- [x] No source behavior changes are planned before measurement. +- [x] The phase explicitly gates default inference with MoE md5, dense md5, and backend op checks before and after. +- [x] The plan uses current code reality: acceptance and per-position stats already exist. +- [x] The go/no-go rule prevents blind MTP `n_max` tuning after Phase 15 and Phase 19. diff --git a/docs/superpowers/plans/2026-07-01-mul-mat-route-trace-phase35.md b/docs/superpowers/plans/2026-07-01-mul-mat-route-trace-phase35.md new file mode 100644 index 000000000000..6f0adeb4a416 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-mul-mat-route-trace-phase35.md @@ -0,0 +1,54 @@ +# Phase 35: Regular MUL_MAT Route Trace + +**Goal:** Split the projection-heavy regular `MUL_MAT` serving bucket into concrete dispatch routes before attempting a projection optimization. + +**Scope:** llama.cpp fork first, then LocalAI patch `0061`. Instrumentation only; no route or numeric behavior change. + +## Plan + +- [x] Inspect regular `ggml_cuda_mul_mat` dispatch order and projection bucket docs. +- [x] Dispatch sidecar explorers for llama.cpp route context and vLLM projection context. +- [x] Add failing host-only tests for regular `MUL_MAT` route classification. +- [x] Implement `ggml_cuda_mul_mat_route_shape_make()` and formatter. +- [x] Wire default-off `LLAMA_MUL_MAT_ROUTE_TRACE=` around regular `MUL_MAT`. +- [x] Build and run `test-cuda-mmq-shape-trace` locally. 
+- [x] Build `llama-server`, `llama-completion`, `test-backend-ops`, and `test-cuda-mmq-shape-trace` on DGX. +- [x] Run default-off and trace-enabled md5/op gates. +- [x] Run n128 serving route trace and parse route/type/shape counts. +- [x] Run post-serving md5/op gates. +- [x] Commit fork and DGX mirror, export LocalAI patch `0061`. +- [x] Update README, parity docs, handoff, and patch maintenance. +- [x] Re-run strict patch-series mirror invariant. + +## Results + +Artifact: `/home/mudler/bench/phase35_mul_mat_route_trace/20260701_074359`. + +Commits: + +- Fork: `486c28c63 feat(cuda): trace mul mat routes` +- DGX mirror: `18f7ad005 feat(cuda): trace mul mat routes` +- LocalAI patch: `backend/cpp/llama-cpp-localai-paged/patches/paged/0061-feat-cuda-trace-mul-mat-routes.patch` + +Gates: + +- Default-off MoE md5: `8cb0ce23777bf55f92f63d0292c756b0` +- Default-off dense md5: `5951a5b4d624ce891e22ab5fca9bc439` +- Trace-enabled MoE md5: `8cb0ce23777bf55f92f63d0292c756b0` +- Trace-enabled dense md5: `5951a5b4d624ce891e22ab5fca9bc439` +- Post-serving MoE md5: `8cb0ce23777bf55f92f63d0292c756b0` +- Post-serving dense md5: `5951a5b4d624ce891e22ab5fca9bc439` +- `MUL_MAT`: `1146/1146` in default, trace-enabled, and post-serving gates +- `MUL_MAT_ID`: `806/806` in default, trace-enabled, and post-serving gates + +n128 route trace: + +- `mat_f`: 2888 +- `op_cublas`: 2292 +- `mmq`: 1328 +- `vec_q`: 1214 +- `vec_f`: 470 + +BF16 (`type=30`) was the largest traced type: 3965 calls, led by `mat_f=2485` and `op_cublas=1330` (3815 of the 3965 between them). Top BF16 shapes were `mat_f ne1=12` 775, `op_cublas ne1=18` 760, and `mat_f ne1=8` 570. + +Decision: next projection work should add cuBLAS/MMF subroute detail or test a narrow BF16 route policy for generic `op_cublas` shapes. Do not spend effort on batched cuBLAS for this measured n128 serving slice.
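
The n128 route counts above are easier to compare as shares of the traced total. The sketch below only re-derives figures from the list in this section; `ROUTE_COUNTS` and `route_shares` are illustrative names, not part of the trace tooling.

```python
# Route counts copied from the "n128 route trace" list.
ROUTE_COUNTS = {
    "mat_f": 2888,
    "op_cublas": 2292,
    "mmq": 1328,
    "vec_q": 1214,
    "vec_f": 470,
}


def route_shares(counts):
    """Return (total, {route: fraction}) ordered by descending count."""
    total = sum(counts.values())
    ordered = sorted(counts.items(), key=lambda kv: -kv[1])
    return total, {route: count / total for route, count in ordered}


if __name__ == "__main__":
    total, shares = route_shares(ROUTE_COUNTS)
    print(f"total traced calls: {total}")
    for route, share in shares.items():
        print(f"{route}: {share:.1%}")
```

The five routes sum to 8192 traced calls, with `mat_f` and `op_cublas` together covering a little over 63% of them.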
diff --git a/docs/superpowers/plans/2026-07-01-patch-series-mirror-invariant-phase22.md b/docs/superpowers/plans/2026-07-01-patch-series-mirror-invariant-phase22.md new file mode 100644 index 000000000000..149d84488f75 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-patch-series-mirror-invariant-phase22.md @@ -0,0 +1,78 @@ +# Patch Series Mirror Invariant Phase 22 Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use +> superpowers:verification-before-completion before recording the phase result. +> Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** prove the LocalAI `patches/paged/` series still reconstructs the +canonical llama.cpp fork tree after adding patch `0055`. + +**Architecture:** use the same strict `git apply` method as the LocalAI +Makefile. Apply every on-disk paged patch to a fresh worktree at +`LLAMA_VERSION`, then compare the resulting tree hash with the fork branch HEAD. + +**Tech Stack:** Git worktrees, `git apply`, LocalAI paged patch stack, +llama.cpp fork branch `localai-paged`. + +--- + +## Task 1: Apply Patch Series + +- [x] **Step 1: Read the pinned base** + + Source: + + - `backend/cpp/llama-cpp-localai-paged/Makefile` + + Result: + + - `LLAMA_VERSION=0ed235ea2c17a19fc8238668653946721ed136fd` + +- [x] **Step 2: Apply all patches with strict `git apply`** + + Command shape: + + ```bash + git -C /home/mudler/_git/llama.cpp worktree add --detach \ + /tmp/llama-paged-series-applycheck \ + 0ed235ea2c17a19fc8238668653946721ed136fd + + for p in backend/cpp/llama-cpp-localai-paged/patches/paged/0*.patch; do + git -C /tmp/llama-paged-series-applycheck apply --verbose "$PWD/$p" + done + ``` + + Result: + + - every patch applied successfully with `git apply`. 
+ +## Task 2: Compare Tree Hash + +- [x] **Step 1: Compare applied tree to fork HEAD** + + Result: + + ```text + base=0ed235ea2c17a19fc8238668653946721ed136fd + applied_tree=5bdbf8ea3d750fe6fa1f85175fd6357d36222edb + fork_tree=5bdbf8ea3d750fe6fa1f85175fd6357d36222edb + ``` + + Canonical fork: + + - `/home/mudler/_git/llama.cpp` + - branch `localai-paged` + - HEAD `fb9402661 feat(server): trace speculative batch shapes` + +## Decision + +- [x] **Step 1: Mark mirror invariant green** + + The LocalAI `patches/paged/` series is a byte-for-byte source mirror of the + canonical fork branch at `fb9402661`. + +## Self-Review + +- No source or benchmark behavior changed. +- The check used the Makefile's strict `git apply` method, not `git am`. +- The temporary worktree was removed after verification. diff --git a/docs/superpowers/plans/2026-07-01-patch-series-mirror-readiness-phase69.md b/docs/superpowers/plans/2026-07-01-patch-series-mirror-readiness-phase69.md new file mode 100644 index 000000000000..874f9ee35cc8 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-patch-series-mirror-readiness-phase69.md @@ -0,0 +1,178 @@ +# Patch Series Mirror Readiness Phase69 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** prove the LocalAI paged patch series can be extended from `0063` to the current local fork HEAD without conflicts, while respecting the no-push-without-approval rule. + +**Architecture:** Do not edit generated patches yet. First verify the current on-disk series still matches the Phase37 fork tip, then export the missing commits into `/tmp`, apply current plus missing patches onto the pinned llama.cpp base, and compare that tree to the current local fork HEAD. 
+ +**Tech Stack:** Git worktrees, `git apply`, `git format-patch`, LocalAI paged patch stack, llama.cpp fork branch `localai-paged`. + +--- + +## Guardrails + +- Do not push `mudler/llama.cpp:localai-paged` without explicit user approval. +- Do not edit `backend/cpp/llama-cpp-localai-paged/patches/paged/*.patch` directly. +- Do not regenerate committed LocalAI patch files before the fork push step required by the repo policy. +- Use strict `git apply`, matching the LocalAI build path. +- Record drift as a first-class phase result. + +## Files + +- Create: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/docs/superpowers/plans/2026-07-01-patch-series-mirror-readiness-phase69.md` +- Modify: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md` +- Modify: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` + +--- + +### Task 1: Verify Current Mirror Baseline + +- [x] **Step 1: Confirm current LocalAI state** + +Result: + +- LocalAI HEAD: `2b2b1f0b2 docs(paged): record BF16 F32 output dense serving phase` +- Untracked files: pre-existing `.claude/` scratch files only. +- Patch-series tail: `0063-feat-cuda-trace-cublas-tensor-names.patch`. 
+ +- [x] **Step 2: Compare current patch series against Phase37 fork tip** + +Command shape: + +```bash +BASE=$(awk -F '?=' '/^LLAMA_VERSION/ {print $2}' backend/cpp/llama-cpp-localai-paged/Makefile) +CHECK=/tmp/llama-paged-series-applycheck-phase69 +git -C /home/mudler/_git/llama.cpp worktree add --detach "$CHECK" "$BASE" +for p in "$PWD"/backend/cpp/llama-cpp-localai-paged/patches/paged/0*.patch; do + git -C "$CHECK" apply --verbose "$p" +done +git -C "$CHECK" add -A +git -C "$CHECK" write-tree +git -C /home/mudler/_git/llama.cpp rev-parse 2d590d770^{tree} +``` + +Result: + +```text +base=0ed235ea2c17a19fc8238668653946721ed136fd +applied_tree=dedb1182910eafe9f6875588dc8285bfb544cce5 +patch_tip_tree=dedb1182910eafe9f6875588dc8285bfb544cce5 +fork_head_tree=fcf5720b659c5e1e2b487ccf3c8f7289bb12b9c4 +match_patch_tip=yes +match_fork_head=no +patch_count=54 +``` + +Decision: the committed LocalAI series remains correct for Phase37, but it is +intentionally behind the local fork HEAD. + +### Task 2: Dry-Run Missing Patch Export + +- [x] **Step 1: Inspect fork divergence** + +Result: + +```text +upstream=fork/localai-paged +ahead_of_upstream=26 +ahead_of_patch_tip_2d590d770=10 +fork_head=ea0875d14225a10d87a1d0e1b9b57b74c81d873e +fork_head_tree=fcf5720b659c5e1e2b487ccf3c8f7289bb12b9c4 +``` + +- [x] **Step 2: Export missing commits to `/tmp` only** + +Run: + +```bash +OUT=/tmp/phase69_missing_patches +rm -rf "$OUT" +mkdir -p "$OUT" +git -C /home/mudler/_git/llama.cpp format-patch \ + --zero-commit --no-signature --start-number 64 \ + -o "$OUT" 2d590d770..HEAD +``` + +Result: + +```text +0064-feat-server-trace-serving-admission-batches.patch +0065-feat-server-add-admission-trace-histograms.patch +0066-feat-server-add-TTFT-prefill-first-scheduler-mode.patch +0067-feat-server-cap-TTFT-prefill-first-decode-deferral.patch +0068-feat-server-gate-TTFT-defer-by-prompt-backlog.patch +0069-test-cuda-cover-W4A16-direct-activation-policy.patch 
+0070-feat-cuda-route-W4A16-direct-activation-stub.patch +0071-feat-cuda-trace-layout-tensor-names.patch +0072-feat-cuda-trace-activation-quant-routes.patch +0073-feat-cuda-gate-BF16-cuBLAS-F32-output.patch +``` + +- [x] **Step 3: Confirm source-only candidate paths** + +The temp patches touch only llama.cpp source, tests, CMake, and server files: + +```text +ggml/src/ggml-cuda/* +tests/* +tools/server/* +``` + +No markdown or LocalAI files are included in the generated candidate patches. + +### Task 3: Prove Full Projected Mirror + +- [x] **Step 1: Apply current plus temp patches to the pinned base** + +Command shape: + +```bash +BASE=$(awk -F '?=' '/^LLAMA_VERSION/ {print $2}' backend/cpp/llama-cpp-localai-paged/Makefile) +CHECK=/tmp/llama-paged-series-applycheck-phase69-full +git -C /home/mudler/_git/llama.cpp worktree add --detach "$CHECK" "$BASE" +for p in "$PWD"/backend/cpp/llama-cpp-localai-paged/patches/paged/0*.patch /tmp/phase69_missing_patches/*.patch; do + git -C "$CHECK" apply --verbose "$p" +done +git -C "$CHECK" add -A +``` + +Result: + +```text +base=0ed235ea2c17a19fc8238668653946721ed136fd +applied_plus_missing_tree=fcf5720b659c5e1e2b487ccf3c8f7289bb12b9c4 +fork_head_tree=fcf5720b659c5e1e2b487ccf3c8f7289bb12b9c4 +match_fork_head=yes +current_patch_count=54 +missing_patch_count=10 +projected_patch_count=64 +``` + +Decision: after push approval, the LocalAI patch-series regeneration path is +known: add temp-export-equivalent patches `0064..0073`, then verify the same tree +hash. The BF16 F32 opt-in is projected as patch `0073`. 
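
The Task 3 result block can be validated with a few lines of code. This is a sketch under the assumption that the result is kept as `key=value` lines exactly as shown above; `parse_kv` and `mirror_ready` are hypothetical helper names.

```python
def parse_kv(text):
    """Parse `key=value` lines into a dict, ignoring anything else."""
    out = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep:
            out[key] = value
    return out


def mirror_ready(result):
    """True if the projected series reproduces the fork HEAD tree
    and the patch counts add up."""
    trees_match = (
        result["applied_plus_missing_tree"] == result["fork_head_tree"]
    )
    counts_add_up = (
        int(result["current_patch_count"]) + int(result["missing_patch_count"])
        == int(result["projected_patch_count"])
    )
    return trees_match and counts_add_up


if __name__ == "__main__":
    sample = """\
applied_plus_missing_tree=fcf5720b659c5e1e2b487ccf3c8f7289bb12b9c4
fork_head_tree=fcf5720b659c5e1e2b487ccf3c8f7289bb12b9c4
current_patch_count=54
missing_patch_count=10
projected_patch_count=64
"""
    print(mirror_ready(parse_kv(sample)))
```

Comparing tree hashes rather than commit hashes is the point of the check: the patch series only has to reproduce the same source tree, not the fork's commit history.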
+ +### Task 4: Record and Commit Documentation + +- [x] **Step 1: Record phase result** + +Update: + +- `backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md` +- `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` + +- [x] **Step 2: Commit LocalAI docs** + +Run: + +```bash +git add -f docs/superpowers/plans/2026-07-01-patch-series-mirror-readiness-phase69.md +git add backend/cpp/llama-cpp-localai-paged/docs/PATCH_MAINTENANCE.md \ + backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +git commit -m "docs(paged): record patch mirror readiness phase" \ + -m "Assisted-by: Codex:gpt-5" +``` diff --git a/docs/superpowers/plans/2026-07-01-persistent-gate-fusion-phase43.md b/docs/superpowers/plans/2026-07-01-persistent-gate-fusion-phase43.md new file mode 100644 index 000000000000..634ce2091696 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-persistent-gate-fusion-phase43.md @@ -0,0 +1,108 @@ +# Persistent Gate Fusion Phase43 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Determine whether the Phase42 persistent/load-time F32 combined gate projection can be implemented as a low-conflict GB10 shortcut. + +**Architecture:** Inspect the Qwen35MoE tensor load and graph consumption paths, then decide whether to implement, reject, or rescope before source changes. This phase is a feasibility gate, not a production patch. + +**Tech Stack:** llama.cpp model loader, Qwen35MoE graph builder, GGUF tensor metadata, LocalAI parity docs. 
+ +--- + +### Task 1: Inspect Gate Tensor Source Paths + +**Files:** +- Read: `/home/mudler/_git/llama.cpp/src/models/qwen35moe.cpp` +- Read: `/home/mudler/_git/llama.cpp/src/llama-model-loader.cpp` +- Read: `/home/mudler/_git/llama.cpp/src/llama-model.cpp` +- Read: `/home/mudler/_git/llama.cpp/src/llama-model.h` + +- [x] **Step 1: Locate tensor creation** + +Observed in `src/models/qwen35moe.cpp`: + +```cpp +layer.ffn_gate_inp = create_tensor(tn(LLM_TENSOR_FFN_GATE_INP, "weight", il), { n_embd, n_expert }, flags); +layer.ffn_gate_inp_shexp = create_tensor(tn(LLM_TENSOR_FFN_GATE_INP_SHEXP, "weight", il), { n_embd }, flags); +``` + +- [x] **Step 2: Locate tensor consumption** + +Observed: + +```cpp +build_moe_ffn(cur, model.layers[il].ffn_gate_inp, ...); +ggml_tensor * shared_gate = build_lora_mm(model.layers[il].ffn_gate_inp_shexp, cur); +``` + +- [x] **Step 3: Locate loader support for persistent derived tensors** + +Observed: + +```text +create_tensor(...) duplicates tensors from GGUF metadata. +create_tensor_as_view(...) can create views of existing GGUF tensors. +Backend buffers are allocated from loader contexts before load_all_data(...). +No existing helper creates a new persistent derived weight from two already-loaded tensors. +``` + +### Task 2: Make Feasibility Decision + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` +- Create: `docs/superpowers/plans/2026-07-01-persistent-gate-fusion-phase43.md` + +- [x] **Step 1: Reject graph-time fallback** + +Decision: + +```text +Do not use ggml_concat() at graph time; Phase39 already rejected it because concat_layout is measurable in serving. 
+``` + +- [x] **Step 2: Reject Qwen-only loader hack** + +Decision: + +```text +Do not read both tensors back to host, allocate an extra backend weight buffer, and patch layer pointers after load. +That would create high conflict surface across mmap, offload, split buffers, MTP blocks, and state lifetime. +``` + +- [x] **Step 3: Record no-go** + +Decision: + +```text +Persistent/load-time fused gate projection is not a small GB10 shortcut. +It requires either a GGUF-exported combined weight or a general derived-weight facility in llama.cpp. +``` + +### Task 3: Verify and Commit + +**Files:** +- Modify: `docs/superpowers/plans/2026-07-01-persistent-gate-fusion-phase43.md` + +- [x] **Step 1: Verify docs** + +Run: + +```bash +git diff --check +git status --short +``` + +- [x] **Step 2: Commit** + +Run: + +```bash +git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +git add -f docs/superpowers/plans/2026-07-01-persistent-gate-fusion-phase43.md +git commit -m "docs(paged): reject persistent gate fusion shortcut" -m "Assisted-by: Codex:gpt-5" +``` diff --git a/docs/superpowers/plans/2026-07-01-prefill-bucket-attribution-phase63.md b/docs/superpowers/plans/2026-07-01-prefill-bucket-attribution-phase63.md new file mode 100644 index 000000000000..41c3840292fa --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-prefill-bucket-attribution-phase63.md @@ -0,0 +1,409 @@ +# Prefill Bucket Attribution Phase63 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. 
+ +**Goal:** Re-profile current llama.cpp and vLLM MoE prefill on GB10 with inference gates before/after, then fund only a localized paged FlashAttention mask/block-table cleanup if the profile proves the bucket is material. + +**Architecture:** Phase63 is measurement-first. It brackets all DGX work with canonical md5 and backend-op gates, captures same-shape Nsight Systems prefill profiles for llama.cpp and vLLM, reduces kernel rows into named buckets, and records a go/no-go decision before touching llama.cpp source. If the FA/mask bucket is too small, the phase closes as a documented rejection. + +**Tech Stack:** LocalAI paged docs, llama.cpp CUDA backend, Nsight Systems, DGX `dgx.casa`, `/home/mudler/bench/bucket.py`, `llama-batched-bench`, vLLM offline profiling harness. + +--- + +## Guardrails + +- Do not edit llama.cpp source until Task 4 has a positive go decision. +- Do not regenerate the LocalAI patch series in this phase. +- Do not accept any md5 drift as benign without a separate KL decision. +- Canonical gates: + - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0` + - dense md5: `5951a5b4d624ce891e22ab5fca9bc439` + - `MUL_MAT`: `1146/1146` + - `MUL_MAT_ID`: `806/806` +- DGX preflight must show `docker=0`, `local_ai_worker=0`, `compute=0`, and a free lock before starting a run. 
+ +## Files + +- Create: `docs/superpowers/plans/2026-07-01-prefill-bucket-attribution-phase63.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Read-only unless Task 4 is positive: `/home/mudler/_git/llama.cpp/src/paged-attn.cpp` +- Read-only unless Task 4 is positive: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/fattn-vec.cuh` +- Read-only unless Task 4 is positive: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/fattn-tile.cuh` +- Read-only unless Task 4 is positive: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/fattn.cu` +- Test if Task 4 is positive: `/home/mudler/_git/llama.cpp/tests/test-backend-ops.cpp` + +--- + +### Task 1: Acquire DGX and Run Pre-Gates + +- [x] **Step 1: Verify DGX is idle and acquire the phase lock** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail +docker_count=$(docker ps --format "{{.Names}}" | wc -l) +worker_count=$(pgrep -af "[l]ocal-ai-worker" | wc -l) +compute_count=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l) +lock_state=FREE +if [ -f /tmp/localai-gb10.lock ]; then lock_state=$(cat /tmp/localai-gb10.lock); fi +printf "docker=%s local_ai_worker=%s compute=%s lock=%s\n" "$docker_count" "$worker_count" "$compute_count" "$lock_state" +test "$docker_count" = 0 +test "$worker_count" = 0 +test "$compute_count" = 0 +case "$lock_state" in FREE*|FREE-no-lock) : ;; *) exit 3 ;; esac +printf "codex-phase63-prefill-bucket %s\n" "$(date +%s)" > /tmp/localai-gb10.lock' +``` + +Expected: one line containing `docker=0 local_ai_worker=0 compute=0 lock=FREE...`, exit code `0`, and `/tmp/localai-gb10.lock` owned by `codex-phase63-prefill-bucket`. + +Result: initial preflight showed `docker=0`, `compute=0`, and no real +`local-ai-worker` process. 
The first direct gate retry exposed a shell issue:
+under `set -euo pipefail`, a `pgrep` that matches nothing exits with status `1`,
+which fails the `pgrep | wc -l` pipeline and aborts the script before the
+`printf` runs, so the execution command uses
+`(pgrep -af '[l]ocal-ai-worker' || true) | wc -l`.
+
+- [x] **Step 2: Run canonical pre-gate**
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail
+cd /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged
+ART=/home/mudler/bench/phase63_prefill_bucket/$(date +%Y%m%d_%H%M%S)
+mkdir -p "$ART"
+echo "$ART" > /tmp/phase63_artifact_dir
+./scripts/paged-inference-gates.sh /home/mudler/llama-phase6-source/build-cuda/bin "$ART/pre_gate" | tee "$ART/pre_gate.log"'
+```
+
+Expected:
+
+```text
+moe md5 OK: 8cb0ce23777bf55f92f63d0292c756b0
+dense md5 OK: 5951a5b4d624ce891e22ab5fca9bc439
+  1146/1146 tests passed
+  806/806 tests passed
+paged inference gates OK
+```
+
+Result:
+
+```text
+docker=0 local_ai_worker=0 compute=0 lock=FREE-no-lock
+pre moe_md5 8cb0ce23777bf55f92f63d0292c756b0 8cb0ce23777bf55f92f63d0292c756b0 ok
+pre dense_md5 5951a5b4d624ce891e22ab5fca9bc439 5951a5b4d624ce891e22ab5fca9bc439 ok
+pre MUL_MAT 1146/1146 1146/1146 ok
+pre MUL_MAT_ID 806/806 806/806 ok
+paged inference gates OK
+```
+
+Artifact: `/home/mudler/bench/phase63_prefill_bucket/20260701_140127`.
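
The `pgrep` failure noted in the Step 1 result can be reproduced without the DGX. The sketch below is illustrative only: the bracketed process pattern is a made-up name chosen so that nothing matches, and `count_matches` is a hypothetical helper, not part of the gate scripts.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: count processes matching a pattern without letting a
# "no match" result (non-zero pgrep exit status) fail the pipeline.
count_matches() {
  (pgrep -af "$1" || true) | wc -l
}

# Unguarded form, run in a separate bash process so that its failure does not
# abort this script. With pipefail the pipeline inherits pgrep's non-zero
# status, the assignment fails, and `set -e` exits before the echo runs.
if bash -c 'set -euo pipefail
            n=$(pgrep -af "[n]o-such-worker-xyz" | wc -l)
            echo "n=$n"' >/dev/null 2>&1; then
  unguarded=ok
else
  unguarded=failed
fi

# Guarded form: `|| true` absorbs the non-zero status, so wc still reports 0.
guarded=$(count_matches '[n]o-such-worker-xyz')

printf 'unguarded=%s guarded=%s\n' "$unguarded" "$((guarded))"
```

The guard is applied to `pgrep` alone rather than to the whole pipeline, so a failure in `wc` would still surface.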
+ +--- + +### Task 2: Capture Current llama.cpp Prefill Profiles + +- [x] **Step 1: Run `npp=512` and `npp=2048` llama.cpp prefill profiles** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail +ART=$(cat /tmp/phase63_artifact_dir) +BIN=/home/mudler/llama-phase6-source/build-cuda/bin/llama-batched-bench +MODEL=/home/mudler/bench/q36-35b-a3b-nvfp4.gguf +for npp in 512 2048; do + REP="$ART/llama_moe_prefill_npp${npp}" + rm -f "$REP.nsys-rep" "$REP.sqlite" "$REP.log" "$REP.buckets.txt" + env LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 GGML_CUDA_DISABLE_GRAPHS=1 \ + nsys profile --trace=cuda --sample=none --cpuctxsw=none --force-overwrite true -o "$REP" \ + "$BIN" -m "$MODEL" -c 131072 -b 2048 -ub 512 -ngl 99 -fa on \ + -npp "$npp" -ntg 4 -npl 32 > "$REP.log" 2>&1 + nsys stats --report cuda_gpu_kern_sum --format csv --force-export true -o "$REP.kern" "$REP.nsys-rep" >/dev/null + python3 /home/mudler/bench/bucket.py "$REP.nsys-rep" "phase63_llama_npp${npp}" > "$REP.buckets.txt" + grep -E "main:|pp|tg|llama_print_timings|error|failed|CUDA" "$REP.log" | tail -40 > "$REP.summary.txt" || true +done' +``` + +Expected: + +- `llama_moe_prefill_npp512.nsys-rep`, `.kern_cuda_gpu_kern_sum.csv`, `.buckets.txt`, `.log`. +- `llama_moe_prefill_npp2048.nsys-rep`, `.kern_cuda_gpu_kern_sum.csv`, `.buckets.txt`, `.log`. +- Logs contain no `error`, `failed`, or CUDA runtime failure. + +Result: both profiles completed under +`/home/mudler/bench/phase63_prefill_bucket/20260701_140127`. 
+ +- [x] **Step 2: Extract llama bucket rows for the decision table** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail +ART=$(cat /tmp/phase63_artifact_dir) +for f in "$ART"/llama_moe_prefill_npp*.buckets.txt; do + echo "==== $f ====" + sed -n "/--- MACRO buckets ---/,/--- FINE buckets ---/p" "$f" + sed -n "/--- FINE buckets ---/,/--- top UNCLASSIFIED ---/p" "$f" | \ + egrep "mmq_nvfp4|act_quant|gdn_core|fa|argsort|mm_ids|gather_mmq|get_rows|copy_layout|concat_layout|convert_dtype" || true +done | tee "$ART/llama_bucket_extract.txt"' +``` + +Expected: extract includes rows for `MoE/FFN-GEMM`, `GDN`, `act-quant`, and `FA`; FA may be small. + +Result: + +| npp | MoE/FFN-GEMM | GDN | bf16-proj | layout-copy | act-quant | MoE-dispatch | gather | FA | +|-----|--------------|-----|-----------|-------------|-----------|--------------|--------|----| +| 512 | `40.48%` | `18.00%` | `10.19%` | `7.82%` | `4.47%` | `1.94%` | `1.26%` | `0.71%` | +| 2048 | `41.06%` | `16.15%` | `9.97%` | `7.96%` | `4.61%` | `2.12%` | `1.36%` | `1.18%` | + +The FA bucket is below the Phase63 reject threshold before any source work. 
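
That comparison can be sketched as a small check against the `5%` reject threshold defined in Task 4. The helper name `below_threshold` is illustrative, and the FA shares are copied by hand from the table above rather than parsed from `bucket.py` output.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Illustrative helper: succeeds when a percentage share is strictly below a limit.
below_threshold() {
  awk -v share="$1" -v limit="$2" 'BEGIN { exit !(share + 0 < limit + 0) }'
}

reject_pct=5   # Task 4 rule: reject source work if FA is below 5% at npp=2048

# "npp FA-share" pairs copied from the llama.cpp bucket table.
for row in "512 0.71" "2048 1.18"; do
  set -- $row
  if below_threshold "$2" "$reject_pct"; then
    verdict="below reject threshold"
  else
    verdict="at or above reject threshold"
  fi
  printf 'npp=%s FA=%s%% -> %s\n' "$1" "$2" "$verdict"
done
```

Both rows fall below the limit, which is why the phase can close without touching llama.cpp source.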
+ +--- + +### Task 3: Capture vLLM Same-Shape Prefill Profiles + +- [x] **Step 1: Run vLLM `PT=512` and `PT=2048` prefill profiles** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail +ART=$(cat /tmp/phase63_artifact_dir) +export PATH=$HOME/vllm-bench/bin:$PATH +export HF_HUB_OFFLINE=1 +for pt in 512 2048; do + REP="$ART/vllm_moe_prefill_pt${pt}" + rm -f "$REP.nsys-rep" "$REP.sqlite" "$REP.log" "$REP.buckets.txt" + env NSEQ=32 PT="$pt" GEN=1 NREP=3 \ + nsys profile --capture-range=cudaProfilerApi --capture-range-end=stop \ + --trace=cuda --sample=none --cpuctxsw=none --force-overwrite true -o "$REP" \ + $HOME/vllm-bench/bin/python /home/mudler/bench/vllm_prefill_prof.py > "$REP.log" 2>&1 + nsys stats --report cuda_gpu_kern_sum --format csv --force-export true -o "$REP.kern" "$REP.nsys-rep" >/dev/null + python3 /home/mudler/bench/bucket.py "$REP.nsys-rep" "phase63_vllm_pt${pt}" > "$REP.buckets.txt" + grep -E "TIMING|PROFILED|Error|Traceback|RuntimeError|CUDA" "$REP.log" | tail -40 > "$REP.summary.txt" || true +done' +``` + +Expected: + +- `vllm_moe_prefill_pt512.nsys-rep`, `.kern_cuda_gpu_kern_sum.csv`, `.buckets.txt`, `.log`. +- `vllm_moe_prefill_pt2048.nsys-rep`, `.kern_cuda_gpu_kern_sum.csv`, `.buckets.txt`, `.log`. +- Logs contain `TIMING ... S_PP=...`, `PROFILED PREFILL START`, and `PROFILED END`. + +Result: both vLLM profiles completed under +`/home/mudler/bench/phase63_prefill_bucket/20260701_140127`. 
+Timing: + +| PT | S_PP | +|----|------| +| 512 | `5315.6 tok/s` | +| 2048 | `5384.4 tok/s` | + +- [x] **Step 2: Extract vLLM bucket rows for the decision table** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail +ART=$(cat /tmp/phase63_artifact_dir) +for f in "$ART"/vllm_moe_prefill_pt*.buckets.txt; do + echo "==== $f ====" + sed -n "/--- MACRO buckets ---/,/--- FINE buckets ---/p" "$f" + sed -n "/--- FINE buckets ---/,/--- top UNCLASSIFIED ---/p" "$f" | \ + egrep "vllm_fa|fla_gdn|vllm_dispatch|vllm_fp4_gemm|torch_ew|rmsnorm|triton|scaled|quant" || true +done | tee "$ART/vllm_bucket_extract.txt"' +``` + +Expected: extract includes vLLM rows for `MoE/FFN-GEMM`, `GDN`, `FA`, and dispatch/glue. + +Result: + +| PT | ew(misc) | GDN | FA | bf16-proj | MoE-dispatch | top `other` rows | +|----|----------|-----|----|-----------|--------------|------------------| +| 512 | `32.97%` | `18.34%` | `0.73%` | `3.41%` | `1.37%` | Marlin MoE `1940.99ms`, FP8 projection `565.74ms` | +| 2048 | `33.48%` | `18.00%` | `1.75%` | `1.06%` | `0.49%` | Marlin MoE `7745.84ms`, FP8 projection `3047.75ms` | + +--- + +### Task 4: Decide Whether a Source Patch Is Funded + +- [x] **Step 1: Apply the Phase63 decision gate** + +Use these rules: + +- Continue to a source patch only if llama.cpp FA or paged-mask-related work is at least `8%` of prefill GPU kernel time at `npp>=2048`, or it accounts for at least `15 us/tok` versus vLLM at the same shape. +- Reject source work if FA is below `5%` of llama.cpp prefill kernel time at `npp=2048`. +- Reject source work if the profile again points primarily at already-rejected GDN, W4A16, MTP, small-M MMQ, or gate-projection buckets. +- If continuing, keep the source target limited to physical mask/block-table indexing for paged FlashAttention and an explicit `FLASH_ATTN_EXT` block-table backend-op test. + +Expected: write a short decision paragraph into `GB10_PARITY_PHASE0_RESULTS.md`. + +Result: reject source work for Phase63. 
llama.cpp FA was `0.71%` at `npp=512`
+and `1.18%` at `npp=2048`, below the `5%` source-work reject threshold. At
+`npp=2048`, llama FA was `320.66ms` over `65536` prompt tokens (`32 x 2048`),
+about `4.9 us/tok`; vLLM FA was `618.02ms` over `196608` prompt tokens
+(`32 x 2048 x 3`, matching `NREP=3`), about `3.1 us/tok`. The approximate FA
+delta is only `1.7 us/tok`, below the `15 us/tok` source-funding gate.
+
+- [x] **Step 2: If the source gate is negative, skip directly to Task 6**
+
+Expected: no source files modified.
+
+Result: no llama.cpp source files were modified.
+
+---
+
+### Task 5: Optional Source Patch Only If Task 4 Is Positive
+
+Skipped: Task 4 rejected source work.
+
+- [ ] **Step 1: Add the missing block-table FlashAttention backend-op case first**
+
+Modify `/home/mudler/_git/llama.cpp/tests/test-backend-ops.cpp` so `FLASH_ATTN_EXT` has a paged/block-table mask case that fails before any mask-indexing implementation.
+
+Run:
+
+```bash
+ssh dgx.casa 'set -euo pipefail
+cd /home/mudler/llama-phase6-source
+cmake --build build-cuda --target test-backend-ops -j $(nproc)
+./build-cuda/bin/test-backend-ops test -b CUDA0 -o FLASH_ATTN_EXT -j 1'
+```
+
+Expected before implementation: the new block-table case fails or is skipped with an explicit unsupported path that proves the gap.
+
+- [ ] **Step 2: Implement physical mask indexing behind the existing block-table dispatch**
+
+Modify only the narrow paged-FA files:
+
+- `/home/mudler/_git/llama.cpp/src/paged-attn.cpp`
+- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/fattn-vec.cuh`
+- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/fattn-tile.cuh`
+- `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/fattn.cu`
+
+The implementation must remove mask compaction only when a block table is present and the CUDA kernel is using the physical-mask path. Non-paged attention must keep the existing mask layout.
+ +- [ ] **Step 3: Run correctness and inference gates** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail +cd /home/mudler/llama-phase6-source +cmake --build build-cuda --target llama-completion llama-batched-bench test-backend-ops -j $(nproc) +ART=$(cat /tmp/phase63_artifact_dir) +./build-cuda/bin/test-backend-ops test -b CUDA0 -o FLASH_ATTN_EXT -j 1 | tee "$ART/flash_attn_ext_post.log" +cd /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged +./scripts/paged-inference-gates.sh /home/mudler/llama-phase6-source/build-cuda/bin "$ART/post_patch_gate" | tee "$ART/post_patch_gate.log"' +``` + +Expected: `FLASH_ATTN_EXT` passes, canonical md5s match, `MUL_MAT` is `1146/1146`, and `MUL_MAT_ID` is `806/806`. + +- [ ] **Step 4: Run the A/B performance gate** + +Run baseline and patched builds with: + +```bash +env LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 ./llama-batched-bench \ + -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -c 131072 -b 2048 -ub 512 -ngl 99 -fa on \ + -npp 128 -ntg 128 -npl 128,256 +``` + +Keep only if the patch improves decode `S_TG` by at least `1.0%` at `npl=128` or `npl=256`, or reduces graph-node-traced decode wall by at least `0.5 ms/step`, with no md5/op drift. 
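
The `S_TG` half of that keep rule is a simple percentage comparison, sketched below. The helper names are illustrative and the throughput values are made up for demonstration; they are not measured results, and the `0.5 ms/step` alternative and the md5/op gates are not modelled here.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Illustrative helper: percentage gain of patched over baseline throughput.
pct_gain() {
  awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2f", 100 * (new - base) / base }'
}

# Illustrative helper: succeeds when the gain meets the 1.0% S_TG keep rule.
meets_keep_rule() {
  awk -v gain="$1" 'BEGIN { exit !(gain + 0 >= 1.0) }'
}

# Made-up S_TG values in tok/s, for demonstration only.
baseline_s_tg=1000.0
patched_s_tg=1012.5

gain=$(pct_gain "$baseline_s_tg" "$patched_s_tg")
if meets_keep_rule "$gain"; then
  echo "gain=${gain}% -> keep (subject to md5/op gates)"
else
  echo "gain=${gain}% -> drop"
fi
```

Running the check separately for `npl=128` and `npl=256` matches the rule as written, since either shape alone is enough to keep the patch.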
+ +--- + +### Task 6: Post-Gate, Release DGX, and Record Result + +- [x] **Step 1: Run canonical post-gate** + +Run: + +```bash +ssh dgx.casa 'set -euo pipefail +ART=$(cat /tmp/phase63_artifact_dir) +cd /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged +./scripts/paged-inference-gates.sh /home/mudler/llama-phase6-source/build-cuda/bin "$ART/post_gate" | tee "$ART/post_gate.log"' +``` + +Expected: + +```text +moe md5 OK: 8cb0ce23777bf55f92f63d0292c756b0 +dense md5 OK: 5951a5b4d624ce891e22ab5fca9bc439 + 1146/1146 tests passed + 806/806 tests passed +paged inference gates OK +``` + +Result: + +```text +post moe_md5 8cb0ce23777bf55f92f63d0292c756b0 8cb0ce23777bf55f92f63d0292c756b0 ok +post dense_md5 5951a5b4d624ce891e22ab5fca9bc439 5951a5b4d624ce891e22ab5fca9bc439 ok +post MUL_MAT 1146/1146 1146/1146 ok +post MUL_MAT_ID 806/806 806/806 ok +post paged inference gates OK +``` + +- [x] **Step 2: Release DGX lock** + +Run: + +```bash +ssh dgx.casa 'printf "FREE released-by-codex-phase63-prefill-bucket %s\n" "$(date +%s)" > /tmp/localai-gb10.lock' +``` + +Expected: `/tmp/localai-gb10.lock` starts with `FREE released-by-codex-phase63-prefill-bucket`. + +Result: `/tmp/localai-gb10.lock` is +`FREE released-by-codex-phase63-prefill-bucket 1782908317`; Docker count `0`, +worker count `0`, and no compute-app rows. + +- [x] **Step 3: Update LocalAI docs** + +Modify: + +- `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` +- `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` + +Record: + +- artifact directory, +- pre/post gate md5s and op counts, +- llama and vLLM bucket table, +- Task 4 decision, +- source patch commit if any, or explicit source-work rejection. + +Result: completed in this commit. 
+ +- [x] **Step 4: Commit LocalAI tracking docs** + +Run: + +```bash +cd /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention +git add -f docs/superpowers/plans/2026-07-01-prefill-bucket-attribution-phase63.md +git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md +git commit -m "docs(paged): record prefill bucket attribution phase" \ + -m "Assisted-by: Codex:gpt-5" +``` + +Expected: commit succeeds without bypassing hooks. + +Result: committed as `6fc2cfb54 docs(paged): record prefill bucket attribution +phase`, then amended to mark this final checklist item complete. + +--- + +## Self-Review + +- Spec coverage: The plan directly covers the user's inferencing-safety request with pre/post md5 and op gates, uses DGX only after idle preflight, scopes Phase63 as measurement before source work, and limits any source follow-up to a localized FA/mask candidate. +- Placeholder scan: No `TBD`, `TODO`, or undefined test command remains. +- Type/path consistency: Artifact path, gate command, model paths, and binary paths are consistent across tasks. diff --git a/docs/superpowers/plans/2026-07-01-quant-kernel-timing-phase66.md b/docs/superpowers/plans/2026-07-01-quant-kernel-timing-phase66.md new file mode 100644 index 000000000000..61a8a1e030e0 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-quant-kernel-timing-phase66.md @@ -0,0 +1,107 @@ +# Quant Kernel Timing Phase66 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Time the Phase65 activation-quant candidate kernels directly and decide whether a source optimization is funded. 
+ +**Architecture:** Use the already-gated Phase65 llama.cpp binary on DGX and collect an Nsight Systems CUDA kernel summary for the same MoE `npp=512`, `ntg=4`, `npl=32` prefill shape. Compare `quantize_mmq_nvfp4` and `gather_mmq_fp4` against total GPU kernel time. + +**Tech Stack:** llama.cpp CUDA backend, Nsight Systems 2025.3.2, DGX GB10 benchmark host, LocalAI parity docs. + +--- + +## Files + +- Create: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/docs/superpowers/plans/2026-07-01-quant-kernel-timing-phase66.md` +- Modify: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` +- Modify: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` + +--- + +### Task 1: DGX Profile + +- [x] **Step 1: Confirm DGX is idle** + +Observed before profiling: lock `FREE`, Docker `0`, `local-ai-worker` `0`, +compute apps `0`. + +- [x] **Step 2: Acquire lock** + +Observed lock owner: `codex-phase66-quant-kernel-timing 1782909776`. + +- [x] **Step 3: Run Nsight Systems profile** + +Artifact: `/home/mudler/bench/phase66_quant_kernel_timing/20260701_144256`. 
+ +Command shape: + +```bash +LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 GGML_CUDA_DISABLE_GRAPHS=1 \ + nsys profile --trace=cuda,nvtx --cuda-graph-trace=node --force-overwrite=true \ + --sample=none --cpuctxsw=none \ + -o "$ART/quant_npp512" \ + ./llama-batched-bench -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf \ + -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512 -ntg 4 -npl 32 +``` + +- [x] **Step 4: Generate CUDA kernel summary** + +Generated: + +```text +/home/mudler/bench/phase66_quant_kernel_timing/20260701_144256/quant_npp512_kern_sum_cuda_gpu_kern_sum.csv +``` + +--- + +### Task 2: Decide + +- [x] **Step 1: Extract candidate kernel timing** + +Observed total GPU kernel time: `7108388986 ns`. + +| kernel | time | instances | share | +|--------|-----:|----------:|------:| +| `quantize_mmq_nvfp4` | `317205504 ns` | `8884` | `4.46%` | +| `gather_mmq_fp4` | `45374880 ns` | `2960` | `0.64%` | +| combined | `362580384 ns` | - | `5.10%` | + +- [x] **Step 2: Source decision** + +Reject a Phase66 gather/quant source optimization. `gather_mmq_fp4` is not +material, and `quantize_mmq_nvfp4 + gather_mmq_fp4` is below the `8%` source +funding threshold for this profiled shape. A W4A16/no-activation-quant rewrite +has already been rejected in earlier phases, so do not reopen it from this data. + +- [x] **Step 3: Release lock** + +Observed release state: + +```text +FREE released-by-codex-phase66-quant-kernel-timing 1782909826 +docker=0 +local_ai_worker=0 +compute_apps=0 +``` + +--- + +### Task 3: Commit and Record + +- [x] **Step 1: Record LocalAI docs** + +This plan and parity docs record the Phase66 no-go decision. 
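
The shares in the Task 2 table follow directly from the reported nanosecond totals. The sketch below recomputes them; `share_pct` is an illustrative helper, and the inputs are the figures quoted in this plan rather than values read from the Nsight CSV.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Illustrative helper: share of total GPU kernel time, as a percentage
# rounded to two decimals.
share_pct() {
  awk -v part="$1" -v total="$2" 'BEGIN { printf "%.2f", 100 * part / total }'
}

total_ns=7108388986    # observed total GPU kernel time
quant_ns=317205504     # quantize_mmq_nvfp4
gather_ns=45374880     # gather_mmq_fp4
combined_ns=$((quant_ns + gather_ns))

printf 'quantize_mmq_nvfp4 %s%%\n' "$(share_pct "$quant_ns" "$total_ns")"
printf 'gather_mmq_fp4     %s%%\n' "$(share_pct "$gather_ns" "$total_ns")"
printf 'combined           %s%%\n' "$(share_pct "$combined_ns" "$total_ns")"
```

This reproduces the `4.46%`, `0.64%`, and `5.10%` shares, all below the `8%` source-funding threshold.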
+ +- [x] **Step 2: Commit LocalAI docs** + +Expected commit: + +```bash +git add -f docs/superpowers/plans/2026-07-01-quant-kernel-timing-phase66.md +git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md +git commit -m "docs(paged): record quant kernel timing phase" \ + -m "Assisted-by: Codex:gpt-5" +``` diff --git a/docs/superpowers/plans/2026-07-01-quant-trace-phase65.md b/docs/superpowers/plans/2026-07-01-quant-trace-phase65.md new file mode 100644 index 000000000000..ce3e2f7d4849 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-quant-trace-phase65.md @@ -0,0 +1,277 @@ +# Quant Trace Phase65 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Attribute the remaining activation-quant and FP4 prefill quantization bucket with a default-off llama.cpp diagnostic patch, without changing inferencing by default. + +**Architecture:** Add bounded stderr tracing at the CUDA call sites that launch activation quantization for MMQ and native large-M FP4 prefill. The trace records route, tensor names, tensor shapes, dedup/gather status, and padded K/M dimensions so Phase65 can decide whether a real source optimization is funded. + +**Tech Stack:** llama.cpp CUDA backend, LocalAI parity docs, DGX GB10 benchmark host, canonical md5 and `test-backend-ops` gates. + +--- + +## Guardrails + +- Do not change default inferencing behavior. `LLAMA_QUANT_TRACE` unset or `0` must only add inert helper code. +- Keep the source patch small and incremental. Prefer local helper functions in existing CUDA files over new cross-file abstractions. 
+- Gate every source change with: + - MoE paged md5: `8cb0ce23777bf55f92f63d0292c756b0` + - dense md5: `5951a5b4d624ce891e22ab5fca9bc439` + - `test-backend-ops` `MUL_MAT` all passed + - `test-backend-ops` `MUL_MAT_ID` all passed +- Do not regenerate LocalAI patch files in this phase unless explicitly approved. +- Do not push without explicit approval. + +## Files + +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmq.cu` +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/fp4-gemm.cu` +- Create: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/docs/superpowers/plans/2026-07-01-quant-trace-phase65.md` +- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` +- Modify after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` + +--- + +### Task 1: Add MMQ Quant Trace + +- [x] **Step 1: Add default-off trace helpers to `mmq.cu`** + +Add local helpers near the top of `ggml/src/ggml-cuda/mmq.cu`: + +```c++ +static inline int ggml_cuda_quant_trace_limit(); +static inline const char * ggml_cuda_quant_trace_tensor_name(const ggml_tensor * t); +static inline void ggml_cuda_quant_trace( + const char * route, const ggml_tensor * src0, const ggml_tensor * src1, + const ggml_tensor * ids, const ggml_tensor * dst, int native_fp4, + int dedup, int gathered, int64_t ne10, int64_t ne10_padded, + int64_t rows, int64_t ne12, int64_t n_expert_used); +``` + +The helper reads `LLAMA_QUANT_TRACE`, uses a static atomic counter, and prints one line per trace: + +```text +[LLAMA_QUANT_TRACE] route=... src0=... src0_type=... src1=... dst=... ids=... native_fp4=... dedup=... gathered=... K=... Kpad=... rows=... ne12=... 
experts=... +``` + +- [x] **Step 2: Trace dense MMQ quantization** + +Before the dense `quantize_mmq_fp4_cuda` or `quantize_mmq_q8_1_cuda` call, emit: + +```c++ +ggml_cuda_quant_trace("mmq_dense", src0, src1, ids, dst, use_native_fp4 ? 1 : 0, + 0, 0, ne10, ne10_padded, ne11, ne12, 0); +``` + +- [x] **Step 3: Trace MoE MMQ quantization paths** + +In the `ids` path: + +```c++ +ggml_cuda_quant_trace("mmq_moe_dedup_unique", src0, src1, ids, dst, use_native_fp4 ? 1 : 0, + 1, 0, ne10, ne10_padded, ne12, ne12, n_expert_used); +ggml_cuda_quant_trace("mmq_moe_gather", src0, src1, ids, dst, use_native_fp4 ? 1 : 0, + 1, 1, ne10, ne10_padded, ne11_flat, ne12, n_expert_used); +ggml_cuda_quant_trace("mmq_moe_flat", src0, src1, ids, dst, use_native_fp4 ? 1 : 0, + 0, 0, ne10, ne10_padded, ne11_flat, ne12, n_expert_used); +``` + +Only emit the route that is actually launched. + +- [x] **Step 4: Run local syntax checks** + +Run: + +```bash +git -C /home/mudler/_git/llama.cpp diff --check +``` + +Expected: exit `0`. + +--- + +### Task 2: Add Native FP4 Prefill Quant Trace + +- [x] **Step 1: Add local helpers to `fp4-gemm.cu`** + +Add a small `LLAMA_QUANT_TRACE` helper near `ggml_cuda_fp4_prefill_m()` that prints route `fp4_prefill_act_split` with `src0`, `src1`, `dst`, `K`, `M`, `Mpad`, and `Kb`. + +- [x] **Step 2: Emit before `fp4_quantize_act_split`** + +In `ggml_cuda_mul_mat_fp4_large_m`, emit the trace immediately before the activation split launch: + +```c++ +ggml_cuda_fp4_quant_trace("fp4_prefill_act_split", src0, src1, dst, K, M, Mpad, Kb); +``` + +- [x] **Step 3: Run local syntax checks** + +Run: + +```bash +git -C /home/mudler/_git/llama.cpp diff --check +``` + +Expected: exit `0`. 
+ +--- + +### Task 3: DGX Build and Gates + +- [x] **Step 1: Confirm DGX is idle** + +Run: + +```bash +ssh dgx.casa 'cat /tmp/localai-gb10.lock 2>/dev/null || true; docker ps --format "{{.Names}}" | wc -l; (pgrep -af "[l]ocal-ai-worker" || true) | wc -l; nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv,noheader | wc -l' +``` + +Expected: lock `FREE*`, docker `0`, worker `0`, compute apps `0`. + +- [x] **Step 2: Acquire the lock** + +Run: + +```bash +ssh dgx.casa 'printf "codex-phase65-quant-trace %s\n" "$(date +%s)" > /tmp/localai-gb10.lock; cat /tmp/localai-gb10.lock' +``` + +- [x] **Step 3: Apply patch and build on DGX** + +Run the existing phase-source mirror flow for `/home/mudler/llama-phase6-source`, then: + +```bash +ssh dgx.casa 'cd /home/mudler/llama-phase6-source && cmake --build build-cuda --target llama-completion llama-batched-bench test-backend-ops -j $(nproc)' +``` + +Expected: exit `0`. + +- [x] **Step 4: Run inference and op gates** + +Run the canonical MoE and dense md5 commands plus: + +```bash +./test-backend-ops test -o MUL_MAT +./test-backend-ops test -o MUL_MAT_ID +``` + +Expected: + +```text +MoE md5 8cb0ce23777bf55f92f63d0292c756b0 +dense md5 5951a5b4d624ce891e22ab5fca9bc439 +MUL_MAT all passed +MUL_MAT_ID all passed +``` + +Result artifact: `/home/mudler/bench/phase65_quant_trace/20260701_143729`. 
+ +Observed: + +```text +MoE md5 8cb0ce23777bf55f92f63d0292c756b0 +dense md5 5951a5b4d624ce891e22ab5fca9bc439 +MUL_MAT 1146/1146 +MUL_MAT_ID 806/806 +``` + +--- + +### Task 4: Trace and Decide + +- [x] **Step 1: Run bounded quant trace** + +Run MoE prefill with graphs disabled for log readability: + +```bash +LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 GGML_CUDA_DISABLE_GRAPHS=1 LLAMA_QUANT_TRACE=12000 \ + ./llama-batched-bench -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf \ + -c 131072 -b 2048 -ub 512 -ngl 99 -fa on -npp 512 -ntg 4 -npl 32 +``` + +- [x] **Step 2: Summarize trace routes** + +Expected summary keys: + +```text +mmq_dense +mmq_moe_dedup_unique +mmq_moe_gather +mmq_moe_flat +fp4_prefill_act_split +``` + +Observed default-path route counts: + +| route | lines | +|-------|------:| +| `mmq_dense` | `4444` | +| `mmq_moe_dedup_unique` | `2960` | +| `mmq_moe_gather` | `2960` | +| `mmq_moe_flat` | `1480` | + +Dominant `npp=512` shapes: + +| count | route | source family | K | rows | ne12 | +|------:|-------|---------------|---:|-----:|-----:| +| `2560` | `mmq_moe_dedup_unique` | gate/up experts | `2048` | `512` | `512` | +| `2560` | `mmq_moe_gather` | gate/up experts | `2048` | `4096` | `512` | +| `2560` | `mmq_dense` | shared expert gate/up | `2048` | `512` | `1` | +| `1280` | `mmq_moe_flat` | down experts | `512` | `4096` | `512` | +| `1280` | `mmq_dense` | shared expert down | `512` | `512` | `1` | + +`fp4_prefill_act_split` did not appear in the default trace because the native +large-M FP4 prefill route remains opt-in. + +- [x] **Step 3: Source decision** + +Fund a Phase66 source optimization only if one route is repeated, named, and material enough to plausibly remove at least `8%` of llama.cpp prefill time or at least `15 us/tok` cross-engine gap. Otherwise close Phase65 as attribution-only. + +Decision: keep Phase65 as instrumentation plus attribution. Do not implement a +quantization optimization directly from route counts. 
Phase66 should first time +`quantize_mmq_nvfp4` versus `gather_mmq_fp4` with nsys/NVTX, because the trace +shows a repeated MoE gate/up dedup-and-gather chain but does not prove whether +the gather is the material part or just a cheap consequence of the existing +dedup optimization. + +- [x] **Step 4: Release DGX lock** + +Run: + +```bash +ssh dgx.casa 'printf "FREE released-by-codex-phase65-quant-trace %s\n" "$(date +%s)" > /tmp/localai-gb10.lock' +``` + +--- + +### Task 5: Commit and Record + +- [x] **Step 1: Commit llama.cpp source patch** + +Commit only after build and gates pass: + +```bash +git -C /home/mudler/_git/llama.cpp add ggml/src/ggml-cuda/mmq.cu ggml/src/ggml-cuda/fp4-gemm.cu +git -C /home/mudler/_git/llama.cpp commit -m "feat(cuda): trace activation quant routes" -m "Assisted-by: Codex:gpt-5" +``` + +Result: + +- Local fork: `afc2c7030 feat(cuda): trace activation quant routes` +- DGX mirror: `7863194bd feat(cuda): trace activation quant routes` + +- [x] **Step 2: Record LocalAI docs** + +Update parity docs with artifact path, gate values, route distribution, and Phase66 decision. 
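
The route distribution can be tallied from a captured trace log with standard tools. The sketch below assumes the `[LLAMA_QUANT_TRACE] route=...` line format described in Task 1; the sample lines are abbreviated stand-ins for a real log, and `summarize_routes` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper: read a trace log on stdin and print "<count> <route>"
# rows, most frequent route first. The `|| true` keeps an empty log from
# failing the pipeline under pipefail.
summarize_routes() {
  { grep -F '[LLAMA_QUANT_TRACE]' || true; } \
    | sed -n 's/.*route=\([^ ]*\).*/\1/p' \
    | sort | uniq -c | sort -k1,1nr -k2,2 \
    | awk '{ print $1, $2 }'
}

# Stand-in sample; a real run would capture the benchmark's stderr to a file.
sample_log=$(cat <<'EOF'
[LLAMA_QUANT_TRACE] route=mmq_dense K=2048 rows=512
[LLAMA_QUANT_TRACE] route=mmq_moe_gather K=2048 rows=4096
unrelated benchmark output
[LLAMA_QUANT_TRACE] route=mmq_dense K=512 rows=512
EOF
)

printf '%s\n' "$sample_log" | summarize_routes
```

Grouping on further fields such as `K` and `rows` would extend the same pipeline to the dominant-shape table.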
+ +- [x] **Step 3: Commit LocalAI docs** + +```bash +git add -f docs/superpowers/plans/2026-07-01-quant-trace-phase65.md +git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md +git commit -m "docs(paged): record quant trace phase" \ + -m "Assisted-by: Codex:gpt-5" +``` diff --git a/docs/superpowers/plans/2026-07-01-served-model-name-phase46.md b/docs/superpowers/plans/2026-07-01-served-model-name-phase46.md new file mode 100644 index 000000000000..10520f6dfa3e --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-served-model-name-phase46.md @@ -0,0 +1,155 @@ +# Phase46 Served Model Name Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Let the audited serving snapshot harness run MoE, dense, or hardware-pivot model variants without hardcoded `q36` model names. + +**Architecture:** Add a single `SERVED_MODEL_NAME` environment variable to `paged-current-serving-snapshot.sh`, defaulting to `q36`. Use it consistently for vLLM `--served-model-name`, vLLM model readiness checks, and h2h `--model` requests on both engines. Print it in `DRY_RUN=1` output so hardware-pivot runs can be audited before launching servers. + +**Tech Stack:** Bash serving harness, DGX dry-run preflight, LocalAI parity docs. + +--- + +### Task 1: Prove the override is missing + +**Files:** +- Test: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + +- [x] **Step 1: Run help-text red check** + +```bash +backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help | grep -F 'SERVED_MODEL_NAME' +``` + +Expected: exit `1`, because the harness does not document the model-name override yet. 
+ +- [x] **Step 2: Run DGX dry-run red check** + +```bash +ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase46_served_model_name_dryrun_red/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin ART=$ART NPL="1" PARALLEL=1 CTX=4096 PTOK=16 GEN=4 DRY_RUN=1 SERVED_MODEL_NAME=dense-q36 OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh | grep -F 'SERVED_MODEL_NAME=dense-q36' +``` + +Expected: exit `1`, because `DRY_RUN=1` does not print the served model name yet. + +### Task 2: Add `SERVED_MODEL_NAME` + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + +- [x] **Step 1: Document the variable** + +Add this line after `VLLM_MODEL`: + +```bash + SERVED_MODEL_NAME OpenAI model name used by llama-server, vLLM, and h2h (default: q36) +``` + +- [x] **Step 2: Add the default** + +Add this assignment after `VLLM_MODEL`: + +```bash +SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-q36} +``` + +- [x] **Step 3: Replace hardcoded h2h model names** + +Replace every h2h `--model q36` with: + +```bash +--model "$SERVED_MODEL_NAME" +``` + +- [x] **Step 4: Replace hardcoded vLLM model name and readiness check** + +Replace: + +```bash +--served-model-name q36 +wait_http "http://127.0.0.1:$VLLM_PORT/v1/models" "q36" ... +``` + +with: + +```bash +--served-model-name "$SERVED_MODEL_NAME" +wait_http "http://127.0.0.1:$VLLM_PORT/v1/models" "$SERVED_MODEL_NAME" ... +``` + +- [x] **Step 5: Print it in dry-run output** + +Add: + +```bash +log "served model: SERVED_MODEL_NAME=$SERVED_MODEL_NAME" +``` + +### Task 3: Verify the harness + +**Files:** +- Test: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + +- [x] **Step 1: Shell syntax check** + +```bash +bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Expected: exit `0`. 
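
The override pattern from Task 2 can be sketched in isolation. This is a minimal illustration, not the harness itself: `log` and `h2h_model_args` are hypothetical stand-ins, and only the `${SERVED_MODEL_NAME:-q36}` default and the dry-run log line come from the plan.

```bash
#!/usr/bin/env bash
# Sketch of the SERVED_MODEL_NAME override; not the real harness.
set -euo pipefail

# Fall back to q36 when the caller does not set SERVED_MODEL_NAME.
SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-q36}

# Hypothetical stand-in for the harness logger.
log() { printf '[snapshot] %s\n' "$*"; }

# Hypothetical helper: derive the client arguments from the single
# variable so both engines are asked for the same model name.
h2h_model_args() { printf -- '--model %s\n' "$SERVED_MODEL_NAME"; }

# Printed in dry-run output so the name can be audited before launch.
log "served model: SERVED_MODEL_NAME=$SERVED_MODEL_NAME"
h2h_model_args
```

Keeping a single variable as the only source of the name is what lets the dry-run line serve as an audit of what every later command will use.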
+ +- [x] **Step 2: Help-text green check** + +```bash +backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help | grep -F 'SERVED_MODEL_NAME' +``` + +Expected: exit `0`. + +- [x] **Step 3: DGX dry-run green check** + +```bash +ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase46_served_model_name_dryrun/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin ART=$ART NPL="1" PARALLEL=1 CTX=4096 PTOK=16 GEN=4 DRY_RUN=1 SERVED_MODEL_NAME=dense-q36 OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Expected: exit `0`, clean preflight, and output includes `SERVED_MODEL_NAME=dense-q36`. + +### Task 4: Record Phase46 + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` +- Modify: `docs/superpowers/plans/2026-07-01-served-model-name-phase46.md` + +- [x] **Step 1: Append the Phase46 result** + +Record that this is harness-only hardware-pivot readiness and cite the DGX dry-run artifact. + +- [x] **Step 2: Mark all completed plan items** + +Mark this file's remaining task checkboxes complete only after the corresponding command or docs update has happened. + +### Task 5: Commit + +**Files:** +- Commit Phase46 script, docs, and plan changes. + +- [x] **Step 1: Run final checks** + +```bash +git diff --check +git status --short +``` + +Expected: no whitespace errors; only intended files changed plus the pre-existing untracked `.claude/`. 
+ +- [x] **Step 2: Commit** + +```bash +git add backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh \ + backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +git add -f docs/superpowers/plans/2026-07-01-served-model-name-phase46.md +git commit -m "feat(paged): parameterize served model name" -m "Assisted-by: Codex:gpt-5" +``` diff --git a/docs/superpowers/plans/2026-07-01-serving-admission-trace-phase51.md b/docs/superpowers/plans/2026-07-01-serving-admission-trace-phase51.md new file mode 100644 index 000000000000..daf712056fec --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-serving-admission-trace-phase51.md @@ -0,0 +1,140 @@ +# Phase51 Serving Admission Trace Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Add an opt-in llama.cpp server trace that reports serving batch admission shape so dense high-N TTFT/aggregate gaps can be separated from true GPU decode speed. + +**Architecture:** Implement fork-first on `mudler/llama.cpp:localai-paged`. Keep inference behavior unchanged by gating the trace behind `LLAMA_SERVING_TRACE`. Add a small unit-tested formatter/accumulator and wire counters into `server_context_impl::pre_decode()` without changing scheduling predicates. + +**Tech Stack:** llama.cpp fork, `tools/server/server-context.cpp`, CMake unit test, DGX GB10 `build-cuda`, canonical md5 and backend-op gates. 
+ +--- + +### Task 1: Add red unit test + +**Files:** +- Modify: `/home/mudler/_git/llama.cpp/tests/CMakeLists.txt` +- Create: `/home/mudler/_git/llama.cpp/tests/test-server-admission-trace.cpp` + +- [x] **Step 1: Add the test target and assertions** + +Added `test-server-admission-trace.cpp`, asserting summary output includes +`steps`, `decode_only_steps`, `decode_tokens`, `prompt_tokens`, +`max_waiting_prompt_slots`, `started_prompt_slots`, `continued_prompt_slots`, +`last_n_batch`, `last_n_ubatch`, `last_prefill_budget_step`, and +`last_prefill_cap_per_slot`. + +- [x] **Step 2: Verify red** + +Run: + +```bash +cmake -S . -B build >/tmp/llama-phase51-cmake.log +cmake --build build --target test-server-admission-trace -j2 +``` + +Expected and observed: build failed because +`../tools/server/server-admission-trace.h` did not exist. + +### Task 2: Implement opt-in trace + +**Files:** +- Create: `/home/mudler/_git/llama.cpp/tools/server/server-admission-trace.h` +- Modify: `/home/mudler/_git/llama.cpp/tools/server/CMakeLists.txt` +- Modify: `/home/mudler/_git/llama.cpp/tools/server/server-context.cpp` + +- [x] **Step 1: Add accumulator and formatter** + +Added `server_admission_trace_step`, `server_admission_trace_totals`, +`server_admission_trace_accumulate()`, and `server_admission_trace_format()`. + +- [x] **Step 2: Wire counters into `pre_decode()`** + +`LLAMA_SERVING_TRACE=1` now tracks: + +- decode tokens already in the batch +- prompt tokens admitted this step +- waiting prompt slots seen by the prompt-admission loop +- started and continued prompt slots that actually admitted prompt tokens +- decode-only steps +- `n_batch`, `n_ubatch`, `prefill_budget_step`, and `prefill_cap_per_slot` + +The trace is printed once from `server_context_impl` destruction when enabled +and at least one step was observed. 
+ +### Task 3: Verify locally and on DGX + +**Files:** +- DGX artifact: `/home/mudler/bench/phase51_serving_admission_trace/20260701_110130` + +- [x] **Step 1: Run local unit and server build** + +Commands: + +```bash +cmake -S . -B build >/tmp/llama-phase51-cmake.log +cmake --build build --target test-server-admission-trace -j2 +./build/bin/test-server-admission-trace +cmake --build build --target llama-server -j2 +ctest --test-dir build -R '^test-server-admission-trace$' --output-on-failure +``` + +Observed: unit test passed, `llama-server` built, CTest passed. + +- [x] **Step 2: Apply patch to DGX mirror and build** + +Applied the local patch to `dgx:~/llama-phase6-source`, then ran: + +```bash +cmake -S . -B build-cuda +cmake --build build-cuda --target test-server-admission-trace llama-server -j2 +ctest --test-dir build-cuda -R '^test-server-admission-trace$' --output-on-failure +``` + +Observed: DGX CTest passed. + +- [x] **Step 3: Run canonical inference gate** + +Run: + +```bash +BIN=$HOME/llama-phase6-source/build-cuda/bin \ +ART=$HOME/bench/phase51_serving_admission_trace/20260701_110130/gate_post \ +OPS=MUL_MAT,MUL_MAT_ID \ + $HOME/paged-inference-gates.sh +``` + +Observed: + +- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0` +- dense md5 `5951a5b4d624ce891e22ab5fca9bc439` +- `MUL_MAT` `1146/1146` +- `MUL_MAT_ID` `806/806` + +### Task 4: Commit and mirror + +**Files:** +- Modify later: `backend/cpp/llama-cpp-localai-paged/patches/paged/` +- Modify later: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify later: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify later: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` + +- [x] **Step 1: Commit on the llama.cpp fork** + +Local fork commit: + +```text +c6cb8460e feat(server): trace serving admission batches +``` + +- [ ] **Step 2: Push fork branch** + +Blocked by policy: ask before every push. Do not push without explicit approval. 
+ +- [ ] **Step 3: Regenerate LocalAI patch series** + +Pending until the fork branch is pushed, per the fork-first mirror invariant. + +- [x] **Step 4: Record Phase51 status in LocalAI docs** + +Record the fork commit, DGX artifact, gates, and pending push/mirror state. diff --git a/docs/superpowers/plans/2026-07-01-serving-harness-readiness-phase48.md b/docs/superpowers/plans/2026-07-01-serving-harness-readiness-phase48.md new file mode 100644 index 000000000000..d8ec2b8f4128 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-serving-harness-readiness-phase48.md @@ -0,0 +1,165 @@ +# Phase48 Serving Harness Readiness Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Make the audited serving snapshot harness robust to slow vLLM dense startup and non-exiting server processes. + +**Architecture:** Keep the fix local to `paged-current-serving-snapshot.sh`: add separate llama/vLLM readiness budgets, bound each HTTP probe with `curl --max-time`, and replace unbounded server cleanup waits with a short graceful wait followed by `SIGKILL`. + +**Tech Stack:** Bash harness, DGX dry-run, LocalAI parity docs. + +--- + +### Task 1: Prove the robustness controls are absent + +**Files:** +- Test: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + +- [x] **Step 1: Run readiness-budget red check** + +```bash +backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help | grep -F 'VLLM_READY_ATTEMPTS' +``` + +Expected: exit `1`. + +- [x] **Step 2: Run bounded-curl red check** + +```bash +grep -F 'curl --max-time' backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Expected: exit `1`. 
+ +- [x] **Step 3: Run cleanup hard-kill red check** + +```bash +grep -F 'kill -9 "$SERVER_PID"' backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Expected: exit `1`. + +### Task 2: Patch readiness and cleanup + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + +- [x] **Step 1: Add documented environment variables** + +Add: + +```bash + LLAMA_READY_ATTEMPTS llama-server readiness attempts, one per second (default: 240) + VLLM_READY_ATTEMPTS vLLM readiness attempts, one per second (default: 600) +``` + +- [x] **Step 2: Add defaults** + +```bash +LLAMA_READY_ATTEMPTS=${LLAMA_READY_ATTEMPTS:-240} +VLLM_READY_ATTEMPTS=${VLLM_READY_ATTEMPTS:-600} +``` + +- [x] **Step 3: Bound HTTP probes** + +Change `wait_http()` to accept an attempts argument and run: + +```bash +curl --max-time 2 -fsS "$url" > "$health" 2>"$health.err" +``` + +- [x] **Step 4: Use per-server readiness budgets** + +Call `wait_http` with `$LLAMA_READY_ATTEMPTS` for llama-server and `$VLLM_READY_ATTEMPTS` for vLLM. + +- [x] **Step 5: Add bounded process cleanup** + +Create `stop_server_pid()` that sends `SIGTERM`, waits up to 30 seconds, sends `SIGKILL` if needed, and only then calls `wait`. + +### Task 3: Verify the harness fix + +**Files:** +- Test: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + +- [x] **Step 1: Shell syntax check** + +```bash +bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Expected: exit `0`. + +- [x] **Step 2: Help-text green check** + +```bash +backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help | grep -F 'VLLM_READY_ATTEMPTS' +``` + +Expected: exit `0`. + +- [x] **Step 3: Bounded-curl green check** + +```bash +grep -F 'curl --max-time 2 -fsS "$url"' backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Expected: exit `0`. 
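
The readiness budget and bounded cleanup described in Task 2 might look roughly like the sketch below. It is an illustration under assumptions: `wait_ready` is a hypothetical generic retry helper (the plan's `wait_http` also writes the response to a health file, which is omitted here), and the `stop_server_pid` body is one plausible reading of "SIGTERM, wait up to 30 seconds, SIGKILL if needed, then `wait`".

```bash
#!/usr/bin/env bash
# Sketch only: helper bodies are illustrative, not the harness's code.

# Run a probe command up to $1 times, one second apart.
# Example probe (flags as in the plan):
#   wait_ready "$VLLM_READY_ATTEMPTS" curl --max-time 2 -fsS "$url"
wait_ready() {
  local attempts=$1 i
  shift
  for ((i = 1; i <= attempts; i++)); do
    "$@" >/dev/null 2>&1 && return 0
    ((i < attempts)) && sleep 1
  done
  return 1
}

# Send SIGTERM, poll for up to $2 seconds (default 30), escalate to
# SIGKILL if the process is still alive, and only then reap it.
stop_server_pid() {
  local pid=$1 grace=${2:-30} i
  kill -TERM "$pid" 2>/dev/null || return 0   # process already gone
  for ((i = 0; i < grace; i++)); do
    kill -0 "$pid" 2>/dev/null || break
    sleep 1
  done
  kill -0 "$pid" 2>/dev/null && kill -9 "$pid" 2>/dev/null
  wait "$pid" 2>/dev/null || true
}
```

Calling `wait` only after the bounded poll is the point of the change: an unconditional `wait` on a server that ignores `SIGTERM` is what allowed the earlier cleanup to hang.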
+ +- [x] **Step 4: Cleanup hard-kill green check** + +```bash +grep -F 'kill -9 "$SERVER_PID"' backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Expected: exit `0`. + +- [x] **Step 5: DGX dry-run with long vLLM readiness budget** + +```bash +ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase48_readiness_harness_dryrun/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin MODEL=$HOME/bench/q36-27b-nvfp4.gguf VLLM_MODEL=$HOME/bench/q36-27b-nvfp4-vllm SERVED_MODEL_NAME=dense-q36 ART=$ART NPL="1" PARALLEL=1 CTX=4096 PTOK=16 GEN=4 DRY_RUN=1 VLLM_READY_ATTEMPTS=700 OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Expected: exit `0`, clean preflight, and dry-run output includes `VLLM_READY_ATTEMPTS=700`. + +### Task 4: Record Phase48 and failed Phase47 attempt + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` +- Modify: `docs/superpowers/plans/2026-07-01-dense-serving-snapshot-phase47.md` +- Modify: `docs/superpowers/plans/2026-07-01-serving-harness-readiness-phase48.md` + +- [x] **Step 1: Record Phase47 as failed/incomplete** + +Record the partial artifact and the root cause: vLLM dense startup exceeded the old 240-attempt readiness budget, and cleanup could hang waiting on the server PID. + +- [x] **Step 2: Record Phase48 fix** + +Record the new readiness variables, bounded curl probe, bounded cleanup, and dry-run artifact. + +### Task 5: Commit + +**Files:** +- Commit the Phase48 harness, docs, and plan changes. 
+ +- [x] **Step 1: Run final checks** + +```bash +git diff --check +git status --short +``` + +Expected: no whitespace errors; only intended files changed plus the pre-existing untracked `.claude/`. + +- [x] **Step 2: Commit** + +```bash +git add backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh \ + backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +git add -f docs/superpowers/plans/2026-07-01-dense-serving-snapshot-phase47.md \ + docs/superpowers/plans/2026-07-01-serving-harness-readiness-phase48.md +git commit -m "fix(paged): harden serving snapshot readiness" -m "Assisted-by: Codex:gpt-5" +``` diff --git a/docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md b/docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md new file mode 100644 index 000000000000..c91fccdfef12 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md @@ -0,0 +1,482 @@ +# Phase 8 Ragged MoE Dispatch Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Decide whether GB10 serving parity should target a fused routed-expert `MUL_MAT_ID` dispatch path for ragged MoE decode, then implement only if profiling proves the bucket is material. + +**Architecture:** Phase 8 is profile-gated. First decompose serving decode into routing compaction, activation quant/gather, grouped MMQ, scatter/fan-in, GDN, and FA buckets. Only if the `MUL_MAT_ID` routing/compaction/MMQ bucket expands materially in live ragged serving do we add a default-off fused-dispatch candidate in llama.cpp. 
+ +**Tech Stack:** llama.cpp CUDA backend, Nsight Systems, `/home/mudler/bench/bucket.py`, LocalAI paged patch mirror, GB10 DGX host `dgx.casa`. + +--- + +## Context + +Rejected Phase 7 shortcuts: + +- SWIGLU-down NVFP4 quantization fusion: focused op gate passed, but opt-in + paged-MoE md5 changed and serving A/B was flat. +- Post-down weighted-combine fan-in fusion: md5-safe and Nsight-proven to fire, + but serving A/B was flat (`decode_agg_tps 417.5 -> 417.0`). + +Deferred non-default work: + +- Backend sampler logit-bias upload caching is real but only applies to + `--backend-sampling` with request `backend_sampling: true` and non-empty + `logit_bias` or `ignore_eos`. It is not a default greedy parity lever. + +Selected Phase 8 candidate: + +- Fused routed-expert `MUL_MAT_ID` dispatch for ragged serving decode. +- This is distinct from fan-in-only fusion because it attacks the earlier chain: + `mm_ids_helper -> activation quant/gather -> grouped MMQ -> dst scatter`. + +## File Map + +- Read/profile only: + - `/home/mudler/_git/llama.cpp/src/llama-graph.cpp` + - `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/topk-moe.cu` + - `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmid.cu` + - `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmq.cu` + - `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmq.cuh` + - `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu` +- If promoted to source: + - Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmid.cu` + - Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmq.cu` + - Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmq.cuh` + - Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu` + - Test: `/home/mudler/_git/llama.cpp/tests/test-backend-ops.cpp` +- Tracking docs: + - Modify: + `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md` + - Modify: + 
`/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` + - Modify: + `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` + +## Required Safety Gates + +- Before DGX work: + - `docker ps -q | wc -l` must be `0`. + - no `local-ai-worker` container may be running. + - `nvidia-smi --query-compute-apps=pid --format=csv,noheader` must be empty. + - `~/gpu_bench_lock/owner` must be absent or start with `FREE`. +- Before keeping any source patch: + - MoE transcript md5 must be `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense transcript md5 must be `5951a5b4d624ce891e22ab5fca9bc439`. + - `test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1` must report `806/806`. + - If adding a specific ragged op test, it must include `n_expert=256`, + `n_expert_used=8`, single-token decode, empty experts, ragged expert loads, + and `ne2 > get_mmvq_mmid_max_batch(...)`. + - CUDA graph replay must still work with `LLAMA_MOE_FORCE_GRAPHS=1`. + - Source candidate must be default-off first, e.g. + `LLAMA_MOE_FUSED_DISPATCH=1`. + - No D2H id readback or new `cudaStreamSynchronize` may enter the decode path. + +## Task 1: Profile-Gate Ragged MoE Dispatch + +**Files:** +- Modify: + `docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md` +- Modify: + `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` + +- [x] **Step 1: Record Phase 8 scope** + + Write this plan and commit it before source work. 
+ +- [x] **Step 2: Reconfirm DGX idle state** + + Run: + + ```bash + ssh dgx.casa 'set -e + echo docker=$(docker ps -q | wc -l) + echo local_ai_worker=$(docker ps --format "{{.Names}}" | grep -c local-ai-worker || true) + echo compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l) + if [ -f ~/gpu_bench_lock/owner ]; then cat ~/gpu_bench_lock/owner; else echo FREE-no-lock-file; fi' + ``` + + Expected: + + ```text + docker=0 + local_ai_worker=0 + compute=0 + FREE... + ``` + +- [x] **Step 3: Run serving nsys for llama.cpp MoE** + + Run on DGX: + + ```bash + ssh dgx.casa 'cat > /tmp/phase8_llama_nsys.sh <<'"'"'SH'"'"' + #!/usr/bin/env bash + set -euo pipefail + ART=$HOME/bench/phase8_ragged_moe_dispatch/llama_n128 + BIN=$HOME/llama-phase6-source/build-cuda/bin + MOE=/home/mudler/bench/q36-35b-a3b-nvfp4.gguf + H2H=$HOME/bench/h2h_cli3.py + mkdir -p "$ART" + pkill -9 -f "[l]lama-server" 2>/dev/null || true + cd "$BIN" + env LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_CHUNK_MIN=1 GDN_TC=5 GGML_NO_BACKTRACE=1 \ + nsys profile --trace=cuda --sample=none --cpuctxsw=none --force-overwrite=true \ + -o "$ART/llama_n128" \ + ./llama-server -m "$MOE" -c 262144 --parallel 256 -b 2048 -ub 512 -ngl 99 -fa on \ + --host 127.0.0.1 --port 8092 --no-webui > "$ART/server.log" 2>&1 & + pid=$! 
+ for i in $(seq 1 360); do + curl -s -m2 http://127.0.0.1:8092/health | grep -q ok && break + kill -0 "$pid" 2>/dev/null || { tail -30 "$ART/server.log"; exit 1; } + sleep 1 + done + python3 "$H2H" --url http://127.0.0.1:8092/v1/completions --model q36 -n 8 --ptok 128 --gen 32 \ + > "$ART/warmup.json" 2> "$ART/warmup.err" || true + python3 "$H2H" --url http://127.0.0.1:8092/v1/completions --model q36 -n 128 --ptok 128 --gen 64 \ + > "$ART/client_n128.json" 2> "$ART/client_n128.err" + kill "$pid" 2>/dev/null || true + for i in $(seq 1 60); do kill -0 "$pid" 2>/dev/null || break; sleep 1; done + kill -9 "$pid" 2>/dev/null || true + python3 $HOME/bench/bucket.py "$ART/llama_n128.nsys-rep" llama_phase8_n128 > "$ART/buckets.txt" + SH + bash /tmp/phase8_llama_nsys.sh' + ``` + + Expected: + + - `client_n128.json` contains `decode_agg_tps`, `decode_perseq_tps`, and + `prefill_tps`. + - `buckets.txt` has fine rows for `mm_ids`, `gather_mmq`, `act_quant`, + `mmq_nvfp4`, `set_rows`, `ew_add`, `gdn_core`, and `fa`. + + Result: + + - Artifact: `/home/mudler/bench/phase8_ragged_moe_dispatch/llama_n128_clean/`. + - Throughput: `decode_agg_tps=412.1`, `decode_perseq_tps=2.70`, + `prefill_tps=1368.3`. + - Clean rebuild was required before this run; the first `llama_n128/` profile + still contained the rejected weighted-combine kernel in the binary and is + not used for the decision. + - Bucket highlights: + - GDN: `4680.27 ms`, `38.12%`. + - `mmq_nvfp4`: `2745.11 ms`, `22.36%`. + - `act_quant`: `441.42 ms`, `3.60%`. + - MoE dispatch: `183.67 ms`, `1.50%`. + - `mm_ids`: `80.92 ms`, `0.66%`. + - `gather_mmq`: `50.96 ms`, `0.42%`. + - `ew_add`: `280.15 ms`, `2.28%`. 
+ +- [x] **Step 4: Run serving nsys for vLLM MoE** + + Run on DGX: + + ```bash + ssh dgx.casa 'cat > /tmp/phase8_vllm_nsys.sh <<'"'"'SH'"'"' + #!/usr/bin/env bash + set -euo pipefail + ART=$HOME/bench/phase8_ragged_moe_dispatch/vllm_n128 + MODEL=/home/mudler/bench/q36-35b-a3b-nvfp4-vllm + H2H=$HOME/bench/h2h_cli3.py + mkdir -p "$ART" + pkill -9 -u "$(id -u)" -f "[v]llm serve" 2>/dev/null || true + export PATH="$HOME/vllm-bench/bin:$PATH" + export VLLM_LOGGING_LEVEL=INFO + export HF_HUB_OFFLINE=1 + nsys profile --trace=cuda --sample=none --cpuctxsw=none --force-overwrite=true \ + -o "$ART/vllm_n128" \ + "$HOME/vllm-bench/bin/vllm" serve "$MODEL" --served-model-name q36 \ + --gpu-memory-utilization 0.85 --max-model-len 4096 --max-num-seqs 256 \ + --host 127.0.0.1 --port 8002 --tensor-parallel-size 1 > "$ART/server.log" 2>&1 & + pid=$! + for i in $(seq 1 420); do + curl -s -m2 http://127.0.0.1:8002/v1/models | grep -q q36 && break + kill -0 "$pid" 2>/dev/null || { tail -40 "$ART/server.log"; exit 1; } + sleep 1 + done + python3 "$H2H" --url http://127.0.0.1:8002/v1/completions --model q36 -n 8 --ptok 128 --gen 32 \ + > "$ART/warmup.json" 2> "$ART/warmup.err" || true + python3 "$H2H" --url http://127.0.0.1:8002/v1/completions --model q36 -n 128 --ptok 128 --gen 64 \ + > "$ART/client_n128.json" 2> "$ART/client_n128.err" + kill "$pid" 2>/dev/null || true + for i in $(seq 1 80); do kill -0 "$pid" 2>/dev/null || break; sleep 1; done + kill -9 "$pid" 2>/dev/null || true + python3 $HOME/bench/bucket.py "$ART/vllm_n128.nsys-rep" vllm_phase8_n128 > "$ART/buckets.txt" + SH + bash /tmp/phase8_vllm_nsys.sh' + ``` + + Expected: + + - `client_n128.json` contains comparable throughput. + - `buckets.txt` has vLLM rows for `vllm_dispatch`, `vllm_fp4_gemm`, + `vllm_fa`, and `fla_gdn`. + + Result: + + - Artifact: `/home/mudler/bench/phase8_ragged_moe_dispatch/vllm_n128/`. + - Throughput: `decode_agg_tps=1036.6`, `decode_perseq_tps=7.02`, + `prefill_tps=5277.7`. 
+ - Nsight includes startup/autotune and `delayStreamKernel`, so the aggregate + vLLM macro percentages are not directly comparable to llama.cpp. Direct + kernel extraction still shows Marlin-MoE rows around `1.0 s` and + `moe_align/topk/count` rows around `38.5 ms` in the full capture. + +- [x] **Step 5: Decide promotion** + + Promote to source only if all are true: + + - llama.cpp `MoE-dispatch` plus `MoE/FFN-GEMM` fine rows are a materially + larger share than expected from Phase 6 or worse than vLLM on the same + serving shape. + - `mm_ids`, `gather_mmq`, `act_quant`, or grouped `mmq_nvfp4` is a clear + target, not hidden by GDN or FA. + - Serving throughput gap is still visible in the same profile. + + Reject or defer if: + + - GDN remains the dominant gap. + - FA prefill dominates the profiled window. + - MoE dispatch is too small to beat a `+5%` serving A/B gate. + + Decision: + + - Promote to Task 2 test-gate work, not production source work yet. + - Rationale: standalone `mm_ids` and `gather_mmq` are small, but the live + ragged path around `mmq_nvfp4 + act_quant + MoE-dispatch + fan-in` is + roughly `29.7%` of llama.cpp kernel time. vLLM throughput is still much + higher on the same client shape. A production patch is only justified after + a ragged `MUL_MAT_ID` test gate exists and after the source prototype can + reduce the grouped-MMQ/activation movement bucket, not merely the helper + kernels. + - GDN remains the single largest bucket, so any Phase 8 source patch still + must clear the `+5%` serving A/B gate before being kept. 
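
The two figures behind this decision can be recomputed from the recorded results. The sketch below assumes that the `ew_add` row (`2.28%`) is the fan-in term of the sum; the plan does not state that mapping explicitly, although the four rows do add up to 29.74%, consistent with the "roughly `29.7%`" quoted above.

```bash
#!/usr/bin/env bash
# Recompute two Phase 8 figures from the recorded bucket and client data.
set -euo pipefail

# Kernel-time shares (%) from the llama.cpp bucket highlights:
# mmq_nvfp4 + act_quant + MoE dispatch + ew_add.
# Assumption: ew_add stands in for the fan-in term.
ragged_share=$(LC_ALL=C awk 'BEGIN { printf "%.2f", 22.36 + 3.60 + 1.50 + 2.28 }')

# decode_agg_tps on the same client shape: vLLM 1036.6, llama.cpp 412.1.
tps_ratio=$(LC_ALL=C awk 'BEGIN { printf "%.2f", 1036.6 / 412.1 }')

echo "ragged path share: ${ragged_share}%"
echo "vLLM / llama.cpp decode_agg_tps: ${tps_ratio}x"
```

Note that the helper rows alone (`mm_ids` at `0.66%` plus `gather_mmq` at `0.42%`) are far below the `+5%` gate, which is why the decision targets the grouped-MMQ and activation-movement bucket rather than the metadata kernels.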
+ +- [x] **Step 6: Commit the profile decision** + + If promoted: + + ```bash + git add docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md \ + backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md + git commit -m "docs(paged): scope ragged MoE dispatch phase" \ + -m "Assisted-by: Codex:gpt-5" + ``` + + If rejected: + + ```bash + git add docs/superpowers/plans/2026-07-01-serving-ragged-moe-phase8.md \ + backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md + git commit -m "docs(paged): reject ragged MoE dispatch phase" \ + -m "Assisted-by: Codex:gpt-5" + ``` + + Result: + + - Committed the profile decision as `89ef3a402` + (`docs(paged): record ragged MoE profile gate`). + - The follow-up test gate landed as fork commit `e21732fc4` and LocalAI + mirror commit `b009de0ee`. + - The source shortcut rejection landed as `b862e2c56` + (`docs(paged): stop ragged dispatch source shortcut`). + +## Task 2: Add Ragged `MUL_MAT_ID` Test Gate If Promoted + +**Files:** +- Modify: `/home/mudler/_git/llama.cpp/tests/test-backend-ops.cpp` +- Mirror patch under: + `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/` + +- [x] **Step 1: Add a test-only fork patch** + + Add a `MUL_MAT_ID_RAGGED_MOE` whole-graph test that exercises: + + - `type_a=nvfp4` + - `n_mats=256` + - `n_used=8` + - `n_tokens in {1, 8, 33, 128, 257}` + - explicitly empty experts and high skew into 1-row experts + + Result: + + - Fork commit: `e21732fc4` (`test(paged): cover ragged MoE dispatch`). + - LocalAI patch: + `0053-test-paged-cover-ragged-MoE-dispatch.patch`. + - Coverage: + - one small F32 wiring case, + - NVFP4 with `n_mats=256`, `n_used=8`, `m=768`, `k=2048`, + `n in {1, 8, 33, 128, 257}`. 
+ - deterministic unique top-k ids skewed toward hot experts, including + expert `255`, with many empty experts. + +- [x] **Step 2: Run red/green if the test exposes a missing path** + + Run: + + ```bash + ./build-cuda/bin/test-backend-ops test -b CUDA0 -o MUL_MAT_ID_RAGGED_MOE -j 1 + ``` + + Expected after adding only the test: + + - Existing path should pass. If it fails, stop and debug before production + code. + + Result: + + - Initial test failed because the first deterministic ID pattern created + duplicate expert IDs within the same token, which is not a valid top-k + routing shape. The corrected gate preserves unique expert IDs per token. + - DGX artifact: + `/home/mudler/bench/phase8_ragged_moe_dispatch/test_backend_ops_mul_mat_id_ragged_moe_fixed.txt`. + - Result: `MUL_MAT_ID_RAGGED_MOE` `6/6` on CUDA0. + +- [x] **Step 3: Mirror the test patch** + + Generate with: + + ```bash + git format-patch -1 --stdout > /tmp/0053-test-paged-cover-ragged-MoE-dispatch.patch + ``` + + Copy into LocalAI only after checking patch order. + + Result: + + - Patch `0053-test-paged-cover-ragged-MoE-dispatch.patch` added after + `0052-test-paged-cover-MoE-weighted-combine-chain.patch`. + +## Task 3: Default-Off Fused Dispatch Prototype If Promoted + +**Files:** +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmid.cu` +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmq.cu` +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmq.cuh` +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu` + +**Status:** Rejected before production CUDA edits. The profile and code +inspection do not justify a metadata/helper-only prototype. + +Inspection result: + +- `ggml_cuda_mul_mat_q()` already runs the ids path as + `mm_ids_helper -> quantize/gather -> grouped MMQ`. +- For native FP4 MoE with broadcast activations (`ne11 == 1`), patch `0023` + already quantizes unique tokens once and gathers FP4 blocks: + `quantize_mmq_fp4_cuda(... 
ids=nullptr ...)` followed by + `gather_mmq_fp4_cuda(...)`. +- The live serving profile shows `mm_ids` at `0.66%` and `gather_mmq` at + `0.42%`, while `mmq_nvfp4` is `22.36%` and `act_quant` is `3.60%`. +- Therefore a safe Phase 8 production patch must change grouped-MMQ execution + shape or activation movement. A default-off hook that only skips or repacks + metadata is not expected to clear the `+5%` serving A/B gate. + +Stop condition: + +- Do not edit production CUDA for Phase 8 until there is a concrete design for + reducing `mmq_nvfp4` or `act_quant` time without D2H id readback, new stream + synchronizations, or md5 drift. + +- [x] **Step 1: Add env-gated entry point** + + Decision: not implemented. Adding a default-off env hook without a concrete + `mmq_nvfp4` or activation-movement reduction would add patch-stack conflict + surface while preserving the same slow path. + + Add a default-off env gate: + + ```cpp + static bool ggml_cuda_moe_fused_dispatch_enabled() { + static const bool enabled = [] { + const char * e = getenv("LLAMA_MOE_FUSED_DISPATCH"); + return e != nullptr && std::atoi(e) != 0; + }(); + return enabled; + } + ``` + + The default path must remain byte-identical and use the existing + `ggml_cuda_mul_mat_id` implementation. + +- [x] **Step 2: Add the smallest measurable fused metadata path** + + Decision: not implemented. The live profile puts the metadata helpers below + the `+5%` serving A/B threshold (`mm_ids=0.66%`, `gather_mmq=0.42%`), and + patch `0023` already avoids repeated activation quantization for the + broadcast-activation NVFP4 MoE case. + + Start by replacing repeated host/device metadata setup only when all are true: + + - CUDA backend. + - `src0->type == GGML_TYPE_NVFP4`. + - `ids` are already device-resident. + - decode-ish `src1->ne[1] <= 128`. + - no D2H id readback. + + If this cannot be done without syncs, stop and reject the prototype. 
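
The Step 2 preconditions above can be read as one host-side predicate. The sketch below is illustrative only, since the prototype was rejected and never implemented: `fused_dispatch_inputs` and `fused_metadata_path_eligible` are hypothetical names, and the boolean fields stand in for checks that real code would make against the ggml tensors.

```cpp
// Hypothetical sketch, not fork code: restates the Step 2 preconditions as a
// single predicate so that every reject condition is explicit.
#include <cstdint>

struct fused_dispatch_inputs {
    bool    is_cuda_backend;        // running on the CUDA backend
    bool    src0_is_nvfp4;          // stands in for src0->type == GGML_TYPE_NVFP4
    bool    ids_on_device;          // ids are already device-resident
    int64_t src1_ne1;               // stands in for src1->ne[1]
    bool    needs_d2h_id_readback;  // true if the path would copy ids to the host
};

constexpr bool fused_metadata_path_eligible(const fused_dispatch_inputs & in) {
    return in.is_cuda_backend
        && in.src0_is_nvfp4
        && in.ids_on_device
        && in.src1_ne1 <= 128          // the "decode-ish" bound from the list
        && !in.needs_d2h_id_readback;  // any readback rejects the prototype
}
```

Any call that fails the predicate would stay on the existing `ggml_cuda_mul_mat_id` path, which keeps the default output byte-identical.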
+ +- [x] **Step 3: Run gates** + + Rerun result from the unchanged production path: + + - Artifact: + `/home/mudler/bench/phase8_ragged_moe_dispatch/safety_rerun_20260701_035549/` + - MoE md5: `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense md5: `5951a5b4d624ce891e22ab5fca9bc439`. + - Full `MUL_MAT_ID`: `806/806` on CUDA0. + - Specific `MUL_MAT_ID_RAGGED_MOE`: `6/6` on CUDA0, rerun artifact + `/home/mudler/bench/phase8_ragged_moe_dispatch/ragged_gate_rerun_20260701_035529.txt`. + + Run on DGX: + + ```bash + ./test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1 + ``` + + Expected: `806/806`. + + Run transcript gates: + + ```bash + env LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_CHUNK_MIN=1 GDN_TC=5 GGML_NO_BACKTRACE=1 \ + ./llama-completion -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -ngl 99 -fa on -c 4096 \ + --temp 0 --seed 1 -n 48 -p "The capital of France is" **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Add a default-off, md5-safe classifier/trace for decode-small-M MoE grouped-MMQ candidates before building any alternate numeric kernel. + +**Architecture:** Extend the existing host-only MMQ trace helper with a pure small-M predicate and format helper. Wire a bounded `[LLAMA_MOE_MMQ_SMALL_M]` trace in `mul_mat_q_case` after `mmq_x_best` is selected, using a separate env `LLAMA_MOE_MMQ_SMALL_M_TRACE=` so normal shape tracing behavior remains unchanged. + +**Tech Stack:** llama.cpp CUDA backend, host-only C++ unit test, LocalAI paged patch series, DGX GB10 md5/op gates. + +--- + +## Files + +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmq-shape-trace.h` + - Add `ggml_cuda_mmq_small_m_shape`, make/format helpers, and candidate predicate. 
+- Modify: `/home/mudler/_git/llama.cpp/tests/test-cuda-mmq-shape-trace.cpp` + - Add RED/GREEN assertions for decode-like inclusion and prefill/dense exclusion. +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/mmq.cuh` + - Add `LLAMA_MOE_MMQ_SMALL_M_TRACE=` parser and bounded trace emission. +- Create: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/patches/paged/0058-feat-cuda-trace-moe-small-m-mmq-candidates.patch` +- Modify docs: README, GB10 results, lever map, handoff, patch maintenance, and this plan. + +## Checklist + +- [x] **Step 1: RED host test** + - Add test calls to `ggml_cuda_mmq_small_m_shape_make` and assert candidate true for `is_moe=true`, `ncols_dst=1024`, `nchannels_x=256`, `ncols_max=128`, `mmq_x_best=64`, `use_stream_k=true`. + - Assert false for dense (`is_moe=false`), prefill (`ncols_max=512`), high density (`ncols_dst=4096`), large tile (`mmq_x_best=128`), and no stream-k. + - Run: `cmake --build build --target test-cuda-mmq-shape-trace -j 4`. + - Expected: compile failure because the helper does not exist. + +- [x] **Step 2: GREEN host helper** + - Add helper structs/functions in `mmq-shape-trace.h`. + - Run: `cmake --build build --target test-cuda-mmq-shape-trace -j 4 && ./build/bin/test-cuda-mmq-shape-trace`. + - Expected: pass. + +- [x] **Step 3: Wire default-off trace** + - Add `ggml_cuda_moe_mmq_small_m_trace_limit()`. + - Emit `[LLAMA_MOE_MMQ_SMALL_M]` only when `args.expert_bounds != nullptr`, helper says candidate, and the trace limit allows it. + - No numeric branch or tile change in this patch. + +- [x] **Step 4: DGX build and gates** + - Build `llama-server`, `llama-completion`, `test-backend-ops`, and `test-cuda-mmq-shape-trace`. + - Run default-off gates: MoE `8cb0ce23777bf55f92f63d0292c756b0`, dense `5951a5b4d624ce891e22ab5fca9bc439`, `MUL_MAT_ID 806/806`. 
+ - Run trace-enabled gates with `EXTRA_ENV=LLAMA_MOE_MMQ_SMALL_M_TRACE=4`; expected same md5/op values and four small-M trace lines from MoE only. + +- [x] **Step 5: n128 serving count** + - Run h2h n128 with `LLAMA_MOE_MMQ_SMALL_M_TRACE=4096`. + - Parse small-M lines and compare count to Phase 30/31 decode-like launch count. + - Run post-serving gates. + +- [x] **Step 6: Mirror and docs** + - Commit fork, generate LocalAI patch `0058`. + - Verify strict patch-series tree equals fork tree. + - Update docs and mark this checklist complete with artifact path and decision. + - Commit LocalAI with `Assisted-by: Codex:gpt-5`. + +## Result + +- Fork commit: `/home/mudler/_git/llama.cpp` `2a9964d29 feat(cuda): trace moe small-m mmq candidates`. +- DGX mirror commit: `dgx:~/llama-phase6-source` `024f494d0 feat(cuda): trace moe small-m mmq candidates`. +- Artifact: `/home/mudler/bench/phase32_small_m_classifier/20260701_070127`. +- RED verified: `cmake --build build --target test-cuda-mmq-shape-trace -j 4` failed on missing `ggml_cuda_mmq_small_m_shape`. +- GREEN verified locally: `cmake --build build --target test-cuda-mmq-shape-trace -j 4 && ./build/bin/test-cuda-mmq-shape-trace`. +- DGX CUDA build verified: `llama-server`, `llama-completion`, `test-backend-ops`, and `test-cuda-mmq-shape-trace`. +- Default-off, trace-enabled, and post-serving gates all matched MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. +- n128 traced serving: `decode_agg_tps=689.0`, `agg_tps=343.9`, `prefill_tps=1566.5`, `TTFT mean=7849.0 ms`. +- Small-M candidate trace: `4096` candidate calls in the first serving trace window. + - `mmq_x_best`: `64` 1800, `48` 1096, `40` 360, `32` 360, `16` 360, `24` 120. + - density: `4` 1440, `3` 1336, `1` 840, `2` 480. + +Decision: Phase 33 can A/B a default-off small-M tile policy, with `mmq_x=16` and possibly `8` as the first candidates. 
The classifier shows enough live candidate coverage to justify an opt-in tile-policy experiment, while preserving the existing MMQ path and md5 gates. diff --git a/docs/superpowers/plans/2026-07-01-small-m-tile-policy-phase33.md b/docs/superpowers/plans/2026-07-01-small-m-tile-policy-phase33.md new file mode 100644 index 000000000000..3106bfbfa141 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-small-m-tile-policy-phase33.md @@ -0,0 +1,56 @@ +# Small-M Tile Policy Phase 33 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** A/B a default-off MoE-only small-M tile policy using Phase 32 candidate criteria, starting with `LLAMA_MOE_SMALL_M_TILE=16`. + +**Architecture:** Add a narrow host-side override in `mul_mat_q_case`: after the normal MoE density auto-tile logic, if `LLAMA_MOE_SMALL_M_TILE=` is set and the call is decode-like (`ncols_max <= 128`, density `<=4`, stream-k), cap `mmq_x_lim` to that tile. The existing MMQ kernels and launch path remain unchanged; unsupported/default cases fall through unchanged. + +**Tech Stack:** llama.cpp CUDA backend, host-only selector tests, DGX GB10 md5/op gates and n128 h2h serving A/B. + +--- + +## Checklist + +- [x] **Step 1: RED selector test** + - Add host helper assertions for `ggml_cuda_mmq_small_m_tile_limit`. + - Expected: compile failure before helper exists. + +- [x] **Step 2: GREEN helper** + - Implement helper in `mmq-shape-trace.h`. + - Local test passes. + +- [x] **Step 3: Wire env policy** + - Add `LLAMA_MOE_SMALL_M_TILE`. + - Apply only to MoE grouped-MMQ small-M candidates. + - Default path unchanged. + +- [x] **Step 4: DGX gates** + - Build CUDA targets. + - Run default-off gates. + - Run `EXTRA_ENV=LLAMA_MOE_SMALL_M_TILE=16` gates. 
+ +- [x] **Step 5: n128 A/B** + - Same-session baseline vs `LLAMA_MOE_SMALL_M_TILE=16`, h2h n128. + - Post-serving gates. + +- [x] **Step 6: Mirror/docs** + - Generate patch `0059`. + - Strict patch-series tree check. + - Update docs and commit LocalAI. + +## Result + +- Fork commit: `/home/mudler/_git/llama.cpp` `fbed2abaa feat(cuda): gate moe small-m mmq tile policy`. +- DGX mirror commit: `dgx:~/llama-phase6-source` `dfd1eaea8 feat(cuda): gate moe small-m mmq tile policy`. +- Artifact: `/home/mudler/bench/phase33_small_m_tile_policy/20260701_071136`. +- RED verified: `cmake --build build --target test-cuda-mmq-shape-trace -j 4` failed on missing `ggml_cuda_mmq_small_m_tile_limit`. +- GREEN verified locally: `cmake --build build --target test-cuda-mmq-shape-trace -j 4 && ./build/bin/test-cuda-mmq-shape-trace`. +- DGX CUDA build verified: `llama-server`, `llama-completion`, `test-backend-ops`, and `test-cuda-mmq-shape-trace`. +- Default-off, tile16, tile8, and post-serving gates all matched MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`, dense md5 `5951a5b4d624ce891e22ab5fca9bc439`, and `MUL_MAT_ID` `806/806`. +- Same-session n128 serving: + - baseline: `decode_agg_tps=672.1`, `agg_tps=339.5`, `prefill_tps=1511.4`. + - `LLAMA_MOE_SMALL_M_TILE=16`: `decode_agg_tps=640.3`, `agg_tps=328.9`, `prefill_tps=1522.2`, ratio `0.953x`. + - `LLAMA_MOE_SMALL_M_TILE=8`: `decode_agg_tps=583.2`, `agg_tps=307.4`, `prefill_tps=1442.6`, ratio `0.868x`. + +Decision: reject smaller `mmq_x` caps for the classified n128 small-M calls. They are md5/op safe but slower. The next structural direction must not be a simple smaller tile cap; it needs a different kernel shape or a different target bucket. 
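
For readers reconstructing what was A/B tested, the decode-like criteria and the tile cap can be sketched as host-only selectors. This is a minimal sketch under stated assumptions, not the fork's `mmq-shape-trace.h`: the names `small_m_shape`, `small_m_density`, `small_m_decode_like`, `small_m_candidate`, and `small_m_tile_limit` are hypothetical, density is assumed to be destination columns per expert rounded up, and the large-tile cutoff is assumed to be `mmq_x_best <= 64` because the plan only states that `64` qualifies and `128` does not.

```cpp
// Hypothetical sketch of the Phase 32/33 host-side selectors. Names and exact
// thresholds are assumptions inferred from the plan text, not the fork code.
#include <cstdint>

// Field names follow the test arguments listed in the Phase 32 plan.
struct small_m_shape {
    bool    is_moe;
    int64_t ncols_dst;
    int64_t nchannels_x;
    int64_t ncols_max;
    int     mmq_x_best;
    bool    use_stream_k;
};

// Assumed definition: destination columns per expert channel, rounded up.
constexpr int64_t small_m_density(const small_m_shape & s) {
    return s.nchannels_x > 0 ? (s.ncols_dst + s.nchannels_x - 1) / s.nchannels_x : 0;
}

// Decode-like test from the Phase 33 architecture note.
constexpr bool small_m_decode_like(const small_m_shape & s) {
    return s.is_moe && s.use_stream_k && s.ncols_max <= 128 && small_m_density(s) <= 4;
}

// Phase 32 classifier: decode-like and below the large tile (assumed <= 64).
constexpr bool small_m_candidate(const small_m_shape & s) {
    return small_m_decode_like(s) && s.mmq_x_best <= 64;
}

// Phase 33 policy: env_tile <= 0 means unset, so the limit passes through.
// The cap can only lower the limit, never raise it.
constexpr int small_m_tile_limit(const small_m_shape & s, int env_tile, int mmq_x_lim) {
    return (env_tile > 0 && small_m_decode_like(s) && env_tile < mmq_x_lim) ? env_tile : mmq_x_lim;
}
```

Because the policy only lowers `mmq_x_lim` for calls that pass the decode-like test, dense and prefill calls fall through unchanged, which is consistent with the md5 and op gates matching in every variant.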
diff --git a/docs/superpowers/plans/2026-07-01-snapshot-gate-summary-phase25.md b/docs/superpowers/plans/2026-07-01-snapshot-gate-summary-phase25.md new file mode 100644 index 000000000000..86b937071cc3 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-snapshot-gate-summary-phase25.md @@ -0,0 +1,121 @@ +# Snapshot Gate Summary Phase 25 Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use +> superpowers:verification-before-completion before recording the phase result. +> Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** make current-stack paged-vs-vLLM serving artifacts prove that +inference md5/op gates stayed green without requiring a full log read. + +**Architecture:** extend the existing current serving snapshot harness with a +compact gate-summary writer. Keep it additive and outside llama.cpp source: no +patch-series change and no inference behavior change. + +**Tech Stack:** Bash, Python stdlib, existing `paged-inference-gates.sh` +artifacts. + +--- + +## Task 1: Red Check + +- [x] **Step 1: Prove Phase 20 lacks compact gate proof** + + Command: + + ```bash + ssh dgx.casa 'test -e ~/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv' + ``` + + Result: + + - exited `1` before the patch, while `gate_pre/`, `gate_post/`, and full gate + logs existed. 
+ +## Task 2: Add Gate Summary + +- [x] **Step 1: Extend `paged-current-serving-snapshot.sh`** + + File: + + - `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + + Behavior: + + - writes `$ART/gate_summary.tsv` after the post gate in a full serving run; + - records pre/post MoE md5, dense md5, and backend op status; + - compares MoE against `8cb0ce23777bf55f92f63d0292c756b0`; + - compares dense against `5951a5b4d624ce891e22ab5fca9bc439`; + - parses op pass counts such as `806/806 tests passed`; + - exits non-zero if an existing gate artifact is missing, mismatched, or not + fully passing; + - supports `--summarize-gates ART` to audit existing artifacts without running + servers. + +## Task 3: Verify + +- [x] **Step 1: Local syntax/help checks** + + Commands: + + ```bash + bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh + backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help + ``` + + Result: + + - both passed. + +- [x] **Step 2: Backfill Phase 20 gate summary** + + Command: + + ```bash + /tmp/paged-current-serving-snapshot.sh \ + --summarize-gates ~/bench/phase20_current_snapshot/20260701_050621 + ``` + + Result: + + - wrote `/home/mudler/bench/phase20_current_snapshot/20260701_050621/gate_summary.tsv`; + - pre/post MoE md5 rows were `ok`; + - pre/post dense md5 rows were `ok`; + - pre/post `MUL_MAT_ID` rows were `ok` with `806/806`. + +- [x] **Step 3: DGX dry run** + + Command: + + ```bash + DRY_RUN=1 ART=~/bench/phase25_gate_summary_dryrun/20260701_053353 \ + /tmp/paged-current-serving-snapshot.sh + ``` + + Result: + + - preflight verified `docker=0`, `local_ai_worker=0`, `compute=0`; + - `hardware.txt` was still written; + - no paged or vLLM server launched; + - no `gate_summary.tsv` was written before gates existed. 
+ + Artifact: + + - `/home/mudler/bench/phase25_gate_summary_dryrun/20260701_053353` + +## Task 4: Record Result + +- [x] **Step 1: Update parity docs** + + Updated files: + + - `backend/cpp/llama-cpp-localai-paged/README.md` + - `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` + - `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` + - `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` + +## Self-Review + +- No llama.cpp source behavior changed. +- Future full snapshots now contain compact proof of pre/post md5 and op gates. +- The summary-only mode lets old artifacts be audited without consuming GPU + benchmark time. diff --git a/docs/superpowers/plans/2026-07-01-snapshot-hardware-report-phase24.md b/docs/superpowers/plans/2026-07-01-snapshot-hardware-report-phase24.md new file mode 100644 index 000000000000..411ce1a2373d --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-snapshot-hardware-report-phase24.md @@ -0,0 +1,112 @@ +# Snapshot Hardware Report Phase 24 Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use +> superpowers:verification-before-completion before recording the phase result. +> Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** make current-stack paged-vs-vLLM serving snapshots record the hardware +class so GB10/workstation Blackwell results are not confused with future +datacenter-Blackwell parity runs. + +**Architecture:** extend the existing current serving snapshot harness with a +small pre-server hardware report. Keep it additive and outside llama.cpp source: +no patch-series change, no inference behavior change, and no GPU server launch +in dry-run mode. + +**Tech Stack:** Bash, `nvidia-smi`, DGX GB10. 
+ +--- + +## Task 1: Red Check + +- [x] **Step 1: Prove the previous dry-run artifact lacks hardware identity** + + Command: + + ```bash + ssh dgx.casa 'test -e ~/bench/phase21_harness_dryrun/20260701_051757/hardware.txt' + ``` + + Result: + + - exited `1`, confirming the existing harness did not write a hardware report. + +## Task 2: Add Hardware Report + +- [x] **Step 1: Extend `paged-current-serving-snapshot.sh`** + + File: + + - `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + + Behavior: + + - writes `$ART/hardware.txt` immediately after preflight; + - records `nvidia-smi -L`; + - records GPU name, driver, memory, and compute capability when available; + - falls back if `compute_cap` is unavailable in `nvidia-smi`; + - classifies hardware as `datacenter_blackwell`, `datacenter_other`, + `gb10_or_workstation_blackwell`, or `unknown`; + - writes a parity note for the detected hardware class; + - runs in `DRY_RUN=1` before the script exits. + +## Task 3: Verify + +- [x] **Step 1: Local syntax/help checks** + + Commands: + + ```bash + bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh + backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh --help + ``` + + Result: + + - both passed. + +- [x] **Step 2: DGX dry run** + + Command: + + ```bash + DRY_RUN=1 ART=~/bench/phase24_hardware_report_dryrun/20260701_052741 \ + /tmp/paged-current-serving-snapshot.sh + ``` + + Result: + + - preflight verified `docker=0`, `local_ai_worker=0`, `compute=0`; + - no paged or vLLM server launched; + - `hardware.txt` was written. 
+ + Artifact: + + - `/home/mudler/bench/phase24_hardware_report_dryrun/20260701_052741` + + Hardware report: + + ```text + GPU 0: NVIDIA GB10 + driver=580.159.03 + compute_cap=12.1 + hardware_class=gb10_or_workstation_blackwell + ``` + +## Task 4: Record Result + +- [x] **Step 1: Update parity docs** + + Updated files: + + - `backend/cpp/llama-cpp-localai-paged/README.md` + - `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` + - `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` + - `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` + +## Self-Review + +- No llama.cpp source behavior changed. +- The harness remains dry-run safe. +- Future snapshot artifacts now carry enough hardware identity to separate GB10 + closure evidence from datacenter-Blackwell parity evidence. diff --git a/docs/superpowers/plans/2026-07-01-target-reconciliation-phase42.md b/docs/superpowers/plans/2026-07-01-target-reconciliation-phase42.md new file mode 100644 index 000000000000..2b401839f193 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-target-reconciliation-phase42.md @@ -0,0 +1,108 @@ +# Target Reconciliation Phase42 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Reconcile the post-Phase41 target list so the next parity phase does not chase a closed D1/GDN/W4A16 premise. + +**Architecture:** Use read-only parallel subagent analysis over D1 graph capture, GDN prefill, and W4A16/MoE prefill GEMM. Record the resulting target decision in the parity docs. + +**Tech Stack:** LocalAI docs, llama.cpp patch mirrors, `/home/mudler/_git/llama.cpp` fork, Git. 
+ +--- + +### Task 1: Run Parallel Target Reviews + +**Files:** +- Read: `backend/cpp/llama-cpp-localai-paged/patches/paged/0040-*.patch` +- Read: `backend/cpp/llama-cpp-localai-paged/patches/paged/0041-*.patch` +- Read: `backend/cpp/llama-cpp-localai-paged/patches/paged/0043-*.patch` +- Read: `backend/cpp/llama-cpp-localai-paged/patches/paged/0031-*.patch` +- Read: `backend/cpp/llama-cpp-localai-paged/patches/paged/0046-*.patch` +- Read: `backend/cpp/llama-cpp-localai-paged/patches/paged/0047-*.patch` +- Read: `backend/cpp/llama-cpp-localai-paged/patches/paged/0033-*.patch` +- Read: `backend/cpp/llama-cpp-localai-paged/patches/paged/0034-*.patch` +- Read: `backend/cpp/llama-cpp-localai-paged/patches/paged/0035-*.patch` +- Read: `backend/cpp/llama-cpp-localai-paged/patches/paged/0048-*.patch` +- Read: `backend/cpp/llama-cpp-localai-paged/patches/paged/0049-*.patch` +- Read: `backend/cpp/llama-cpp-localai-paged/patches/paged/0050-*.patch` + +- [x] **Step 1: Review D1** + +Ask a read-only explorer to reconcile whether D1/full-step graph capture is shipped or still open. + +Observed: + +```text +D1/full-step MoE decode CUDA graph capture is shipped and default-on. +The host-sync premise is closed/refuted for current GB10 NVFP4 grouped-MMQ decode. +``` + +- [x] **Step 2: Review GDN** + +Ask a read-only explorer to inspect GDN tensor-core/chunking state. + +Observed: + +```text +0046/0047 are shipped GB10 wins. +0031 scalar chunking stayed opt-in/slower. +C32 slab, QS-early, and Global-Ai32 were correctness-clean but slower. +Do not add another GDN GB10 patch. +``` + +- [x] **Step 3: Review W4A16/GEMM** + +Ask a read-only explorer to inspect the prefill GEMM / W4A16 state. + +Observed: + +```text +0033/0034/0035 are default-off. +0048/0049/0050 improve forced W4A16 only marginally. +Production defaults still use FP4-MMQ. +Do not add another small W4A16 body/metadata patch. 
+``` + +### Task 2: Record Phase42 Decision + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` +- Modify: `docs/superpowers/plans/2026-07-01-low-concurrency-phase41.md` +- Create: `docs/superpowers/plans/2026-07-01-target-reconciliation-phase42.md` + +- [x] **Step 1: Correct Phase41 D1 wording** + +Change Phase41 from "D1 remains relevant" to "low-concurrency remains a gap, but D1 graph capture is already shipped/default-on and not reopened." + +- [x] **Step 2: Add Phase42 decision** + +Record: + +```text +D1: closed on current GB10 path. +GDN: low-conflict GB10 work exhausted. +W4A16/GEMM: micro-patch track exhausted. +Next small GB10 source candidate: persistent/load-time F32 combined gate projection. +``` + +- [x] **Step 3: Verify and commit** + +Run: + +```bash +git diff --check +git status --short +``` + +Commit with: + +```bash +git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md \ + docs/superpowers/plans/2026-07-01-low-concurrency-phase41.md +git add -f docs/superpowers/plans/2026-07-01-target-reconciliation-phase42.md +git commit -m "docs(paged): reconcile next parity target" -m "Assisted-by: Codex:gpt-5" +``` diff --git a/docs/superpowers/plans/2026-07-01-ttft-prefill-first-cap-phase57.md b/docs/superpowers/plans/2026-07-01-ttft-prefill-first-cap-phase57.md new file mode 100644 index 000000000000..5d8b2c1686e9 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-ttft-prefill-first-cap-phase57.md @@ -0,0 +1,109 @@ +# Phase57 TTFT Prefill-First Cap Sweep Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to 
implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Test whether a per-step cap on `LLAMA_TTFT_PREFILL_FIRST=1` avoids the MoE mean-TTFT regression seen in Phase56 while preserving dense gains. + +**Architecture:** Add a small optional cap to the existing default-off Phase55 policy. Unset or zero cap keeps Phase55 unlimited behavior. Gate with focused unit tests, then temporarily apply the stack to DGX for md5/op gates and an A/B cap sweep. + +**Tech Stack:** llama.cpp fork, `tools/server/server-admission-policy.h`, `tools/server/server-context.cpp`, DGX GB10, `h2h_cli.py`, `paged-inference-gates.sh`. + +--- + +### Task 1: Add capped helper + +- [x] **Step 1: Write red test** + +Added test cases for: + +- zero cap means unlimited +- below cap defers +- at cap stops deferring + +Observed red failure: the helper accepted only three arguments. + +- [x] **Step 2: Implement cap helper and env** + +Added overload: + +```cpp +server_admission_should_defer_decode_for_ttft(enabled, prompt_waiting, n_decoded, deferred_so_far, max_deferred) +``` + +Added `LLAMA_TTFT_PREFILL_FIRST_MAX_DEFER`. Unset or `0` keeps unlimited +Phase55 behavior. + +- [x] **Step 3: Verify local** + +Commands passed: + +```bash +cmake --build build --target test-server-admission-policy test-server-admission-trace llama-server -j2 +./build/bin/test-server-admission-policy +./build/bin/test-server-admission-trace +ctest --test-dir build -R 'test-server-admission-(policy|trace)' --output-on-failure +``` + +- [x] **Step 4: Commit fork patch** + +Local fork commit: + +```text +3b6ab5fa8 feat(server): cap TTFT prefill-first decode deferral +``` + +### Task 2: DGX gate and cap sweep + +- [x] **Step 1: Preflight and build** + +Preflight: docker `0`, `local-ai-worker` `0`, compute `0`, lock +`FREE released-by-codex-phase56-validation 1782900217`, clean mirror at +`2cbb61969443cf52aa1aa58eb9f5a8d7c20a7780`. 
+
+Applied `/tmp/phase57-ttft-cap-stack.patch`, built focused tests,
+`llama-server`, `llama-cli`, and `test-backend-ops`. DGX focused CTests passed.
+
+- [x] **Step 2: Run pre/post gates**
+
+Artifact: `/home/mudler/bench/phase57_ttft_cap_sweep/20260701_120830`.
+
+Pre and post gates matched:
+
+- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`
+- dense md5 `5951a5b4d624ce891e22ab5fca9bc439`
+- `MUL_MAT` `1146/1146`
+- `MUL_MAT_ID` `806/806`
+
+- [x] **Step 3: Run MoE cap sweep**
+
+MoE `n=128`, `ptok=128`, `gen=64`:
+
+| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
+|---------|---------|-----------------|-------------|--------------|-------------|--------|----------|
+| default | `337.1` | `652.0` | `1516.1` | `7425.5` | `11735.7` | `24.299` | `0` |
+| cap16 | `330.2` | `611.5` | `1559.6` | `7589.4` | `11407.9` | `24.802` | `111` |
+| cap32 | `335.3` | `624.6` | `1572.4` | `6994.0` | `11315.5` | `24.429` | `236` |
+| cap64 | `327.1` | `589.6` | `1596.9` | `7533.2` | `11141.5` | `25.025` | `339` |
+
+- [x] **Step 4: Run dense cap sweep**
+
+Dense `n=128`, `ptok=168`, `gen=64`:
+
+| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred |
+|---------|---------|-----------------|-------------|--------------|-------------|--------|----------|
+| default | `141.4` | `360.6` | `650.8` | `22423.5` | `35209.6` | `57.925` | `0` |
+| cap32 | `139.7` | `340.1` | `663.1` | `20346.5` | `34556.0` | `58.645` | `322` |
+| cap64 | `136.3` | `333.4` | `645.2` | `22461.1` | `35511.7` | `60.081` | `490` |
+
+- [x] **Step 5: Revert DGX stack**
+
+Reverted the temporary patch stack, removed introduced files, and released the
+lock as `FREE released-by-codex-phase57-cap 1782901003`.
+
+### Task 3: Decision
+
+- [x] **Step 1: Record outcome**
+
+Decision: reject the cap as a parity lever. MoE cap32 improves mean TTFT versus
+same-window default but still slightly loses aggregate and wall.
Dense caps lose +aggregate versus the same-window default, and cap64 is broadly worse. diff --git a/docs/superpowers/plans/2026-07-01-ttft-prefill-first-phase55.md b/docs/superpowers/plans/2026-07-01-ttft-prefill-first-phase55.md new file mode 100644 index 000000000000..107d84a1e9a5 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-ttft-prefill-first-phase55.md @@ -0,0 +1,326 @@ +# Phase55 TTFT Prefill-First Scheduler A/B Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Test a default-off scheduler A/B that prioritizes first-token admission by deferring token 2+ decode while any prompt still has not reached first token. + +**Architecture:** Implement fork-first in `/home/mudler/_git/llama.cpp` on `localai-paged`. Keep default behavior unchanged. Add a tiny tested scheduler helper, wire it into `server_context_impl::pre_decode()` behind `LLAMA_TTFT_PREFILL_FIRST=1`, extend `LLAMA_SERVING_TRACE=1` with deferred-decode counters, then verify locally and on DGX with md5/op gates and a dense `n=128` A/B. Do not regenerate LocalAI patches until the fork branch is pushed with explicit approval. + +**Tech Stack:** llama.cpp server scheduler, CMake unit tests, DGX GB10 `build-cuda`, `paged-inference-gates.sh`, `h2h_cli.py`. 
+ +--- + +### Task 1: Reconcile Current Patch State + +**Files:** +- Modify later: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify later: `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify later: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` + +- [x] **Step 1: Record current fork and mirror state** + +Run: + +```bash +cd /home/mudler/_git/llama.cpp +git status --short +git log --oneline -5 +git rev-list --left-right --count fork/localai-paged...HEAD || true + +cd /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention +git status --short +ls backend/cpp/llama-cpp-localai-paged/patches/paged | tail +``` + +Expected: llama.cpp fork is clean at `bd7b2e952` before Phase55, with local +trace commits not mirrored into LocalAI patches. LocalAI worktree may still have +the unrelated untracked `.claude/`. + +Observed: fork was clean at `bd7b2e952` before Phase55, and after implementation +is clean at `8a97629a4`. It is `18` commits ahead of `fork/localai-paged`. +LocalAI patches still stop at `0063`. + +- [x] **Step 2: Keep mirror blocked until push approval** + +Document that Phase51, Phase54, and Phase55 fork commits are local only. Do not +edit `backend/cpp/llama-cpp-localai-paged/patches/paged/*.patch` directly. 
+ +### Task 2: Add Red Scheduler Helper Test + +**Files:** +- Create: `/home/mudler/_git/llama.cpp/tools/server/server-admission-policy.h` +- Create: `/home/mudler/_git/llama.cpp/tests/test-server-admission-policy.cpp` +- Modify: `/home/mudler/_git/llama.cpp/tests/CMakeLists.txt` + +- [x] **Step 1: Create the failing helper test** + +Add a test that calls: + +```cpp +server_admission_should_defer_decode_for_ttft(false, true, 8) == false +server_admission_should_defer_decode_for_ttft(true, false, 8) == false +server_admission_should_defer_decode_for_ttft(true, true, 0) == false +server_admission_should_defer_decode_for_ttft(true, true, 1) == true +server_admission_should_defer_decode_for_ttft(true, true, 64) == true +``` + +- [x] **Step 2: Run red** + +Run: + +```bash +cd /home/mudler/_git/llama.cpp +cmake --build build --target test-server-admission-policy -j2 +``` + +Expected: build fails because `server-admission-policy.h` or the helper does +not exist. + +Observed: after reconfiguring CMake, the build failed because +`../tools/server/server-admission-policy.h` did not exist. + +### Task 3: Implement Helper and Trace Counter + +**Files:** +- Create: `/home/mudler/_git/llama.cpp/tools/server/server-admission-policy.h` +- Modify: `/home/mudler/_git/llama.cpp/tools/server/server-admission-trace.h` +- Modify: `/home/mudler/_git/llama.cpp/tests/test-server-admission-trace.cpp` +- Modify: `/home/mudler/_git/llama.cpp/tests/test-server-admission-policy.cpp` +- Modify: `/home/mudler/_git/llama.cpp/tests/CMakeLists.txt` + +- [x] **Step 1: Add the helper** + +Implement: + +```cpp +static inline bool server_admission_should_defer_decode_for_ttft( + bool enabled, + bool prompt_waiting, + int32_t n_decoded) { + return enabled && prompt_waiting && n_decoded > 0; +} +``` + +- [x] **Step 2: Add trace counter** + +Add `ttft_deferred_decode_slots` to `server_admission_trace_step` and +`server_admission_trace_totals`, accumulate it, and format it as +`ttft_deferred_decode_slots=`. 
+ +- [x] **Step 3: Verify local tests** + +Run: + +```bash +cd /home/mudler/_git/llama.cpp +cmake --build build --target test-server-admission-policy test-server-admission-trace -j2 +./build/bin/test-server-admission-policy +./build/bin/test-server-admission-trace +ctest --test-dir build -R 'test-server-admission-(policy|trace)' --output-on-failure +``` + +Expected: both tests pass. + +Observed: both tests passed locally and under CTest. + +### Task 4: Wire Default-Off Scheduler A/B + +**Files:** +- Modify: `/home/mudler/_git/llama.cpp/tools/server/server-context.cpp` + +- [x] **Step 1: Include the helper** + +Add `#include "server-admission-policy.h"` beside the trace include. + +- [x] **Step 2: Detect prompt backlog before collecting generating slots** + +Before the generating-slot loop, scan slots for: + +```cpp +slot.state == SLOT_STATE_STARTED || slot.state == SLOT_STATE_PROCESSING_PROMPT +``` + +Store this as `ttft_prompt_waiting`. + +- [x] **Step 3: Defer token 2+ decode when enabled** + +Inside the generating-slot loop, before touching `slot_batched`, skip slots +where: + +```cpp +server_admission_should_defer_decode_for_ttft( + ttft_prefill_first, + ttft_prompt_waiting, + slot.n_decoded) +``` + +Increment `serving_trace_step.ttft_deferred_decode_slots` for each skipped +slot when trace is enabled. + +- [x] **Step 4: Verify local build** + +Run: + +```bash +cd /home/mudler/_git/llama.cpp +cmake --build build --target test-server-admission-policy test-server-admission-trace llama-server -j2 +ctest --test-dir build -R 'test-server-admission-(policy|trace)' --output-on-failure +``` + +Expected: build and focused tests pass. + +Observed: focused tests passed and `llama-server` built. Local UI provisioning +used the repo fallback bundle path after the local Node engine mismatch. 
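The two-pass wiring from Task 4 (scan for prompt backlog, then skip token-2+ decode slots) might look roughly like this. It is a sketch under stated assumptions: `slot_sketch`, `slot_state_sketch`, `step_result_sketch`, and `collect_decode_slots` are hypothetical names, the real slot type in `server-context.cpp` carries far more state, and the real counter is only incremented when tracing is enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical slot model; only the two prompt states mirror names used in the plan.
enum slot_state_sketch { SLOT_IDLE, SLOT_STARTED, SLOT_PROCESSING_PROMPT, SLOT_GENERATING };

struct slot_sketch {
    slot_state_sketch state;
    int32_t           n_decoded;
};

struct step_result_sketch {
    std::vector<size_t> batched;                      // generating slots that decode this step
    int64_t             ttft_deferred_decode_slots = 0;
};

// Same predicate as the plan's helper, repeated so the sketch is self-contained.
static inline bool should_defer(bool enabled, bool prompt_waiting, int32_t n_decoded) {
    return enabled && prompt_waiting && n_decoded > 0;
}

static inline step_result_sketch collect_decode_slots(
        const std::vector<slot_sketch> & slots, bool ttft_prefill_first) {
    // Pass 1: detect prompt backlog before collecting any generating slot.
    bool ttft_prompt_waiting = false;
    for (const auto & slot : slots) {
        if (slot.state == SLOT_STARTED || slot.state == SLOT_PROCESSING_PROMPT) {
            ttft_prompt_waiting = true;
            break;
        }
    }

    // Pass 2: batch generating slots, skipping the ones the policy defers.
    step_result_sketch out;
    for (size_t i = 0; i < slots.size(); ++i) {
        if (slots[i].state != SLOT_GENERATING) {
            continue;
        }
        if (should_defer(ttft_prefill_first, ttft_prompt_waiting, slots[i].n_decoded)) {
            ++out.ttft_deferred_decode_slots;         // the sketch always counts; the server gates this on tracing
            continue;
        }
        out.batched.push_back(i);
    }
    return out;
}
```

A generating slot with `n_decoded == 0` is never deferred, so a request still gets its first sampled token promptly; only streams that are already producing tokens yield to waiting prompts.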
+ +### Task 5: Commit Fork Patch + +**Files:** +- Fork commit only + +- [x] **Step 1: Commit locally** + +Commit message: + +```text +feat(server): add TTFT prefill-first scheduler mode +``` + +Include trailer: + +```text +Assisted-by: Codex:gpt-5 +``` + +Do not push without explicit approval. + +Local fork commit: + +```text +8a97629a4 feat(server): add TTFT prefill-first scheduler mode +``` + +### Task 6: DGX Verification and A/B + +**Files:** +- DGX mirror: `~/llama-phase6-source` +- Artifact: `~/bench/phase55_ttft_prefill_first/` + +- [x] **Step 1: Preflight** + +Require: Docker `0`, `local-ai-worker` `0`, compute apps `0`, lock `FREE*`. + +Observed: docker `0`, `local-ai-worker` `0`, compute `0`, lock +`FREE released-by-codex-phase54-hist 1782898659`, DGX mirror clean at +`2cbb61969443cf52aa1aa58eb9f5a8d7c20a7780`. + +- [x] **Step 2: Apply temporary stack and build** + +Apply the local fork stack to the clean DGX mirror, build +`test-server-admission-policy`, `test-server-admission-trace`, `llama-server`, +`llama-cli`, and `test-backend-ops`. + +Observed: CMake reconfiguration was required for the new test target. After +reconfigure, all requested targets built and focused CTests passed on DGX. + +- [x] **Step 3: Run pre and post gates** + +Use: + +```bash +BIN=$HOME/llama-phase6-source/build-cuda/bin \ +ART=$ART/gate_pre \ +OPS=MUL_MAT,MUL_MAT_ID \ + $HOME/paged-inference-gates.sh +``` + +Repeat for `gate_post`. Expected md5/op gates: + +- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0` +- dense md5 `5951a5b4d624ce891e22ab5fca9bc439` +- `MUL_MAT` `1146/1146` +- `MUL_MAT_ID` `806/806` + +Observed: `gate_pre`, `gate_post`, and an extra `gate_after_ab` all matched the +expected md5/op gates. 
+ +- [x] **Step 4: Run dense A/B** + +Run Phase54 baseline shape with `LLAMA_SERVING_TRACE=1`: + +- `n=128` +- `ptok=168` +- `gen=64` +- `--parallel 128` +- `-c 131072 -b 2048 -ub 512` + +Run variants: + +- default +- `LLAMA_TTFT_PREFILL_FIRST=1` + +Record h2h JSON and trace line for both. + +Artifact: `/home/mudler/bench/phase55_ttft_prefill_first/20260701_114929`. + +Default: + +```json +{"n": 128, "reqs": 128, "gen_total": 8192, "prompt_tok_total": 22913, "gen_per_req": 64.0, "agg_tps": 138.2, "decode_agg_tps": 361.3, "decode_perseq_tps": 1.91, "prefill_tps": 626.0, "ttft_mean_ms": 23231.9, "ttft_max_ms": 36599.5, "wall_s": 59.272} +``` + +```text +steps=76 decode_only_steps=0 decode_tokens=8064 prompt_tokens=22913 waiting_prompt_slots=267 max_waiting_prompt_slots=34 started_prompt_slots=128 continued_prompt_slots=139 ttft_deferred_decode_slots=0 prompt_hist=0:63,1-64:1,513+:12 decode_hist=0:3,1-63:10,64-127:10,128-255:53 waiting_hist=0:63,1-7:1,8-15:2,16-31:9,32-63:1 +``` + +`LLAMA_TTFT_PREFILL_FIRST=1`: + +```json +{"n": 128, "reqs": 128, "gen_total": 8192, "prompt_tok_total": 22913, "gen_per_req": 64.0, "agg_tps": 142.9, "decode_agg_tps": 336.9, "decode_perseq_tps": 1.86, "prefill_tps": 694.2, "ttft_mean_ms": 21520.8, "ttft_max_ms": 33008.2, "wall_s": 57.323} +``` + +```text +steps=76 decode_only_steps=0 decode_tokens=8064 prompt_tokens=22913 waiting_prompt_slots=267 max_waiting_prompt_slots=35 started_prompt_slots=128 continued_prompt_slots=139 ttft_deferred_decode_slots=660 prompt_hist=0:63,1-64:1,257-512:1,513+:11 decode_hist=0:13,128-255:63 waiting_hist=0:63,1-7:1,8-15:3,16-31:8,32-63:1 +``` + +- [x] **Step 5: Decide** + +Accept Phase55 only if it improves TTFT without material aggregate throughput +loss, or improves aggregate throughput without TTFT collapse. Reject if it +mostly shifts cost from late prompts to already-started streams. + +Decision: keep as a promising default-off A/B. 
On this dense shape it improved +aggregate throughput by `+3.4%`, prefill throughput by `+10.9%`, mean TTFT by +`-7.4%`, max TTFT by `-9.8%`, and wall time by `-3.3%`. Decode-agg fell by +`6.8%`, which is expected because the policy explicitly shifts early compute +toward first-token prompt admission. + +- [x] **Step 6: Revert DGX mirror** + +Reverse the temporary patch stack, remove any untracked files introduced by the +patch, release the lock, and verify no GPU compute apps remain. + +Observed: temporary stack reverted, trace/policy files removed from the clean +mirror, lock released as `FREE released-by-codex-phase55-ttft 1782899730`, and +no compute apps were reported. + +### Task 7: Record Results + +**Files:** +- Modify: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/docs/superpowers/plans/2026-07-01-ttft-prefill-first-phase55.md` +- Modify: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- Modify: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` + +- [x] **Step 1: Mark completed steps** + +Update this plan as each step completes.
+ +- [x] **Step 2: Commit LocalAI docs** + +Commit with: + +```text +docs(paged): record TTFT prefill-first A/B + +Assisted-by: Codex:gpt-5 +``` diff --git a/docs/superpowers/plans/2026-07-01-ttft-prefill-first-validation-phase56.md b/docs/superpowers/plans/2026-07-01-ttft-prefill-first-validation-phase56.md new file mode 100644 index 000000000000..60c3c41eafa8 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-ttft-prefill-first-validation-phase56.md @@ -0,0 +1,119 @@ +# Phase56 TTFT Prefill-First Validation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Validate the Phase55 default-off `LLAMA_TTFT_PREFILL_FIRST=1` scheduler A/B beyond dense `n=128` before any default-on discussion. + +**Architecture:** Do not change code. Temporarily apply the already-local Phase51+Phase54+Phase55 fork stack to the clean DGX mirror, reuse the gated `build-cuda` path, bracket runs with md5/op gates, then compare default vs opt-in on MoE `n=128` and dense lower-concurrency `n=32`. + +**Tech Stack:** DGX GB10, llama.cpp `build-cuda`, `LLAMA_SERVING_TRACE=1`, `LLAMA_TTFT_PREFILL_FIRST=1`, `h2h_cli.py`, `paged-inference-gates.sh`. + +--- + +### Task 1: Prepare DGX Stack + +- [x] **Step 1: Preflight** + +Require: Docker `0`, `local-ai-worker` `0`, GPU compute apps `0`, lock `FREE*`, +and clean `~/llama-phase6-source`. + +Observed: docker `0`, `local-ai-worker` `0`, compute `0`, lock +`FREE released-by-codex-phase55-ttft 1782899730`, DGX mirror clean at +`2cbb61969443cf52aa1aa58eb9f5a8d7c20a7780`. + +- [x] **Step 2: Apply stack and build** + +Apply `/tmp/phase55-ttft-prefill-first-stack.patch` or regenerate the same stack +from `/home/mudler/_git/llama.cpp`. Reconfigure CMake if needed, then build +`llama-server`, `llama-cli`, and `test-backend-ops`. 
+ +Observed: stack applied, CMake reconfigured, and requested targets built. + +### Task 2: Gate Before Validation + +- [x] **Step 1: Run canonical pre-validation gate** + +Expected: + +- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0` +- dense md5 `5951a5b4d624ce891e22ab5fca9bc439` +- `MUL_MAT` `1146/1146` +- `MUL_MAT_ID` `806/806` + +Observed: all expected pre-validation gates matched. + +### Task 3: Run A/B Matrix + +- [x] **Step 1: Run MoE `n=128` default and opt-in** + +Model: `~/bench/q36-35b-a3b-nvfp4.gguf`. +Shape: `--parallel 128`, `-c 131072`, `-b 2048`, `-ub 512`, `n=128`, +`ptok=128`, `gen=64`. + +Default: + +```json +{"n": 128, "reqs": 128, "gen_total": 8191, "prompt_tok_total": 17793, "gen_per_req": 64.0, "agg_tps": 341.1, "decode_agg_tps": 651.2, "decode_perseq_tps": 3.93, "prefill_tps": 1555.9, "ttft_mean_ms": 7168.1, "ttft_max_ms": 11435.5, "wall_s": 24.015} +``` + +`LLAMA_TTFT_PREFILL_FIRST=1`: + +```json +{"n": 128, "reqs": 128, "gen_total": 8192, "prompt_tok_total": 17793, "gen_per_req": 64.0, "agg_tps": 339.9, "decode_agg_tps": 623.8, "decode_perseq_tps": 3.92, "prefill_tps": 1622.7, "ttft_mean_ms": 7615.3, "ttft_max_ms": 10964.4, "wall_s": 24.098} +``` + +- [x] **Step 2: Run dense `n=32` default and opt-in** + +Model: `~/bench/q36-27b-nvfp4.gguf`. +Shape: `--parallel 128`, `-c 131072`, `-b 2048`, `-ub 512`, `n=32`, +`ptok=168`, `gen=64`. 
+ +Default: + +```json +{"n": 32, "reqs": 32, "gen_total": 2048, "prompt_tok_total": 5700, "gen_per_req": 64.0, "agg_tps": 104.3, "decode_agg_tps": 197.1, "decode_perseq_tps": 5.42, "prefill_tps": 617.2, "ttft_mean_ms": 7687.7, "ttft_max_ms": 9234.4, "wall_s": 19.627} +``` + +`LLAMA_TTFT_PREFILL_FIRST=1`: + +```json +{"n": 32, "reqs": 32, "gen_total": 2048, "prompt_tok_total": 5700, "gen_per_req": 64.0, "agg_tps": 106.7, "decode_agg_tps": 193.5, "decode_perseq_tps": 5.37, "prefill_tps": 662.1, "ttft_mean_ms": 7284.3, "ttft_max_ms": 8609.1, "wall_s": 19.194} +``` + +### Task 4: Gate After Validation and Clean DGX + +- [x] **Step 1: Run canonical post-validation gate** + +Expected md5/op values match Task 2. + +Observed: all expected post-validation gates matched. + +- [x] **Step 2: Revert temporary DGX stack** + +Reverse the patch, remove untracked files introduced by the stack, release the +lock, and verify no compute apps remain. + +Observed: stack reverted, introduced files removed, lock released as +`FREE released-by-codex-phase56-validation 1782900217`, and no compute apps +were reported. + +### Task 5: Record Decision + +- [x] **Step 1: Update parity docs** + +Record the artifact, all A/B rows, trace counters, gates, and whether the policy +remains promising, is rejected, or needs narrower gating. + +Decision: keep the policy opt-in only. Dense `n=32` improved aggregate and TTFT, +but MoE `n=128` slightly regressed aggregate and mean TTFT, so the policy is not +safe as a broad default. 
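The Phase56 decision rests on signed relative deltas between the default and opt-in rows. A small sketch of that arithmetic follows, using the `agg_tps` and `ttft_mean_ms` values from the JSON rows above; `pct_delta`, `ab_row_sketch`, and `print_ab` are hypothetical names introduced here and are not part of `h2h_cli.py`.

```cpp
#include <cassert>
#include <cstdio>

// Relative change of `variant` against `baseline`, in percent.
static inline double pct_delta(double baseline, double variant) {
    return (variant - baseline) / baseline * 100.0;
}

// One A/B comparison: aggregate t/s (higher is better) and mean TTFT in ms (lower is better).
struct ab_row_sketch {
    const char * name;
    double base_agg, opt_agg;
    double base_ttft, opt_ttft;
};

// Values copied from the Phase56 JSON rows.
static const ab_row_sketch phase56_rows[] = {
    {"MoE n=128",  341.1, 339.9, 7168.1, 7615.3},
    {"dense n=32", 104.3, 106.7, 7687.7, 7284.3},
};

static inline void print_ab(const ab_row_sketch & r) {
    std::printf("%-11s agg %+.2f%%  ttft_mean %+.2f%%\n",
                r.name,
                pct_delta(r.base_agg, r.opt_agg),
                pct_delta(r.base_ttft, r.opt_ttft));
}
```

Calling `print_ab` on each row shows MoE `n=128` moving the wrong way on both metrics (aggregate about 0.35% lower, mean TTFT about 6.2% higher) while dense `n=32` moves the right way on both (about 2.3% higher and 5.2% lower), which is the asymmetry behind keeping the policy opt-in.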
+ +- [x] **Step 2: Commit LocalAI docs** + +Use: + +```text +docs(paged): validate TTFT prefill-first A/B + +Assisted-by: Codex:gpt-5 +``` diff --git a/docs/superpowers/plans/2026-07-01-ttft-prefill-first-waiting-threshold-phase58.md b/docs/superpowers/plans/2026-07-01-ttft-prefill-first-waiting-threshold-phase58.md new file mode 100644 index 000000000000..8c64e7c3e737 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-ttft-prefill-first-waiting-threshold-phase58.md @@ -0,0 +1,106 @@ +# Phase58 TTFT Prefill-First Waiting-Threshold Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Test whether activating TTFT prefill-first only during high prompt-backlog windows keeps the MoE benefit without the broad-defer regressions from Phase56-57. + +**Architecture:** Add `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING` as a default-off refinement. Unset or zero keeps the existing Phase55/57 behavior. Gate with focused tests, then run DGX md5/op gates and same-window MoE/dense threshold sweeps. + +**Tech Stack:** llama.cpp fork, `tools/server/server-admission-policy.h`, `tools/server/server-context.cpp`, DGX GB10, `h2h_cli.py`, `paged-inference-gates.sh`. + +--- + +### Task 1: Add waiting-threshold helper + +- [x] **Step 1: Write red test** + +Added helper expectations: + +- zero waiting threshold defers +- at waiting threshold defers +- below waiting threshold does not defer + +Observed red failure: no helper overload accepted the waiting-slot threshold +signature. + +- [x] **Step 2: Implement threshold helper and env** + +Added `LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING`. The scheduler now counts prompt +slots in `SLOT_STATE_STARTED` or `SLOT_STATE_PROCESSING_PROMPT` before collecting +decode rows and only defers if the waiting count is at or above the threshold. 
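The waiting-threshold rule can be sketched as a pure predicate. The overload's real signature in `server-admission-policy.h` is not shown in this plan, so `should_defer_with_min_waiting` and its parameter order are assumptions; the sketch only encodes the three stated expectations (zero threshold defers, at threshold defers, below threshold does not) and does not model the Phase57 cap.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical threshold-aware predicate.
//   n_waiting_prompts: slots currently in SLOT_STATE_STARTED or SLOT_STATE_PROCESSING_PROMPT
//   min_waiting:       parsed LLAMA_TTFT_PREFILL_FIRST_MIN_WAITING, 0 when unset
static inline bool should_defer_with_min_waiting(
        bool    enabled,
        int32_t n_waiting_prompts,
        int32_t min_waiting,
        int32_t n_decoded) {
    // Base rule: policy on, at least one prompt waiting, slot already past its first token.
    if (!enabled || n_waiting_prompts <= 0 || n_decoded <= 0) {
        return false;
    }
    // Unset or zero threshold keeps the ungated behaviour.
    if (min_waiting <= 0) {
        return true;
    }
    // Otherwise defer only while the backlog is at or above the threshold.
    return n_waiting_prompts >= min_waiting;
}
```

With `min_waiting = 32`, a backlog of 31 waiting prompts leaves decode untouched and a backlog of 32 starts deferring, which matches the `min32` rows in the sweep that follows.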
+ +- [x] **Step 3: Verify local** + +Commands passed: + +```bash +cmake --build build --target test-server-admission-policy test-server-admission-trace llama-server -j2 +./build/bin/test-server-admission-policy +ctest --test-dir build -R 'test-server-admission-(policy|trace)' --output-on-failure +``` + +- [x] **Step 4: Commit fork patch** + +Local fork commit: + +```text +8759213e3 feat(server): gate TTFT defer by prompt backlog +``` + +### Task 2: DGX gate and threshold sweep + +- [x] **Step 1: Preflight and build** + +Preflight: docker `0`, `local-ai-worker` `0`, compute `0`, lock +`FREE released-by-codex-phase57-cap 1782901003`, clean mirror at +`2cbb61969443cf52aa1aa58eb9f5a8d7c20a7780`. + +Applied `/tmp/phase58-ttft-waiting-stack.patch`, built focused tests, +`llama-server`, `llama-cli`, and `test-backend-ops`. DGX focused CTests passed. + +- [x] **Step 2: Run pre/post gates** + +Artifact: `/home/mudler/bench/phase58_ttft_waiting_sweep/20260701_122052`. + +Pre and post gates matched: + +- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0` +- dense md5 `5951a5b4d624ce891e22ab5fca9bc439` +- `MUL_MAT` `1146/1146` +- `MUL_MAT_ID` `806/806` + +- [x] **Step 3: Run MoE threshold sweep** + +MoE `n=128`, `ptok=128`, `gen=64`: + +| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred | +|---------|---------|-----------------|-------------|--------------|-------------|--------|----------| +| default | `339.0` | `648.4` | `1542.9` | `7743.1` | `11532.5` | `24.167` | `0` | +| min24 | `339.9` | `619.3` | `1637.0` | `7326.6` | `10868.8` | `24.095` | `323` | +| min32 | `341.9` | `635.0` | `1609.6` | `7420.1` | `11054.6` | `23.950` | `220` | +| min32+cap32 | `331.2` | `631.8` | `1512.1` | `7829.2` | `11767.1` | `24.733` | `140` | + +- [x] **Step 4: Run dense threshold sweep** + +Dense `n=128`, `ptok=168`, `gen=64`: + +| variant | agg t/s | decode agg t/s | prefill t/s | TTFT mean ms | TTFT max ms | wall s | deferred | 
+|---------|---------|-----------------|-------------|--------------|-------------|--------|----------| +| default | `140.3` | `362.7` | `639.8` | `21407.3` | `35811.6` | `58.399` | `0` | +| min24 | `140.4` | `347.6` | `658.7` | `22078.2` | `34783.3` | `58.353` | `420` | +| min32 | `139.7` | `350.2` | `650.1` | `21221.5` | `35246.3` | `58.642` | `386` | + +- [x] **Step 5: Revert DGX stack** + +Reverted the temporary patch stack, removed introduced files, and released the +lock as `FREE released-by-codex-phase58-waiting 1782901748`. + +### Task 3: Decision + +- [x] **Step 1: Record outcome** + +Decision: keep the threshold as the best selective TTFT-defer A/B so far, but +still opt-in. MoE min32 improved aggregate, mean/max TTFT, and wall in the same +window. Dense min32 was roughly neutral with a small TTFT gain but slight +aggregate/wall loss. Next step should repeat min32 and compare against vLLM h2h +before any default-on discussion. diff --git a/docs/superpowers/plans/2026-07-01-vllm-env-hygiene-phase49.md b/docs/superpowers/plans/2026-07-01-vllm-env-hygiene-phase49.md new file mode 100644 index 000000000000..35114f4b89ff --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-vllm-env-hygiene-phase49.md @@ -0,0 +1,98 @@ +# Phase49 vLLM Env Hygiene Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Keep vLLM benchmark logs clean by preventing harness-only `VLLM_*` variables from being inherited by the vLLM server process. + +**Architecture:** Add an `env -u ...` wrapper around the `vllm serve` command in `paged-current-serving-snapshot.sh`. 
Only unset harness-owned variables (`VLLM_MODEL`, `VLLM_BIN`, `VLLM_READY_ATTEMPTS`, `VLLM_GPU_MEMORY_UTILIZATION`, `VLLM_MAX_MODEL_LEN`, `VLLM_MAX_NUM_SEQS`, `VLLM_TENSOR_PARALLEL_SIZE`, `VLLM_EXTRA_ARGS`) and keep intentional vLLM runtime variables like `VLLM_LOGGING_LEVEL`. + +**Tech Stack:** Bash serving harness, LocalAI parity docs. + +--- + +### Task 1: Prove env scrubbing is absent + +**Files:** +- Test: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + +- [x] **Step 1: Run red grep** + +```bash +grep -F 'env -u VLLM_MODEL' backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Expected: exit `1`. + +### Task 2: Add vLLM child env scrub + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + +- [x] **Step 1: Wrap the vLLM command** + +Change: + +```bash +nohup "$VLLM_BIN" serve "$VLLM_MODEL" \ +``` + +to: + +```bash +nohup env \ + -u VLLM_MODEL -u VLLM_BIN -u VLLM_READY_ATTEMPTS \ + -u VLLM_GPU_MEMORY_UTILIZATION -u VLLM_MAX_MODEL_LEN -u VLLM_MAX_NUM_SEQS \ + -u VLLM_TENSOR_PARALLEL_SIZE -u VLLM_EXTRA_ARGS \ + "$VLLM_BIN" serve "$VLLM_MODEL" \ +``` + +### Task 3: Verify + +**Files:** +- Test: `backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh` + +- [x] **Step 1: Shell syntax check** + +```bash +bash -n backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Expected: exit `0`. + +- [x] **Step 2: Green grep** + +```bash +grep -F -- '-u VLLM_MODEL' backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Expected: exit `0`. 
+ +- [x] **Step 3: DGX dry-run still passes** + +```bash +ssh dgx.casa 'set -euo pipefail; ART=$HOME/bench/phase49_vllm_env_hygiene_dryrun/$(date +%Y%m%d_%H%M%S); SRC=$HOME/llama-phase6-source BUILD_DIR=$HOME/llama-phase6-source/build-phase36 BIN=$HOME/llama-phase6-source/build-phase36/bin MODEL=$HOME/bench/q36-27b-nvfp4.gguf VLLM_MODEL=$HOME/bench/q36-27b-nvfp4-vllm SERVED_MODEL_NAME=dense-q36 ART=$ART NPL="1" PARALLEL=1 CTX=4096 PTOK=16 GEN=4 DRY_RUN=1 VLLM_READY_ATTEMPTS=700 OPS=MUL_MAT,MUL_MAT_ID bash -s' < backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh +``` + +Expected: exit `0`, clean preflight, and dry-run output still prints `VLLM_READY_ATTEMPTS=700`. + +### Task 4: Record and commit + +**Files:** +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- Modify: `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` +- Modify: `docs/superpowers/plans/2026-07-01-vllm-env-hygiene-phase49.md` + +- [x] **Step 1: Record Phase49** + +Record the dry-run artifact and state that this is log hygiene only. 
+ +- [x] **Step 2: Final checks and commit** + +```bash +git diff --check +git add backend/cpp/llama-cpp-localai-paged/paged-current-serving-snapshot.sh \ + backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md \ + backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +git add -f docs/superpowers/plans/2026-07-01-vllm-env-hygiene-phase49.md +git commit -m "fix(paged): scrub harness vars for vllm serve" -m "Assisted-by: Codex:gpt-5" +``` diff --git a/docs/superpowers/plans/2026-07-01-w4a16-current-profile-phase60.md b/docs/superpowers/plans/2026-07-01-w4a16-current-profile-phase60.md new file mode 100644 index 000000000000..e61a6efa426c --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-w4a16-current-profile-phase60.md @@ -0,0 +1,81 @@ +# Phase 60: Current W4A16 Prefill Profile + +## Goal + +Re-profile the current clean W4A16 grouped MoE prefill path after the Phase1-5 +W4A16 work, then decide whether another low-conflict W4A16 patch is justified. + +## Artifact + +- `/home/mudler/bench/phase60_w4a16_current_profile/20260701_104915` + +## Source State + +- DGX mirror: `~/llama-phase6-source` +- Branch: `localai-paged` +- Commit: `2cbb61969443cf52aa1aa58eb9f5a8d7c20a7780` + +## Gates + +| phase | MoE md5 | dense md5 | `MUL_MAT` | `MUL_MAT_ID` | +|-------|---------|-----------|-----------|--------------| +| pre | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | +| post | `8cb0ce23777bf55f92f63d0292c756b0` | `5951a5b4d624ce891e22ab5fca9bc439` | `1146/1146` | `806/806` | + +DGX cleanup: + +- Docker containers: `0` +- GPU compute apps: `0` +- Lock released: `FREE phase60-cleanup 20260701T105438Z` + +## End-to-End A/B + +MoE `llama-batched-bench`, `npl=32`, `ntg=4`, `npp=512,2048`: + +| path | PP | S_PP t/s | T_PP s | S_TG t/s | total S t/s | +|------|----|----------|--------|----------|-------------| +| default FP4-MMQ | `512` | `2327.69` | `7.039` | `399.87` | `2243.83` | +| default FP4-MMQ | 
`2048` | `2423.20` | `27.045` | `391.58` | `2398.94` | +| forced W4A16 | `512` | `1451.00` | `11.291` | `319.32` | `1412.21` | +| forced W4A16 | `2048` | `1482.76` | `44.199` | `303.40` | `1471.61` | + +Forced W4A16 remains: + +- `0.623x` default FP4-MMQ at `npp=512` (`-37.7%` S_PP). +- `0.612x` default FP4-MMQ at `npp=2048` (`-38.8%` S_PP). + +## `npp=512` Kernel Summary + +Default FP4-MMQ top rows: + +| bucket | time % | total time | +|--------|--------|------------| +| `mul_mat_q` | `39.2%` | `2.712s` | +| `gated_delta_net_chunked_cuda` | `12.2%` | `0.843s` | +| `quantize_mmq_nvfp4` | `4.5%` | `0.314s` | + +Forced W4A16 top rows: + +| bucket | time % | total time | +|--------|--------|------------| +| `w4a16_grouped_kernel<32,128,1,4,2>` | `42.5%` | `4.142s` | +| `k_get_rows_float` | `11.2%` | `1.094s` | +| `gated_delta_net_chunked_cuda` | `8.6%` | `0.838s` | +| `w4a16_cast_act_f32_bf16` | `5.3%` | `0.517s` | +| residual `quantize_mmq_nvfp4` | `1.4%` | `0.132s` | + +## Decision + +Reject another small W4A16 body/metadata/cast tweak as the next parity phase. + +The current W4A16 path avoids most activation quantization, but the grouped +kernel is still `1.53x` slower than default MMQ's main `mul_mat_q` bucket at +`npp=512` (`4.142s` versus `2.712s`) and sorted activation gathers add another +`1.094s`. Eliminating the cast kernel entirely would recover only `5.3%` of the +forced-W4A16 profile and would not close the `37-39%` end-to-end S_PP loss. + +Next W4A16 work would need a larger redesign that both improves the grouped +kernel body and removes or fuses the sorted activation gather. That is outside +the low-conflict incremental patch track. For near-term parity work, return to +the broader prefill/GDN/MoE design track or a hardware-pivot benchmark rather +than another W4A16 micro-patch. 
diff --git a/docs/superpowers/plans/2026-07-01-w4a16-direct-activation-phase61-result.md b/docs/superpowers/plans/2026-07-01-w4a16-direct-activation-phase61-result.md new file mode 100644 index 000000000000..5734ddb6289c --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-w4a16-direct-activation-phase61-result.md @@ -0,0 +1,61 @@ +# W4A16 Direct-Activation Phase61 Result + +Verdict: rejected. + +The default-off direct-A kernel was implemented and gated, but it failed the +performance keep gate. The rejected local diff was saved at: + +- `/tmp/phase61-w4a16-direct-a-rejected.diff` + +The llama.cpp fork keeps only the safe routing stub: + +- `41be3da5b test(cuda): cover W4A16 direct activation policy` +- `7967ad47f feat(cuda): route W4A16 direct activation stub` + +## Correctness + +Default inference gates: + +- Artifact: `/home/mudler/bench/phase61_direct_default_gates/20260701_132057` +- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0` +- dense md5: `5951a5b4d624ce891e22ab5fca9bc439` +- `MUL_MAT`: `1146/1146` +- `MUL_MAT_ID`: `806/806` + +Forced direct-A op gate: + +- Initial direct kernel: `794/806`, failed only `b=1` NVFP4 cases. +- Root cause: `ids_to_sorted` is a flat source-row index for `get_rows_cuda`, + not a `(token, expert-slot)` pair. +- Fixed direct load: `src_base = src1 + src_row*nb11`. +- Final direct gate: `806/806`. + +Opt-in transcript check: + +- Artifact: `/home/mudler/bench/phase61_direct_ab/20260701_132237` +- forced W4A16 MoE md5: `07db32c2bcb78d17a43ed18bc22705cd` +- direct-A MoE md5: `07db32c2bcb78d17a43ed18bc22705cd` +- forced and direct-A transcripts were byte-identical. + +## Performance + +MoE prefill, `npl=32`, `ntg=4`: + +| path | npp512 S_PP | npp2048 S_PP | +|------|-------------|--------------| +| default FP4-MMQ | `2325.45` | `2423.18` | +| forced W4A16 | `1471.05` | `1502.46` | +| forced W4A16 direct-A | `1566.30` | `1605.82` | + +Direct-A improved forced W4A16 by `+6.5%` at `npp=512` and `+6.9%` at +`npp=2048`. 
It reached only `0.67x` and `0.66x` of default FP4-MMQ. + +The keep gate required at least `+12%` over forced W4A16 and at least `0.75x` +of default FP4-MMQ. Phase61 failed both thresholds. + +## Decision + +Do not commit the direct-A kernel. Do not continue W4A16 body tuning as the next +GB10 parity lever. The sorted activation gather and cast were real overhead, but +removing them is not enough: the W4A16 grouped kernel body remains too slow +relative to default FP4-MMQ on GB10. diff --git a/docs/superpowers/plans/2026-07-01-w4a16-direct-activation-phase61.md b/docs/superpowers/plans/2026-07-01-w4a16-direct-activation-phase61.md new file mode 100644 index 000000000000..1f7406756d43 --- /dev/null +++ b/docs/superpowers/plans/2026-07-01-w4a16-direct-activation-phase61.md @@ -0,0 +1,700 @@ +# W4A16 Direct-Activation Phase61 Implementation Plan + +> **For agentic workers:** REQUIRED SUB-SKILL: Use superpowers:subagent-driven-development (recommended) or superpowers:executing-plans to implement this plan task-by-task. Steps use checkbox (`- [ ]`) syntax for tracking. + +**Goal:** Build and gate a default-off W4A16 grouped MoE prefill experiment that removes the measured sorted activation gather and separate f32-to-bf16 cast overhead from Phase60. + +**Architecture:** Keep the existing default path unchanged. Add a second W4A16 grouped kernel mode behind `LLAMA_W4A16_DIRECT_A=1` that consumes the original `src1` activation tensor and the existing `ids_to_sorted` map directly, converting f32 activations to bf16 while loading A into shared memory. This tests the measured Phase60 hypothesis before any larger grouped-kernel body rewrite. + +**Tech Stack:** llama.cpp fork (`/home/mudler/_git/llama.cpp`), ggml CUDA, CMake, `test-backend-ops`, DGX GB10 gates, LocalAI docs. 
+ +--- + +## Evidence + +Phase60 artifact: + +- `/home/mudler/bench/phase60_w4a16_current_profile/20260701_104915` + +At MoE `npp=512`, forced W4A16 spent: + +- `4.142s` in `w4a16_grouped_kernel<32,128,1,4,2>` +- `1.094s` in `k_get_rows_float` sorted activation gathers +- `0.517s` in `w4a16_cast_act_f32_bf16` + +Default FP4-MMQ spent: + +- `2.712s` in `mul_mat_q` +- `0.314s` in `quantize_mmq_nvfp4` + +The direct-activation experiment can at most remove the `1.611s` gather+cast tax before kernel-body effects. It is a kill-gate experiment: if forced W4A16 remains far behind default MMQ, stop this branch and do not tune the W4A16 body again. + +## Files + +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cuh` + - Add declarations for a direct-activation engagement helper and direct kernel launcher. +- Create: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-policy.h` + - Pure C++ policy helper for direct-mode engagement tests. +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cu` + - Add `LLAMA_W4A16_DIRECT_A` parsing. + - Use the pure direct-mode route helper. + - Add a direct-activation kernel variant that uses `ids_to_sorted` and original `src1` strides. +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu` + - Pass `ids_to_sorted`, original `src1` data pointer, and source strides to the direct launcher. + - Skip `get_rows_cuda` and `w4a16_cast_act_f32_bf16` only in direct mode. +- Modify: `/home/mudler/_git/llama.cpp/tests/CMakeLists.txt` + - Register a small policy unit test. +- Create: `/home/mudler/_git/llama.cpp/tests/test-cuda-w4a16-policy.cpp` + - Unit-test direct-mode engagement logic without requiring CUDA execution. +- Create after DGX run: `/home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention/docs/superpowers/plans/2026-07-01-w4a16-direct-activation-phase61-result.md` + - Record gates, A/B, decision, and whether to keep or revert the fork patch. 
+ +## Task 1: Add Red Policy Tests + +**Files:** + +- Create: `/home/mudler/_git/llama.cpp/tests/test-cuda-w4a16-policy.cpp` +- Modify: `/home/mudler/_git/llama.cpp/tests/CMakeLists.txt` + +- [x] **Step 1: Add the failing policy test file** + +Create `/home/mudler/_git/llama.cpp/tests/test-cuda-w4a16-policy.cpp`: + +```cpp +#include "ggml-cuda/w4a16-policy.h" + +#include <cassert> +#include <cstdio> + +static void test_direct_a_requires_master_w4a16() { + const bool ok = ggml_cuda_w4a16_direct_a_should_engage_params( + GGML_TYPE_NVFP4, GGML_TYPE_F32, GGML_TYPE_F32, + /*blackwell=*/true, + /*w4a16_prefill_m=*/0, + /*direct_a=*/true, + /*tokens=*/512, + /*k=*/2048, + /*n=*/512); + assert(!ok); +} + +static void test_direct_a_requires_direct_flag() { + const bool ok = ggml_cuda_w4a16_direct_a_should_engage_params( + GGML_TYPE_NVFP4, GGML_TYPE_F32, GGML_TYPE_F32, + /*blackwell=*/true, + /*w4a16_prefill_m=*/1, + /*direct_a=*/false, + /*tokens=*/512, + /*k=*/2048, + /*n=*/512); + assert(!ok); +} + +static void test_direct_a_engages_for_large_nvfp4_moe_prefill_shape() { + const bool ok = ggml_cuda_w4a16_direct_a_should_engage_params( + GGML_TYPE_NVFP4, GGML_TYPE_F32, GGML_TYPE_F32, + /*blackwell=*/true, + /*w4a16_prefill_m=*/1, + /*direct_a=*/true, + /*tokens=*/512, + /*k=*/2048, + /*n=*/512); + assert(ok); +} + +static void test_direct_a_rejects_decode_sized_shape() { + const bool ok = ggml_cuda_w4a16_direct_a_should_engage_params( + GGML_TYPE_NVFP4, GGML_TYPE_F32, GGML_TYPE_F32, + /*blackwell=*/true, + /*w4a16_prefill_m=*/128, + /*direct_a=*/true, + /*tokens=*/128, + /*k=*/2048, + /*n=*/512); + assert(!ok); +} + +int main() { + test_direct_a_requires_master_w4a16(); + test_direct_a_requires_direct_flag(); + test_direct_a_engages_for_large_nvfp4_moe_prefill_shape(); + test_direct_a_rejects_decode_sized_shape(); + std::puts("test-cuda-w4a16-policy: OK"); + return 0; +} +``` + +- [x] **Step 2: Register the test** + +Add to `/home/mudler/_git/llama.cpp/tests/CMakeLists.txt` near the other
small C++ tests: + +```cmake +llama_build(test-cuda-w4a16-policy.cpp) +target_include_directories(test-cuda-w4a16-policy PRIVATE ${PROJECT_SOURCE_DIR}/ggml/src) +llama_test(test-cuda-w4a16-policy) +``` + +- [x] **Step 3: Verify RED** + +Run: + +```bash +cd /home/mudler/_git/llama.cpp +cmake --build build --target test-cuda-w4a16-policy -j2 +``` + +Expected: build fails because the direct-A policy surface is missing. + +Actual RED: after reconfiguring the build tree, the target failed to compile +because `ggml-cuda/w4a16-policy.h` did not exist. + +## Task 2: Add Direct-A Policy Helper + +**Files:** + +- Create: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-policy.h` +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cu` + +- [x] **Step 1: Create the pure policy helper** + +Create `w4a16-policy.h`: + +```cpp +#pragma once + +#include "ggml.h" + +#include <cstdint> + +static inline bool ggml_cuda_w4a16_direct_a_should_engage_params( + ggml_type src0_type, + ggml_type src1_type, + ggml_type dst_type, + bool blackwell, + int64_t w4a16_prefill_m, + bool direct_a, + int64_t tokens, + int64_t k, + int64_t n) { + if (!direct_a || w4a16_prefill_m <= 0) { + return false; + } + if (src0_type != GGML_TYPE_NVFP4 || src1_type != GGML_TYPE_F32 || dst_type != GGML_TYPE_F32) { + return false; + } + if (!blackwell || tokens <= w4a16_prefill_m) { + return false; + } + return k % 64 == 0 && n % 128 == 0; +} +``` + +- [x] **Step 2: Declare and implement the env helper** + +Add to `w4a16-gemm.cuh`: + +```cpp +bool ggml_cuda_w4a16_direct_a_enabled(); +``` + +Add to `w4a16-gemm.cu` near `ggml_cuda_w4a16_prefill_enabled()`: + +```cpp +bool ggml_cuda_w4a16_direct_a_enabled() { + static const bool enabled = [] { + const char * e = getenv("LLAMA_W4A16_DIRECT_A"); + return e != nullptr && atoi(e) != 0; + }(); + return enabled; +} +``` + +- [x] **Step 3: Verify GREEN for policy test** + +Run: + +```bash +cd /home/mudler/_git/llama.cpp +cmake --build build --target
test-cuda-w4a16-policy -j2 +./build/bin/test-cuda-w4a16-policy +``` + +Expected: + +```text +test-cuda-w4a16-policy: OK +``` + +Actual local verification: + +```bash +cd /home/mudler/_git/llama.cpp +cmake -B build -S . +cmake --build build --target test-cuda-w4a16-policy -j2 +ctest --test-dir build -R '^test-cuda-w4a16-policy$' --output-on-failure +``` + +Result: `100% tests passed, 0 tests failed out of 1`. + +Actual DGX CUDA compile verification: + +```text +test-cuda-w4a16-policy: OK +[ 7%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/w4a16-gemm.cu.o +[100%] Built target test-cuda-w4a16-policy +``` + +Fork commit: + +- `41be3da5b test(cuda): cover W4A16 direct activation policy` + +## Task 3: Add Direct-A Kernel Launcher Skeleton + +**Files:** + +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cuh` +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cu` + +- [x] **Step 1: Declare the direct launcher** + +Add to `w4a16-gemm.cuh`: + +```cpp +void ggml_cuda_mul_mat_id_w4a16_grouped_direct_a( + ggml_backend_cuda_context & ctx, + const ggml_tensor * src0, + const float * src1, + const int32_t * ids_to_sorted, + float * dst_sorted, + const int * tokens_per_expert, + int64_t n_experts, + int64_t n_expert_used, + int64_t k, + int64_t n, + size_t src1_nb1, + size_t src1_nb2, + cudaStream_t stream); +``` + +- [x] **Step 2: Add a stub that preserves behavior** + +Add to `w4a16-gemm.cu` after `ggml_cuda_mul_mat_id_w4a16_grouped()`: + +```cpp +[[noreturn]] void ggml_cuda_mul_mat_id_w4a16_grouped_direct_a( + ggml_backend_cuda_context & ctx, + const ggml_tensor * src0, + const float * src1, + const int32_t * ids_to_sorted, + float * dst_sorted, + const int * tokens_per_expert, + int64_t n_experts, + int64_t n_expert_used, + int64_t k, + int64_t n, + size_t src1_nb1, + size_t src1_nb2, + cudaStream_t stream) { + GGML_UNUSED(ctx); + GGML_UNUSED(src0); + GGML_UNUSED(src1); + GGML_UNUSED(ids_to_sorted); + 
GGML_UNUSED(dst_sorted); + GGML_UNUSED(tokens_per_expert); + GGML_UNUSED(n_experts); + GGML_UNUSED(n_expert_used); + GGML_UNUSED(k); + GGML_UNUSED(n); + GGML_UNUSED(src1_nb1); + GGML_UNUSED(src1_nb2); + GGML_UNUSED(stream); + GGML_ABORT("LLAMA_W4A16_DIRECT_A selected before direct-A kernel implementation"); +} +``` + +- [x] **Step 3: Verify build still passes** + +Run: + +```bash +cd /home/mudler/_git/llama.cpp +cmake --build build --target test-cuda-w4a16-policy llama-batched-bench -j2 +./build/bin/test-cuda-w4a16-policy +``` + +Expected: test passes and `llama-batched-bench` builds. + +Actual local verification: + +```bash +cd /home/mudler/_git/llama.cpp +git diff --check +cmake --build build --target test-cuda-w4a16-policy llama-batched-bench -j2 +./build/bin/test-cuda-w4a16-policy +``` + +Result: `test-cuda-w4a16-policy: OK`. + +Actual DGX CUDA compile verification: + +```text +[ 10%] Building CUDA object ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/w4a16-gemm.cu.o +[100%] Built target llama-batched-bench +test-cuda-w4a16-policy: OK +``` + +Remote mirror cleanup: `/tmp/localai-gpu.lock` released as +`FREE phase61-noreturn-compile 20260701T111354Z`. 
+ +## Task 4: Route Direct-A Mode Without Touching Default Path + +**Files:** + +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu` + +- [x] **Step 1: Add direct-mode branch** + +In `ggml_cuda_mul_mat_id`, after `ids_to_sorted` and `ids_from_sorted` are prepared, replace the W4A16 branch with this structure: + +```cpp + const bool use_w4a16_direct_a = ggml_cuda_w4a16_direct_a_should_engage_params( + src0->type, src1->type, dst->type, + blackwell_mma_available(cc), + ggml_cuda_w4a16_prefill_m(), + ggml_cuda_w4a16_direct_a_enabled(), + ne12, + ne10, + ne0); + + if (use_w4a16_direct_a) { + ggml_cuda_mul_mat_id_w4a16_grouped_direct_a(ctx, src0, + (const float *) src1->data, ids_to_sorted, (float *) dst_sorted.ptr, + tokens_per_expert.data(), ne02, n_expert_used, ne10, ne0, + nb11, nb12, stream); + } else { + get_rows_cuda(src1->data, src1->type, ids_to_sorted, src1_sorted.ptr, type_src1_sorted, + ne10, nb11, nb12, nb13, + ne_get_rows, 1, 1, sizeof(int32_t), ne_get_rows*sizeof(int32_t), ne_get_rows*sizeof(int32_t), + ne10*ts_src1_sorted, ne_get_rows*ne10*ts_src1_sorted, ne_get_rows*ne10*ts_src1_sorted, stream); + CUDA_CHECK(cudaGetLastError()); + + if (ggml_cuda_w4a16_moe_grouped_should_engage(src0, src1, dst, cc)) { + ggml_cuda_mul_mat_id_w4a16_grouped(ctx, src0, + (const float *) src1_sorted.ptr, (float *) dst_sorted.ptr, + tokens_per_expert.data(), ne02, ne10, ne0, stream); + } else { + // existing per-expert loop remains here unchanged + } + } +``` + +Do not leave two `get_rows_cuda` calls in the direct path. + +- [x] **Step 2: Verify default path** + +Run: + +```bash +cd /home/mudler/_git/llama.cpp +cmake --build build --target test-cuda-w4a16-policy llama-batched-bench -j2 +./build/bin/test-cuda-w4a16-policy +``` + +Expected: build and policy test pass. Do not run `LLAMA_W4A16_DIRECT_A=1` yet; the stub must abort if selected. 
+ +Actual local verification: + +```bash +cd /home/mudler/_git/llama.cpp +git diff --check +cmake --build build --target test-cuda-w4a16-policy llama-batched-bench -j2 +./build/bin/test-cuda-w4a16-policy +``` + +Result: `test-cuda-w4a16-policy: OK`. + +Actual DGX default inference safety gates with the Task 3/4 cumulative patch +applied to `~/llama-phase6-source`: + +- Artifact: `/home/mudler/bench/phase61_task34_gates/20260701_131210` +- MoE md5: `8cb0ce23777bf55f92f63d0292c756b0` +- dense md5: `5951a5b4d624ce891e22ab5fca9bc439` +- `MUL_MAT`: `1146/1146` +- `MUL_MAT_ID`: `806/806` +- Remote mirror cleanup: `/tmp/localai-gpu.lock` released as + `FREE phase61-task34-gates 20260701T111317Z`. + +Fork commit: + +- `7967ad47f feat(cuda): route W4A16 direct activation stub` + +## Task 5: Implement Direct-A Kernel + +**Files:** + +- Modify: `/home/mudler/_git/llama.cpp/ggml/src/ggml-cuda/w4a16-gemm.cu` + +- [x] **Step 1: Add the direct kernel variant** + +Copy `w4a16_grouped_kernel` into a new template named `w4a16_grouped_direct_a_kernel`. Change only the A-load section: + +```cpp +const int32_t src_row = ids_to_sorted[row0 + r]; +const int64_t token = src_row / n_expert_used; +const int64_t slot = src_row - token * n_expert_used; +const char * src_base = ((const char *) src1) + token * src1_nb2 + slot * src1_nb1; +const float * src = (const float *) (src_base + (int64_t) kt * BK * sizeof(float) + c * 8 * sizeof(float)); +``` + +Load eight f32 values, convert to bf16, and store into `sA[st]`. This replaces the old `cp.async` A load because the source conversion is no longer a raw copy: + +```cpp +nv_bfloat16 tmp[8]; +#pragma unroll +for (int q = 0; q < 8; ++q) { + tmp[q] = __float2bfloat16(src[q]); +} +uint4 packed = *reinterpret_cast<const uint4 *>(tmp); +*reinterpret_cast<uint4 *>(((char *) sA[st]) + (r*ASTR + c*4)*sizeof(uint32_t)) = packed; +``` + +Keep W `cp.async` unchanged. 
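The convert-and-pack step in Step 1 can be sanity-checked on the host without a GPU. The following is a minimal sketch, assuming the device conversion rounds to nearest even (the documented behavior of `__float2bfloat16`). The names `f32_to_bf16_bits`, `packed8`, and `pack8_bf16` are hypothetical and exist only for this illustration; they are not part of the fork.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical host-side emulation of an f32 -> bf16 conversion with
// round-to-nearest-even. NaN inputs are mapped to a quiet NaN.
static inline uint16_t f32_to_bf16_bits(float f) {
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof(bits));
    if ((bits & 0x7fffffffu) > 0x7f800000u) {
        // NaN: keep the sign and exponent, force the quiet bit.
        return static_cast<uint16_t>((bits >> 16) | 0x0040u);
    }
    // Add 0x7fff plus the lowest kept bit so that exact ties round to even.
    const uint32_t rounding_bias = 0x7fffu + ((bits >> 16) & 1u);
    return static_cast<uint16_t>((bits + rounding_bias) >> 16);
}

// Eight bf16 values occupy 16 bytes, the same footprint as one uint4 store.
struct packed8 {
    uint32_t w[4];
};

// Hypothetical stand-in for the "load eight f32, convert, store 16 bytes" step.
static inline packed8 pack8_bf16(const float * src) {
    uint16_t tmp[8];
    for (int q = 0; q < 8; ++q) {
        tmp[q] = f32_to_bf16_bits(src[q]);
    }
    packed8 out;
    static_assert(sizeof(out) == sizeof(tmp), "8 bf16 values must fill 16 bytes");
    std::memcpy(&out, tmp, sizeof(out));
    return out;
}
```

Because the conversion changes the bit pattern, the A tile can no longer be moved with a raw asynchronous copy, which is why Step 1 replaces the `cp.async` A load with an explicit load, convert, and store.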
+ +- [x] **Step 2: Wire the direct launcher to the new kernel** + +Replace the stub body with a launcher that mirrors `ggml_cuda_mul_mat_id_w4a16_grouped_impl`, but: + +- does not allocate `Abf`; +- does not call `w4a16_cast_act_f32_bf16`; +- passes `src1`, `ids_to_sorted`, `n_expert_used`, `src1_nb1`, and `src1_nb2` into the direct kernel. + +- [x] **Step 3: Build** + +Run: + +```bash +cd /home/mudler/_git/llama.cpp +cmake --build build --target test-cuda-w4a16-policy test-backend-ops llama-batched-bench llama-completion -j2 +./build/bin/test-cuda-w4a16-policy +``` + +Expected: build succeeds and policy test passes. + +Actual implementation note: the first direct kernel decoded `ids_to_sorted` as +`token = src_row / n_expert_used` and `slot = src_row % n_expert_used`. That was +wrong for `b=1` backend-op shapes. The existing `get_rows_cuda` call treats +`ids_to_sorted` as a flat row index and addresses `src1 + src_row*nb11`, so the +working direct kernel used the same flat-row addressing. The first forced +direct-A gate failed `794/806`; the flat-row fix passed `806/806`. + +Actual local build: + +```bash +cd /home/mudler/_git/llama.cpp +git diff --check +cmake --build build --target test-cuda-w4a16-policy -j2 +./build/bin/test-cuda-w4a16-policy +``` + +Result: `test-cuda-w4a16-policy: OK`. + +## Task 6: Local CUDA Correctness Gate + +**Files:** none. + +- [x] **Step 1: Run forced W4A16 direct-A op gate** + +Run on a CUDA host: + +```bash +cd /home/mudler/_git/llama.cpp +LLAMA_W4A16_PREFILL_M=1 LLAMA_W4A16_DIRECT_A=1 ./build/bin/test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1 +``` + +Expected: `806/806 tests passed`. + +Actual RED before implementation: abort at +`LLAMA_W4A16_DIRECT_A selected before direct-A kernel implementation`, as +expected. + +Actual GREEN on DGX after flat-row fix: + +- `LLAMA_W4A16_PREFILL_M=1 LLAMA_W4A16_DIRECT_A=1 test-backend-ops ... MUL_MAT_ID` +- Result: `806/806 tests passed`, `Backend CUDA0: OK`. 
+- Cleanup lock: `FREE phase61-direct-kernel-gate2 20260701T112013Z`. + +- [x] **Step 2: Run default op gate** + +Run: + +```bash +cd /home/mudler/_git/llama.cpp +./build/bin/test-backend-ops test -b CUDA0 -o MUL_MAT_ID -j 1 +``` + +Expected: `806/806 tests passed`. + +Actual default-path gate was run as part of the full default inference gate in +Task 7: `MUL_MAT_ID` `806/806`. + +## Task 7: DGX Inference and Performance Gate + +**Files:** none. + +- [x] **Step 1: Preflight DGX** + +Run: + +```bash +ssh dgx.casa 'echo docker=$(docker ps -q | wc -l); echo compute=$(nvidia-smi --query-compute-apps=pid --format=csv,noheader | sed "/^$/d" | wc -l); cat /tmp/localai-gpu.lock 2>/dev/null || true; pgrep -af "[l]ocal-ai-worker|[v]llm|[l]lama-server" || true' +``` + +Expected: Docker `0`, compute `0`, lock `FREE*`, and no worker/server process. + +Actual: DGX checks were clean before the phase, and each run acquired/released +`/tmp/localai-gpu.lock`. + +- [x] **Step 2: Apply patch to clean DGX mirror and build** + +Use the fork diff for this one patch only, apply it to `~/llama-phase6-source`, and build `build-cuda`. Do not leave the DGX mirror dirty after the phase. + +Actual: the cumulative fork diff was applied to `~/llama-phase6-source` for each +DGX gate and reverted by cleanup traps. The final mirror status was clean. 
+ +- [x] **Step 3: Run pre gates** + +Run the canonical MoE/dense md5 and `MUL_MAT`/`MUL_MAT_ID` gates: + +```bash +LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GGML_NO_BACKTRACE=1 ./llama-completion -m /home/mudler/bench/q36-35b-a3b-nvfp4.gguf -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1 /tmp/phase61-w4a16-direct-a-rejected.diff +git restore ggml/src/ggml-cuda/w4a16-gemm.cuh ggml/src/ggml-cuda/w4a16-gemm.cu ggml/src/ggml-cuda/ggml-cuda.cu tests/CMakeLists.txt +rm -f ggml/src/ggml-cuda/w4a16-policy.h tests/test-cuda-w4a16-policy.cpp +git status --short +``` + +Actual: saved rejected local diff to +`/tmp/phase61-w4a16-direct-a-rejected.diff` and reverted it. The fork remains at +committed routing-stub HEAD `7967ad47f`; the direct kernel implementation was +not committed. + +- [x] **Step 3: Update LocalAI docs** + +Create `docs/superpowers/plans/2026-07-01-w4a16-direct-activation-phase61-result.md` with the artifact path, gate table, A/B table, and keep/reject decision. Update: + +- `backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md` +- `backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md` +- `backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md` + +Actual: created the Phase61 result file and updated the three parity docs with +the reject decision, artifacts, md5/op gates, A/B table, and direct-A +flat-row-addressing correction. 
+ +- [x] **Step 4: Commit LocalAI docs** + +Run: + +```bash +cd /home/mudler/_git/LocalAI/.claude/worktrees/feat+paged-attention +git add backend/cpp/llama-cpp-localai-paged/docs/GB10_PARITY_PHASE0_RESULTS.md backend/cpp/llama-cpp-localai-paged/docs/VLLM_PARITY_LEVER_MAP.md backend/cpp/llama-cpp-localai-paged/docs/PARITY_HANDOFF.md +git add -f docs/superpowers/plans/2026-07-01-w4a16-direct-activation-phase61-result.md +git commit -m "docs(paged): record W4A16 direct activation phase" -m "Assisted-by: Codex:gpt-5" +``` + +## Self-Review + +- Spec coverage: The plan addresses Phase60's measured W4A16 sorted-gather and cast overhead before any grouped-kernel-body rewrite. +- Placeholder scan: No `TBD`, `TODO`, or unspecified test commands remain. +- Type consistency: Helper names use the `ggml_cuda_w4a16_direct_a_*` prefix consistently across declaration, test, implementation, and route branch. diff --git a/docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md b/docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md new file mode 100644 index 000000000000..6ac3ea530705 --- /dev/null +++ b/docs/superpowers/specs/2026-07-01-gdn-global-ai-prototype-design.md @@ -0,0 +1,97 @@ +# GDN Global-Ai Prototype Design + +## Goal + +Prototype the only remaining plausible C32 GDN prefill path on GB10: compute +the per-chunk triangular inverse once into global f32 Ai scratch, then reuse it +from two `dv_tile=64` value-slab CTAs. + +## Scope + +The prototype is default-off and intentionally narrow: + +- `S_v=128` +- `BT=32` +- f32 Ai scratch +- two `dv_tile=64` value slabs +- non-KDA, final-state-only path matching the existing chunked M5 conditions +- no decode routing; `GDN_CHUNK_MIN` remains greater than 1 + +## Architecture + +The prototype splits current M5 work into two CUDA stages: + +1. `gdn_ai32_cuda`: one CTA per `(sequence, head, chunk)` computes the C32 + chunk-local triangular inverse `Ai = A^-1` and writes `[BT, BT]` f32 scratch. +2. 
`gdn_chunked_ai32_cuda`: one CTA per `(sequence, head, value slab)` loads Ai + for each chunk and performs the value-dependent work for its 64 output + columns. + +This mirrors the portable scheduling idea from vLLM/FLA without importing +CuteDSL, TMA, or BF16 storage. It directly tests whether sharing A/Ai across +slabs can beat the duplicated work that rejected Phase 10. + +## Scratch + +Ai scratch is sized: + +```text +n_seqs * H * ceil(n_tokens / 32) * 32 * 32 * sizeof(float) +``` + +At `npp=2048,npl=32`, this is: + +- MoE H=32: 256 MiB. +- Dense H=48: 384 MiB. + +Scratch allocation must use the existing ggml CUDA pool, be scoped to the op, +and be default-off behind an explicit env selector. + +## Selector + +Use: + +```text +GDN_GLOBAL_AI32=1 +``` + +The default path remains current C16 M5. The candidate only engages when: + +- `S_v == 128` +- `n_tokens >= GDN_CHUNK_MIN` +- `!KDA && !keep_rs_t` +- `GDN_GLOBAL_AI32=1` + +## Correctness + +The first implementation uses f32 Ai to maximize chances of md5 stability. It +must pass: + +- `test-backend-ops -b CUDA0 -o GATED_DELTA_NET` +- MoE md5 `8cb0ce23777bf55f92f63d0292c756b0` +- Dense md5 `5951a5b4d624ce891e22ab5fca9bc439` + +If md5 changes, the prototype must stop for KL before any performance claim. + +## Performance + +Compare same-session against current M5: + +```text +LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 +``` + +versus: + +```text +LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1 GDN_TC=5 GDN_CHUNK_MIN=64 GDN_GLOBAL_AI32=1 +``` + +Run MoE and dense at `npp=512,2048`, `ntg=4`, `npl=32`. + +## Decision Rule + +Accept only if the prototype is correctness-safe and improves end-to-end S_PP. +Reject if it is flat or slower. If rejected, save the diff under +`/home/mudler/bench/phase13_gdn_global_ai32/rejected/` and do not add a LocalAI +patch. 
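The scratch formula in the Scratch section can be checked with a few lines of host code. This is a sketch only: `gdn_ai_scratch_bytes` is a hypothetical helper name, and treating the chunk size as a parameter `bt` (rather than the fixed 32 in the design) is an assumption made for illustration.

```cpp
#include <cstdint>

// Hypothetical helper: bytes of f32 Ai scratch for one GDN op.
// Mirrors n_seqs * H * ceil(n_tokens / BT) * BT * BT * sizeof(float).
static inline int64_t gdn_ai_scratch_bytes(int64_t n_seqs, int64_t n_heads,
                                           int64_t n_tokens, int64_t bt) {
    const int64_t n_chunks = (n_tokens + bt - 1) / bt; // ceil division
    return n_seqs * n_heads * n_chunks * bt * bt * static_cast<int64_t>(sizeof(float));
}

// Convenience conversion used when comparing against the MiB figures above.
static inline double bytes_to_mib(int64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}
```

With `n_seqs=32`, `n_tokens=2048`, and `bt=32`, the helper gives 256 MiB for H=32 and 384 MiB for H=48, matching the figures quoted in the design.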
diff --git a/docs/superpowers/specs/2026-07-01-gdn-m5-state-boundary-design.md b/docs/superpowers/specs/2026-07-01-gdn-m5-state-boundary-design.md new file mode 100644 index 000000000000..07fb5c40b639 --- /dev/null +++ b/docs/superpowers/specs/2026-07-01-gdn-m5-state-boundary-design.md @@ -0,0 +1,78 @@ +# GDN M5 State-Boundary Design + +## Context + +Phase 10 tested a default-off C32 slabbed M5 path for +`ggml/src/ggml-cuda/gated_delta_net.cu`. It was correctness-clean only after +zeroing staged tail rows, then failed the performance gate: + +| Model | PP | M5 S_PP t/s | C32 slab S_PP t/s | +|-------|----|-------------|-------------------| +| MoE | 2048 | 2430.32 | 2054.86 | +| Dense | 2048 | 1019.25 | 903.73 | + +The likely root cause is duplicated A/T work per value slab. vLLM/FLA computes +the per-chunk triangular object once and reuses it through the WY transform; the +two-slab M5 shortcut could not do that without a larger scratch/precompute +design. + +## Design Choice + +Phase 11 stays at the shipped C=16 M5 geometry and tests a smaller C=16 +state-boundary variant before reopening chunk-size changes. The candidate +targets the two tensor-core state-boundary products that both multiply a chunk +matrix by the same pre-update state `Sd`: + +- `KS = Kc * S0`, currently used to form `Ud`. +- `QS = Qc * S0`, currently deposited later as the cross-chunk output term. + +The first implementation should be default-off and selected by an explicit env +var such as `GDN_M5_QS_EARLY=1`. It should not change the default `GDN_TC=5` +path until it clears correctness and performance gates. + +## Candidate Shape + +The low-conflict version moves the QS state-boundary pass earlier in the C=16 +M5 chunk loop and stores `gamma_t * QS[t][j]` in `attn_base` before the solve. +The later output section then reuses the predeposited cross-chunk term exactly +as it does today. + +This does not yet fuse K and Q in a single MMA instruction. 
It tests whether +moving QS earlier and tightening the state-boundary scheduling helps without +new global scratch, changed state ownership, or C32 slab duplication. If it is +flat, Phase 11 should be rejected quickly and the next GDN work should be a +larger shared-A/Ai design rather than more local scheduling. + +## Non-Goals + +- Do not reintroduce C32 slabs. +- Do not add global A/Ai scratch in this phase. +- Do not import vLLM CuteDSL/TMA kernels. +- Do not route decode into chunked GDN; keep `GDN_CHUNK_MIN > 1`. +- Do not change default inference behavior unless gates prove a win. + +## Gates + +Correctness: + +- Build `test-backend-ops` and `llama-completion` on DGX. +- Run default and forced-candidate `GATED_DELTA_NET` CUDA0 gates. +- Run canonical MoE and dense greedy md5 gates: + - MoE: `8cb0ce23777bf55f92f63d0292c756b0`. + - Dense: `5951a5b4d624ce891e22ab5fca9bc439`. +- If md5 changes, stop and run the existing KL gate before any performance + claim. + +Performance: + +- Compare against current M5 with `GDN_TC=5 GDN_CHUNK_MIN=64`. +- Run MoE and dense `llama-batched-bench` at `npp=512,2048`, `ntg=4`, + `npl=32`. +- Reject if either model regresses outside noise or if the GDN bucket does not + improve in a profile-gated follow-up. + +## Decision Rule + +Accept only if the candidate is md5/KL-safe and improves end-to-end S_PP. If it +is flat or slower, record it as rejected and move to a larger shared-A/Ai +blocked-solve design, likely requiring a separate scratch/precompute phase. 
diff --git a/docs/superpowers/specs/2026-07-01-gdn-shared-ai-cost-model-design.md b/docs/superpowers/specs/2026-07-01-gdn-shared-ai-cost-model-design.md new file mode 100644 index 000000000000..165dbe049214 --- /dev/null +++ b/docs/superpowers/specs/2026-07-01-gdn-shared-ai-cost-model-design.md @@ -0,0 +1,108 @@ +# GDN Shared-A/Ai Cost Model Design + +## Context + +The last two GDN experiments closed the low-conflict shortcut space: + +- Phase 10 C32 slab M5 was md5-clean after tail-row zeroing but slower because + each value slab recomputed the per-chunk triangular work. +- Phase 11 QS-early M5 was md5-clean but still slower because moving `QS` did + not remove a tensor-core pass. + +The remaining algorithmic gap to vLLM/FLA is not another local reorder. vLLM +builds the per-chunk triangular object once, solves/inverts it once, and reuses +that result across the WY transform. llama.cpp's current C=16 M5 already +computes A/T once for the full value width inside one CTA. A wider chunk only +fits on GB10 if value columns are split into slabs, and slabs lose unless A/T +is shared across them. + +## Current Geometry + +For `S_v = 128` and f32 state: + +| Shape | Dynamic smem | +|-------|--------------| +| C16 full value width | 93,376 B / 91.19 KiB | +| C32 full value width | 127,360 B / 124.38 KiB | +| C32 with `dv_tile=64` plus U staging | 94,592 B / 92.38 KiB | + +GB10's available dynamic smem leaves enough room for C16 full-width and C32 +half-width, but not for C32 full-width. That makes a shared-A/Ai design the only +plausible C32 path. + +## Candidate Approaches + +### A. Global A/Ai Scratch Precompute + +Add a first kernel that computes `A` and `Ai` once per `(sequence, head, chunk)` +and materializes `Ai` in global scratch. A second kernel consumes `Ai` across +value slabs. + +Pros: + +- Directly targets the Phase 10 failure mode. +- Mirrors the portable part of vLLM/FLA's schedule. +- Keeps each value-slab CTA within the GB10 smem limit. 
+ +Cons: + +- Adds at least one extra kernel boundary. +- Requires scratch allocation and lifetime management in ggml CUDA. +- Scratch is large at real batch sizes. At `npl=32`, `BT=32`, f32 Ai costs: + - H=40, T=2048: 320 MiB. + - H=48, T=2048: 384 MiB. + - H=64, T=2048: 512 MiB. +- Needs careful profiling because global scratch traffic can erase the saved + triangular recomputation. + +### B. Shared A/Ai Inside One CTA With Reduced State Residency + +Keep C32 in one CTA by moving some state or value scratch out of shared memory. + +Pros: + +- Avoids global Ai scratch and cross-kernel synchronization. +- Could keep the current single-kernel structure. + +Cons: + +- The f32 state alone is 64 KiB. Removing enough shared memory for C32 full + width likely means reading state from global during MMA tiles or reducing + state residency, which attacks the current M5 strength. +- Higher risk of lowering achieved bandwidth and breaking md5 via new ordering. + +### C. Stay C16 and Stop GDN Kernel Work on GB10 + +Accept C16 M5 as the local GB10 ceiling and redirect parity work to another +bucket or different hardware. + +Pros: + +- Avoids high-risk scratch and synchronization work. +- Matches Phase 10/11 evidence that shortcuts are now exhausted. + +Cons: + +- Leaves the GDN prefill gap open. +- Does not move toward vLLM prefill parity on GB10. + +## Recommended Phase 12 + +Run a cost-model and dry-design phase before any source patch. The phase should +produce a go/no-go decision for Approach A: + +1. Extract actual GDN head counts and chunk counts for the MoE and dense GGUFs. +2. Compute scratch sizes for `BT=32` and `BT=64` at the benchmark shapes. +3. Estimate extra global traffic: Ai write + Ai read per value slab. +4. Compare that traffic against the triangular recomputation saved by sharing + A/Ai across slabs. +5. Only if the model is plausible, write a Phase 13 implementation plan for a + default-off global-scratch prototype. 
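Steps 2 and 3 of the list above can be sketched as a small host-side cost model. Everything here is an assumption for illustration: `gdn_ai_cost` and `gdn_ai_cost_estimate` are hypothetical names, and the traffic term assumes Ai is written exactly once and read exactly once per value slab with no cache reuse, which a real profile may contradict.

```cpp
#include <cstdint>

// Hypothetical result of the dry cost model.
struct gdn_ai_cost {
    int64_t scratch_bytes; // resident Ai scratch
    int64_t traffic_bytes; // one Ai write plus one Ai read per value slab
};

// Hypothetical estimator. Assumes f32 Ai tiles of size bt x bt per
// (sequence, head, chunk) and no cache hits on the slab reads.
static inline gdn_ai_cost gdn_ai_cost_estimate(int64_t n_seqs, int64_t n_heads,
                                               int64_t n_tokens, int64_t bt,
                                               int64_t n_slabs) {
    const int64_t n_chunks = (n_tokens + bt - 1) / bt;
    const int64_t scratch  = n_seqs * n_heads * n_chunks * bt * bt *
                             static_cast<int64_t>(sizeof(float));
    return { scratch, scratch * (1 + n_slabs) };
}
```

Under these assumptions, H=48 at `npl=32`, T=2048, `BT=32` with two slabs moves about three times the 384 MiB scratch through global memory per op, and doubling `BT` to 64 doubles the scratch because the tile area quadruples while the chunk count halves. The go/no-go decision would weigh that traffic against the triangular work saved.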
+ +## Decision Rule + +Proceed to implementation only if the model shows a credible net win at +`npp=2048, npl=32` without unreasonable memory growth. If the estimated scratch +traffic or kernel-boundary overhead is close to the saved work, record a no-go +and stop GDN kernel work on GB10 rather than adding a large patch that is likely +to be rejected. diff --git a/docs/superpowers/specs/2026-07-01-mtp-rollback-serving-gates-design.md b/docs/superpowers/specs/2026-07-01-mtp-rollback-serving-gates-design.md new file mode 100644 index 000000000000..cf76ed130897 --- /dev/null +++ b/docs/superpowers/specs/2026-07-01-mtp-rollback-serving-gates-design.md @@ -0,0 +1,89 @@ +# MTP Rollback and Serving Gates Design + +## Goal + +Move MTP speculative decoding from a smoke-only Phase 9 result to a gated +parity workstream by proving that Qwen3.6 hybrid recurrent state can be rolled +back safely under speculative rejection. + +This phase does not enable MTP by default and does not count MTP as a speed +win. It creates the evidence required before any serving benchmark can be +interpreted as valid. + +## Current Evidence + +Phase 9 proved that: + +- The MoE GGUF contains Qwen3.6 `nextn` tensors. +- `draft-mtp` can run with the current model after backend draft sampling is + disabled for MTP. +- Normal MoE and dense transcript md5 gates remain canonical. + +The missing proof is that speculative rejection restores both memory systems: + +- paged attention KV state, +- gated-DeltaNet recurrent state, including `n_rs_seq` snapshot rollback. + +## Existing Mechanism + +The current fork already contains the mechanism this phase should validate: + +- `common_params_speculative::need_n_rs_seq()` requests recurrent snapshots for + `draft-mtp` and `draft-eagle3`. +- Qwen3.5/Qwen3.6 architectures advertise recurrent rollback support through + `llm_arch_supports_rs_rollback()`. 
+- `llama_memory_recurrent::seq_rm()` can roll back within the bounded + `n_rs_seq` window by selecting an older recurrent-state snapshot. +- `tests/test-recurrent-state-rollback.cpp` verifies snapshot save/restore and + dirty-context cleanup for recurrent models. + +## Phase 14 Gates + +Phase 14 has three gates: + +1. **Rollback mechanism gate.** Build and run `test-recurrent-state-rollback` + against `/home/mudler/bench/q36-35b-a3b-nvfp4.gguf` on DGX. This proves the + actual model can restore recurrent snapshots and replay logits. +2. **MTP greedy-equivalence gate.** Run baseline greedy completion and MTP + speculative completion on the same prompt/seed and compare normalized raw + text. Exact transcript md5 is only valid when the same frontend emits the + same number of generated tokens. `llama-speculative-simple` commits accepted + token groups, so its output can be longer than `llama-completion -no-cnv` + for the same `-n`. Treat the gate as a safety pass only if one normalized + output is a prefix of the other and there is no first differing token. +3. **MTP partial-rejection gate.** Run an MTP configuration that drafts more + than one token and records `n_drafted > n_accept`, while still matching + greedy output. This proves rejection happened and did not corrupt + inferencing state. + +## Source Policy + +Do not add a production source patch in this phase unless one of the gates fails +and the root cause is isolated. If all gates pass, record the evidence and then +scope a separate serving/API benchmark phase. + +If a source patch is required, it must be fork-first, default-off or +test-only, and must pass: + +- MoE transcript md5 `8cb0ce23777bf55f92f63d0292c756b0`. +- Dense transcript md5 `5951a5b4d624ce891e22ab5fca9bc439`. +- `test-recurrent-state-rollback` on the actual MoE GGUF. +- The MTP greedy-equivalence and partial-rejection gates. 
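The prefix rule in the MTP greedy-equivalence gate can be sketched as a small comparison helper. This is an illustration under stated assumptions: `strip_leading_newlines` and `mtp_greedy_prefix_equivalent` are hypothetical names, the comparison is done on raw text bytes rather than tokens, and treating an empty output as a failure is an extra assumption that the spec does not state.

```cpp
#include <string>

// Hypothetical normalization: drop leading newline characters that the
// example frontends may emit before the first generated token.
static inline std::string strip_leading_newlines(const std::string & s) {
    const std::string::size_type pos = s.find_first_not_of("\r\n");
    return pos == std::string::npos ? std::string() : s.substr(pos);
}

// Hypothetical gate check: pass only if one normalized output is a prefix of
// the other, so the two runs never diverge over their common length.
static inline bool mtp_greedy_prefix_equivalent(const std::string & baseline,
                                                const std::string & speculative) {
    const std::string a = strip_leading_newlines(baseline);
    const std::string b = strip_leading_newlines(speculative);
    const std::string & shorter = a.size() <= b.size() ? a : b;
    const std::string & longer  = a.size() <= b.size() ? b : a;
    if (shorter.empty()) {
        return false; // assumption: an empty output proves nothing
    }
    return longer.compare(0, shorter.size(), shorter) == 0;
}
```

A prefix check rather than an md5 comparison is what makes the gate tolerant of `llama-speculative-simple` committing whole accepted token groups and therefore emitting more text than `llama-completion -no-cnv` for the same `-n`.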
+ +## Stop Conditions + +Stop and do not benchmark MTP for speed if: + +- rollback test fails, +- MTP output differs from greedy baseline at `temp=0` after normalizing the + example frontend's leading newlines, +- no run can produce both `n_drafted > 0` and `n_drafted > n_accept`, +- any run requires backend draft sampling for MTP, +- DGX is not free of docker containers, `local-ai-worker`, and GPU compute + processes. + +## Follow-up + +Only after Phase 14 passes should Phase 15 measure serving/API throughput. +Phase 15 must compare non-spec serving against MTP serving with the same prompt +shape, request count, seed behavior, and canonical inference gates. diff --git a/docs/superpowers/specs/2026-07-01-small-m-mmq-phase32.md b/docs/superpowers/specs/2026-07-01-small-m-mmq-phase32.md new file mode 100644 index 000000000000..58e8cd2c7d1d --- /dev/null +++ b/docs/superpowers/specs/2026-07-01-small-m-mmq-phase32.md @@ -0,0 +1,101 @@ +# Small-M MoE MMQ Phase 32 Spec + +## Problem + +Phase 30 proved n128 serving feeds grouped-MMQ with small decode-like +per-expert shapes (`ncols_max <= 128`, density `1-4`, selected `mmq_x <= 64`). +Phase 31 proved the obvious launch-policy shortcut is not the issue: in live +n128 serving, all traced decode-like and prefill-like launch lines had +`fixup=0` and `stream_k_blocks == ntiles_dst`. + +The remaining grouped-MMQ gap is therefore structural small-M kernel shape: +the kernel is already launched without fixup overhead, but the work inside each +expert tile still pays for padded, low-density token columns. + +## Constraints + +- Preserve default behavior unless an explicit experimental env/build knob is + set. +- Keep the patch stack incremental: add helpers or alternate launch branches + instead of rewriting existing MMQ templates. +- Prefer host-side selection shortcuts and small helper functions over broad + template refactors, to reduce upstream conflict risk. 
+- Every source change must be gated by: + - `test-cuda-mmq-shape-trace` or a new host/unit test for selector behavior. + - DGX CUDA build of `llama-server`, `llama-completion`, `test-backend-ops`. + - Default-off MoE md5 `8cb0ce23777bf55f92f63d0292c756b0`. + - Default-off dense md5 `5951a5b4d624ce891e22ab5fca9bc439`. + - `MUL_MAT_ID` `806/806`. + - Trace/knob-enabled md5/op gate when the experiment is expected to be + numerically identical. + +## Rejected By Evidence + +- No-fixup/no-stream-k shortcut: Phase 31 n128 serving had decode-like + `4800/4800` and prefill-like `4920/4920` launch lines with `fixup=0` and + `stream_k_blocks == ntiles_dst`. +- Build-time MMQ occupancy shortcuts: Phase 28 rejected `GGML_CUDA_FP4_MINBLOCKS=2` + as slower and `GGML_CUDA_FP4_MMQ_Y=64` as compile-invalid for NVFP4 writeback. + +## Candidate Directions + +### A. Exact Expert Histogram Trace + +Add a default-off diagnostic that records exact per-expert segment lengths after +`expert_bounds` is available. This requires care because device-to-host readback +can synchronize the stream and perturb serving; it should run only in a +standalone diagnostic path, never in normal serving gates. + +Use this only if selector estimates are insufficient for designing the next +kernel. + +### B. Decode-Only Alternative Small-M Kernel Hook + +Add an opt-in branch for grouped MoE NVFP4 decode-like shapes: + +- `args.expert_bounds != nullptr` +- `type == GGML_TYPE_NVFP4` +- `args.ncols_max <= 128` +- estimated density `<= 4` +- selected `mmq_x <= 64` + +The first implementation should be a compile-time skeleton or dispatch counter, +not a numeric kernel, unless the exact implementation can be tested against +`MUL_MAT_ID` in isolation. The gate is a new `test-backend-ops` case covering +ragged MoE decode shapes before serving A/B. + +### C. W4A16 / Marlin-Style Decode Probe + +Re-use the existing W4A16 scaffolding only as a separately gated probe. 
Prior +decode W4A16 work was rejected as bandwidth-bound, while prefill remains the +higher-EV W4A16 target. Do not mix this with the small-M MMQ branch unless a +new in-backend A/B shows decode benefit. + +## Recommended Phase 32 Deliverable + +Do not jump straight to a large kernel. The next deliverable should be a small, +default-off dispatch classification patch: + +1. Factor the Phase 30/31 decode-like predicate into a host helper. +2. Add a test proving the helper selects only small-M grouped MoE NVFP4 shapes + and excludes prefill. +3. Add a bounded log/counter prefix such as `[LLAMA_MOE_MMQ_SMALL_M]` under the + existing trace knob or a more specific `LLAMA_MOE_MMQ_SMALL_M_TRACE`. +4. Re-run n128 serving to verify the candidate branch population before any + numeric kernel work. + +This keeps the next patch additive, md5-safe, and low-conflict while giving a +hard count for the future structural branch. + +## Subagent Findings Folded In + +- llama.cpp path: `ggml_cuda_mul_mat_id` routes quantized MoE to grouped MMQ via + `ggml_cuda_should_use_mmq`; `mmq_args` carries `expert_bounds`, `ids_dst`, + `ncols_dst=ne12*n_expert_used`, `nchannels_x=ne02`, and `ncols_max=ne12`. +- The tile selector in `mul_mat_q_case` is the correct low-conflict hook: + `LLAMA_MOE_MMQ_X`, `LLAMA_MOE_AUTO_TILE`, `LLAMA_MOE_DECODE_TILE`, and + `LLAMA_MOE_DENSITY_MAX` already prove this branch can be changed host-side. +- vLLM's useful GB10-compatible idea is small expert `block_size_m` selection + (`8/16` for low-density routed rows), not TMA/tcgen05/Triton/CUTLASS paths. +- Phase 32 should therefore add a default-off candidate classifier and trace, + then use the measured candidate count to decide whether to A/B `mmq_x=8/16`. 
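The decode-like predicate that the recommended deliverable factors into a host helper can be sketched as a pure function. The struct and function names below are hypothetical and the struct is a simplified stand-in for the real `mmq_args`; in particular, how the density estimate is computed is left to the caller because the spec does not define it.

```cpp
#include <cstdint>

// Hypothetical, simplified view of the fields the classifier would inspect.
struct small_m_probe {
    bool    has_expert_bounds; // stands in for args.expert_bounds != nullptr
    bool    is_nvfp4;          // stands in for type == GGML_TYPE_NVFP4
    int64_t ncols_max;         // args.ncols_max
    int64_t density;           // caller-supplied density estimate
    int     mmq_x;             // tile width chosen by the selector
};

// Hypothetical classifier: true only for grouped MoE NVFP4 decode-like shapes,
// using the thresholds listed under Candidate Direction B.
static inline bool moe_mmq_is_small_m_candidate(const small_m_probe & p) {
    return p.has_expert_bounds &&
           p.is_nvfp4 &&
           p.ncols_max <= 128 &&
           p.density <= 4 &&
           p.mmq_x <= 64;
}
```

Keeping the predicate free of CUDA types lets a host unit test cover it without a GPU, in the same way the W4A16 policy helper is tested.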
diff --git a/gallery/index.yaml b/gallery/index.yaml index a5256840560f..df0a1065a512 100644 --- a/gallery/index.yaml +++ b/gallery/index.yaml @@ -1,4 +1,223 @@ --- +# ============================================================================= +# NVFP4 Qwen3.6 (dense + MoE) for the LocalAI paged-attention llama.cpp backend. +# These reproduce the GB10 / DGX Spark benchmark serving config (see +# backend/cpp/llama-cpp-localai-paged/docs/LOCALAI_LLAMACPP_BACKEND_PLAN.md section 2). +# +# PUBLISHED: the dense + MoE base NVFP4 GGUFs are live at huggingface.co/mudler/ +# Qwen3.6-27B-NVFP4-GGUF and .../Qwen3.6-35B-A3B-NVFP4-GGUF (file_type MOSTLY_NVFP4); +# the sha256 below were verified against the Hub LFS hash and the uris resolve (200). +# Converted from the unsloth/nvidia NVFP4 sources via llama.cpp --outtype auto. +# +# NOTE(NVFP4 read): the paged backend (pinned llama.cpp c299a92c) reads NVFP4 GGUF +# (the GB10 benchmark + the pin-sync md5 gate both ran NVFP4 GGUFs). These gallery +# GGUFs were re-quantized with a newer convert (origin/master) preserving the same +# MOSTLY_NVFP4 format; a load check on the paged backend GPU build is the final gate. +# +# The two NVFP4 entries below are bit-exact (f32 SSM state). The opt-in +# reduced-precision hybrid SSM-state lever (ssm_bf16_tau, patch 0026) was DROPPED: +# clean measurements showed it flat once the decode fusions landed (forcing all +# gated-DeltaNet heads to bf16 gave 780.6 vs 780.0 t/s, zero benefit) - see +# backend/cpp/llama-cpp-localai-paged/README.md section 5. +# ============================================================================= +- name: "qwen3.6-27b-nvfp4-paged" + url: "github:mudler/LocalAI/gallery/virtual.yaml@master" + urls: + - https://huggingface.co/mudler/Qwen3.6-27B-NVFP4-GGUF + description: | + Blackwell GPU recommended (native FP4-MMA). Runs on other hardware via NVFP4 dequant, but slower; the throughput figures below are GB10 / DGX Spark (consumer Blackwell). 
+ + Qwen3.6-27B dense, native Blackwell NVFP4 (FP4-MMA) GGUF. Configured for LocalAI's + paged-attention llama.cpp backend (llama-cpp-localai-paged): on-demand paged KV cache + plus a decode-first prefill budget. Benchmarked on GB10 / DGX Spark (consumer Blackwell) + at 90-117% of vLLM dense decode throughput at 1.5-3x lower memory (GB10-specific figures). + + Requires a llama.cpp new enough to read the NVFP4 GGUF tensor type (the paged backend's + upstream pin) - verify on a GPU box before relying on this entry. + license: "apache-2.0" + tags: + - llm + - gguf + - nvfp4 + - blackwell + - reasoning + icon: https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png + overrides: + backend: llama-cpp-localai-paged + f16: true + flash_attention: "on" + context_size: 131072 + gpu_layers: 99 + batch: 512 + known_usecases: + - chat + options: + - use_jinja:true + - paged_kv:true # LLAMA_KV_PAGED=1 + - max_batch_tokens:512 # LLAMA_MAX_BATCH_TOKENS=512 (decode-first QoS budget) + - kv_unified:false # per-slot paged capacity/memory benefit needs a per-sequence cache + - parallel:128 # 128 serving slots + parameters: + model: llama-cpp/models/Qwen3.6-27B-NVFP4-GGUF/q36-27b-nvfp4.gguf + template: + use_tokenizer_template: true + files: + - filename: llama-cpp/models/Qwen3.6-27B-NVFP4-GGUF/q36-27b-nvfp4.gguf + sha256: 2fdd857b13cbaa37b913d9566bf0a69443dcdb702e95694ca8d75236710575d4 + uri: https://huggingface.co/mudler/Qwen3.6-27B-NVFP4-GGUF/resolve/main/q36-27b-nvfp4.gguf +- name: "qwen3.6-35b-a3b-nvfp4-paged" + url: "github:mudler/LocalAI/gallery/virtual.yaml@master" + urls: + - https://huggingface.co/mudler/Qwen3.6-35B-A3B-NVFP4-GGUF + description: | + Blackwell GPU recommended (native FP4-MMA). Runs on other hardware via NVFP4 dequant, but slower; the throughput figures below are GB10 / DGX Spark (consumer Blackwell). + + Qwen3.6-35B-A3B MoE (~3B active), native Blackwell NVFP4 (FP4-MMA) GGUF. 
Configured for + LocalAI's paged-attention llama.cpp backend (llama-cpp-localai-paged): on-demand paged + KV cache plus a decode-first prefill budget. Lighter on memory than the dense 27B thanks + to the sparse MoE activation. + + Requires a llama.cpp new enough to read the NVFP4 GGUF tensor type (the paged backend's + upstream pin) - verify on a GPU box before relying on this entry. + license: "apache-2.0" + tags: + - llm + - gguf + - nvfp4 + - blackwell + - moe + - reasoning + icon: https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png + overrides: + backend: llama-cpp-localai-paged + f16: true + flash_attention: "on" + context_size: 131072 + gpu_layers: 99 + batch: 512 + known_usecases: + - chat + options: + - use_jinja:true + - paged_kv:true # LLAMA_KV_PAGED=1 + - max_batch_tokens:512 # decode-first budget; set 256 for max saturated MoE decode (sweep winner) + - kv_unified:false # per-slot paged capacity/memory benefit needs a per-sequence cache + - parallel:128 # 128 serving slots + parameters: + model: llama-cpp/models/Qwen3.6-35B-A3B-NVFP4-GGUF/q36-35b-a3b-nvfp4.gguf + template: + use_tokenizer_template: true + files: + - filename: llama-cpp/models/Qwen3.6-35B-A3B-NVFP4-GGUF/q36-35b-a3b-nvfp4.gguf + sha256: 1690d0424e232527b8bb135a38033e4699ad11817677eebacd40349020faea52 + uri: https://huggingface.co/mudler/Qwen3.6-35B-A3B-NVFP4-GGUF/resolve/main/q36-35b-a3b-nvfp4.gguf +- name: "qwen3.6-27b-nvfp4-mtp-paged" + url: "github:mudler/LocalAI/gallery/virtual.yaml@master" + urls: + - https://huggingface.co/michaelw9999/Qwen3.6-27B-NVFP4-MTP-GGUF + description: | + Blackwell GPU recommended (native FP4-MMA). Runs on other hardware via NVFP4 dequant, but slower; the throughput figures below are GB10 / DGX Spark (consumer Blackwell). 
+ + Qwen3.6-27B dense, native Blackwell NVFP4 (FP4-MMA) GGUF with a built-in MTP + (multi-token-prediction / speculative) draft head, configured for LocalAI's + paged-attention llama.cpp backend (llama-cpp-localai-paged): on-demand paged KV + cache plus a decode-first prefill budget. The MTP draft head accelerates decode + via self-speculation; ships with the recommended Qwen3.6 sampling defaults. + + Requires a llama.cpp new enough to read the NVFP4 GGUF tensor type (the paged + backend's upstream pin) - verify on a GPU box before relying on this entry. + license: "apache-2.0" + tags: + - llm + - gguf + - nvfp4 + - blackwell + - mtp + - reasoning + icon: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3.6/Figures/qwen3.6_27b_score.png + overrides: + backend: llama-cpp-localai-paged + f16: true + flash_attention: "on" + context_size: 131072 + gpu_layers: 99 + batch: 512 + known_usecases: + - chat + options: + - use_jinja:true + - paged_kv:true # LLAMA_KV_PAGED=1 + - max_batch_tokens:512 # LLAMA_MAX_BATCH_TOKENS=512 (decode-first QoS budget) + - kv_unified:false # per-slot paged capacity/memory benefit needs a per-sequence cache + - parallel:128 # 128 serving slots + parameters: + min_p: 0 + model: llama-cpp/models/Qwen3.6-27B-NVFP4-MTP-GGUF/Qwen3.6-27B-NVFP4-MTP-GGUF.gguf + presence_penalty: 1.5 + repeat_penalty: 1 + temperature: 0.7 + top_k: 20 + top_p: 0.8 + template: + use_tokenizer_template: true + files: + - filename: llama-cpp/models/Qwen3.6-27B-NVFP4-MTP-GGUF/Qwen3.6-27B-NVFP4-MTP-GGUF.gguf + sha256: d088e57e8c35ff62c2a420cb888dad3fd53c8db3ed9ead4286bd383224f81b50 + uri: https://huggingface.co/michaelw9999/Qwen3.6-27B-NVFP4-MTP-GGUF/resolve/main/Qwen3.6-27B-NVFP4-MTP-GGUF.gguf +- name: "qwen3.6-35b-a3b-nvfp4-mtp-paged" + url: "github:mudler/LocalAI/gallery/virtual.yaml@master" + urls: + - https://huggingface.co/michaelw9999/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF + description: | + Blackwell GPU recommended (native FP4-MMA). 
Runs on other hardware via NVFP4 dequant, but slower; the throughput figures below are GB10 / DGX Spark (consumer Blackwell). + + Qwen3.6-35B-A3B MoE (~3B active), native Blackwell NVFP4 (FP4-MMA) GGUF with a + built-in MTP (multi-token-prediction / speculative) draft head, configured for + LocalAI's paged-attention llama.cpp backend (llama-cpp-localai-paged): on-demand + paged KV cache plus a decode-first prefill budget. The MTP draft head accelerates + decode via self-speculation; ships with the recommended Qwen3.6 sampling defaults. + + Requires a llama.cpp new enough to read the NVFP4 GGUF tensor type (the paged + backend's upstream pin) - verify on a GPU box before relying on this entry. + license: "apache-2.0" + tags: + - llm + - gguf + - nvfp4 + - blackwell + - moe + - mtp + - reasoning + icon: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3.6/Figures/qwen3.6_35b_a3b_score.png + overrides: + backend: llama-cpp-localai-paged + f16: true + flash_attention: "on" + context_size: 131072 + gpu_layers: 99 + batch: 512 + known_usecases: + - chat + options: + - use_jinja:true + - paged_kv:true # LLAMA_KV_PAGED=1 + - max_batch_tokens:512 # decode-first budget; set 256 for max saturated MoE decode (sweep winner) + - kv_unified:false # per-slot paged capacity/memory benefit needs a per-sequence cache + - parallel:128 # 128 serving slots + parameters: + min_p: 0 + model: llama-cpp/models/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF/Qwen3.6-35B-A3B-NVFP4-MTP-TURBO.gguf + presence_penalty: 1.5 + repeat_penalty: 1 + temperature: 0.7 + top_k: 20 + top_p: 0.8 + template: + use_tokenizer_template: true + files: + - filename: llama-cpp/models/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF/Qwen3.6-35B-A3B-NVFP4-MTP-TURBO.gguf + sha256: f3d2fdc74e3ef19925ccbf794b04d7f6f11fb12eba7722b7749219d0cc5c36ed + uri: https://huggingface.co/michaelw9999/Qwen3.6-35B-A3B-NVFP4-MTP-GGUF/resolve/main/Qwen3.6-35B-A3B-NVFP4-MTP-TURBO.gguf - name: "qwen-agentworld-35b-a3b" url: 
"github:mudler/LocalAI/gallery/virtual.yaml@master" urls: @@ -664,6 +883,81 @@ - filename: llama-cpp/models/Qwopus3.6-27B-Coder-MTP-NVFP4-GGUF/Qwopus3.6-27B-Coder-MTP-NVFP4-TURBO.gguf sha256: 1c163f0e1f29485d432b466b9e5e0593ea9b10c5a62cf3eb71b77fcfe41db46c uri: https://huggingface.co/michaelw9999/Qwopus3.6-27B-Coder-MTP-NVFP4-GGUF/resolve/main/Qwopus3.6-27B-Coder-MTP-NVFP4-TURBO.gguf +- name: "qwopus3.6-27b-v2-mtp-nvfp4-paged" + url: "github:mudler/LocalAI/gallery/virtual.yaml@master" + urls: + - https://huggingface.co/michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF + description: "Blackwell GPU recommended (native FP4-MMA). Runs on other hardware via NVFP4 dequant, but slower; the throughput figures below are GB10 / DGX Spark (consumer Blackwell).\n\n\U0001FA90 Qwopus3.6-27B-v2-MTP\nMTP Release\n\nMulti-Token Prediction reasoning model fine-tuned from Qwen3.6-27B\n\n\U0001F9EC Trace Inversion & Negentropy\n\U0001F9E0 27B Parameters\n⚡ Speculative Decoding\n\U0001F6E0️ Coding / DevOps / Math\n\n\U0001F4A1 What is Qwopus3.6-27B-v2-MTP?\n\U0001FA90 Qwopus3.6-27B-v2-MTP is a speed-oriented reasoning release built on top of Qwen3.6-27B. It keeps the Qwopus line's focus on reconstructed reasoning traces, coding discipline, DevOps procedures, and mathematical derivations, while adding Multi-Token Prediction for faster generation. 
The goal is simple: preserve the depth and structure of a 27B reasoning model while making real interactive use noticeably faster.\n\n⚡ MTP Decoding\nAuxiliary future-token prediction improves throughput on long reasoning, code, math, and strict-format prompts.\n\U0001F9E9 Structured Reasoning\nInherits the Qwopus training recipe built around reconstructed step-by-step reasoning trajectories.\n\U0001F9EA GB10 Tested\nValidated on a 30-question local benchmark across Logic, Coding, DevOps, Math, and Edge tasks.\n\U0001F680 Practical Speed\nDesigned for workflows where strong answers matter, but waiting several extra minutes per task does not.\n\n...\n\n\nLocalAI paged-attention backend variant (llama-cpp-localai-paged): on-demand paged KV cache plus a decode-first prefill budget.\n" + tags: + - llm + - gguf + - nvfp4 + - blackwell + overrides: + backend: llama-cpp-localai-paged + f16: true + flash_attention: "on" + context_size: 131072 + gpu_layers: 99 + batch: 512 + function: + automatic_tool_parsing_fallback: true + grammar: + disable: true + known_usecases: + - chat + options: + - use_jinja:true + - paged_kv:true # LLAMA_KV_PAGED=1 + - max_batch_tokens:512 # decode-first QoS budget (27B dense) + - kv_unified:false # per-slot paged capacity/memory benefit needs a per-sequence cache + - parallel:128 # 128 serving slots + parameters: + model: llama-cpp/models/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF.gguf + template: + use_tokenizer_template: true + files: + - filename: llama-cpp/models/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF.gguf + sha256: 2a0a36fd10374c2a85356121c7c315bda725c7eaca0b3ae14838567629c6924a + uri: https://huggingface.co/michaelw9999/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF/resolve/main/Qwopus3.6-27B-v2-MTP-NVFP4-GGUF.gguf +- name: "qwopus3.6-27b-coder-mtp-nvfp4-paged" + url: "github:mudler/LocalAI/gallery/virtual.yaml@master" + urls: + - https://huggingface.co/michaelw9999/Qwopus3.6-27B-Coder-MTP-NVFP4-GGUF + description: 
"Blackwell GPU recommended (native FP4-MMA). Runs on other hardware via NVFP4 dequant, but slower; the throughput figures below are GB10 / DGX Spark (consumer Blackwell).\n\n\U0001FA90 Qwopus-3.6-27B-Coder\nCoder SFT Release\n\nAgentic Coding & Tool-Use Reasoning Model Fine-Tuned on Qwopus3.6-27B-v2\n\n\U0001F9EC Trace Inversion & Negentropy\n\U0001F9E0 27B Dense Model\n⚡ Agentic Coding\n\U0001F6E0️ Tool Calling & Agent\n\U0001F3C6 SWE-bench Verified: 67.0% (off-thinking)\n\n\U0001F4A1 What is Qwopus-3.6-27B-Coder?\n\U0001FA90 Qwopus-3.6-27B-Coder is a reasoning-enhanced agentic coding model built on top of Qwopus3.6-27B-v2. It inherits the powerful reasoning foundation of the v2 base — which achieved 87.43% MMLU-Pro (300ex) and 75.25% SWE-bench Verified — and further specializes it for agentic code generation, structured tool calling, debugging, and instruction-following in developer workflows. The model is designed to excel at repository-level coding tasks, multi-turn tool orchestration, and complex logical reasoning under realistic agent environments.\n\n\U0001F9E9 Agentic Coding\nOptimized for repository-level coding, debugging, patch generation, and structured multi-step development workflows.\n\n\U0001F6E0️ Tool Calling\nLearns from real agent trajectories with tool definitions, tool calls, and environment feedback for robust multi-turn execution.\n\n...\n\n\nLocalAI paged-attention backend variant (llama-cpp-localai-paged): on-demand paged KV cache plus a decode-first prefill budget.\n" + tags: + - llm + - gguf + - nvfp4 + - blackwell + icon: https://cdn-uploads.huggingface.co/production/uploads/66309bd090589b7c65950665/sGQKmrMc6L6guMoaB5_Y2.png + overrides: + backend: llama-cpp-localai-paged + f16: true + flash_attention: "on" + context_size: 131072 + gpu_layers: 99 + batch: 512 + function: + automatic_tool_parsing_fallback: true + grammar: + disable: true + known_usecases: + - chat + options: + - use_jinja:true + - paged_kv:true # LLAMA_KV_PAGED=1 + - 
max_batch_tokens:512 # decode-first QoS budget (27B dense) + - kv_unified:false # per-slot paged capacity/memory benefit needs a per-sequence cache + - parallel:128 # 128 serving slots + parameters: + model: llama-cpp/models/Qwopus3.6-27B-Coder-MTP-NVFP4-GGUF/Qwopus3.6-27B-Coder-MTP-NVFP4-TURBO.gguf + template: + use_tokenizer_template: true + files: + - filename: llama-cpp/models/Qwopus3.6-27B-Coder-MTP-NVFP4-GGUF/Qwopus3.6-27B-Coder-MTP-NVFP4-TURBO.gguf + sha256: 1c163f0e1f29485d432b466b9e5e0593ea9b10c5a62cf3eb71b77fcfe41db46c + uri: https://huggingface.co/michaelw9999/Qwopus3.6-27B-Coder-MTP-NVFP4-GGUF/resolve/main/Qwopus3.6-27B-Coder-MTP-NVFP4-TURBO.gguf - name: "qwen3.6-27b-nvfp4-mtp" url: "github:mudler/LocalAI/gallery/virtual.yaml@master" urls: diff --git a/pkg/xsysinfo/gpu.go b/pkg/xsysinfo/gpu.go index da183212f46e..efff0e28e889 100644 --- a/pkg/xsysinfo/gpu.go +++ b/pkg/xsysinfo/gpu.go @@ -440,6 +440,20 @@ func parseComputeCap(cc string) (int, int) { return maj, min } +// IsNVIDIABlackwell reports whether an NVIDIA Blackwell-class consumer GPU is +// present, i.e. compute capability 12.x (sm_120 RTX 50-series, sm_121 GB10 / +// DGX Spark). Cached via NVIDIAComputeCapability. +// +// Note: datacenter Blackwell (B100/B200/GB200, sm_100 / cc 10.0) reports a +// different compute capability and is intentionally NOT matched here: this +// targets the sm_12x family where we measured the larger-physical-batch MoE +// prefill win. Returns false when nvidia-smi is unavailable or reports no 12.x +// device. 
+func IsNVIDIABlackwell() bool { + maj, _ := parseComputeCap(NVIDIAComputeCapability()) + return maj == 12 +} + // getNVIDIAGPUMemory queries NVIDIA GPUs using nvidia-smi func getNVIDIAGPUMemory() []GPUMemoryInfo { // Check if nvidia-smi is available diff --git a/scripts/changed-backends.js b/scripts/changed-backends.js index 758cdb8695b7..61349b6c48b9 100644 --- a/scripts/changed-backends.js +++ b/scripts/changed-backends.js @@ -47,6 +47,15 @@ function inferBackendPath(item) { // via a thin wrapper Makefile. Changes to either dir should retrigger it. return `backend/cpp/turboquant/`; } + // llama-cpp-localai-paged is the LocalAI paged-attention llama.cpp variant: its + // own llama.cpp pin (LLAMA_VERSION) plus the paged patch series, reusing + // backend/cpp/llama-cpp sources via a thin wrapper Makefile. Keep this branch + // BEFORE the generic `endsWith("llama-cpp")` branch below: although + // "Dockerfile.llama-cpp-localai-paged".endsWith("llama-cpp") is already false, + // the specific branch documents the mapping and is robust to future renames. + if (item.dockerfile.endsWith("llama-cpp-localai-paged")) { + return `backend/cpp/llama-cpp-localai-paged/`; + } if (item.dockerfile.endsWith("privacy-filter")) { return `backend/cpp/privacy-filter/`; } @@ -66,6 +75,13 @@ function inferBackendPathDarwin(item) { if (item.backend === "llama-cpp") { return `backend/cpp/llama-cpp/`; } + // llama-cpp-localai-paged on Darwin (the -metal-darwin-arm64-llama-cpp-localai-paged + // includeDarwin row) builds from the C++ sources under + // backend/cpp/llama-cpp-localai-paged, like stock llama-cpp. The matrix entry + // carries lang=go for runner/toolchain selection, but the source is C++. 
if (item.backend === "ds4") { @@ -281,6 +297,11 @@ function emitFilteredMatrix(changedFiles) { if (backend === "turboquant" && !changed) { changed = changedFiles.some(file => file.startsWith("backend/cpp/llama-cpp/")); } + // llama-cpp-localai-paged reuses backend/cpp/llama-cpp sources via a thin + // wrapper; changes to either directory should retrigger its pipeline. + if (backend === "llama-cpp-localai-paged" && !changed) { + changed = changedFiles.some(file => file.startsWith("backend/cpp/llama-cpp/")); + } fs.appendFileSync(process.env.GITHUB_OUTPUT, `${backend}=${changed ? 'true' : 'false'}\n`); } }