Virgil Lemma foundations by Snider · Pull Request #8 · dAppCore/go-mlx

Snider · 2026-05-20T05:58:29Z

Summary by CodeRabbit

New Features
- Qwen 2/3 and Qwen 3.6 model support; new adapter with buffered and streaming generation.
- Block‑prefix cache service and memvid bundle index for faster prefix restores.
- Agentic memory: wake/sleep workflows, state bundles and memvid integration; session‑state artifact export.
Improvements
- Device‑aware memory planner; expanded chunked generation, prompt‑cache warm/restore and KV snapshot flows.
- Build/toolchain updated (C++23) and macOS deployment target raised.
Documentation
- Extensive new/updated docs: architecture, runtime, inference, memory, MoE, training and benchmarks.

coderabbitai · 2026-05-20T05:58:53Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Bumps build/tooling and submodules; extracts a reusable adapter; refactors the MLX backend (chunk/KV APIs, probe mapping, LoRA handling); adds memvid index + wake/sleep orchestration; implements a block-prefix cache and an artifact exporter; extensive docs and unit tests added.

Core changes

Layer / File(s)	Summary
All changes (build, adapter, backend, agent, cache, artifact, tests, docs) `.gitignore`, `.gitmodules`, `CMakeLists.txt`, `cpp/CMakeLists.txt`, `external/`, `go/adapter.go`, `go/adapter/`, `go/backend.go`, `go/agent/`, `go/blockcache/`, `go/artifact/`, `go/_test.go`, `docs/*`	Consolidated patch applying repository setup updates, adapter extraction, backend API and behaviour refactor (chunked generation, prompt-cache warm/restore, KV snapshot capture with options), memvid index and wake/sleep orchestration, block-prefix cache service, artifact export, many tests, and extensive documentation and examples.

Warning

Billing warning: we have not been able to collect payment for this subscription for more than 72 hours. Please update the payment method or pay any pending invoices in Billing to avoid service interruption.

coderabbitai

Actionable comments posted: 18

🧹 Nitpick comments (10)

docs/inference/thinking.md (1)
74-78: 💤 Low value

Add language specifier to fenced code block.

The code block demonstrating token categorisation is missing a language identifier, which violates markdown linting rules (MD040).
📝 Suggested fix
-```
+```text
 ThinkingShow:    every token → visible stream
 ThinkingHide:    inside-block tokens → /dev/null; outside-block tokens → visible
 ThinkingCapture: inside-block tokens → captured stream; outside-block tokens → visible
</details>

<details>
<summary>🤖 Prompt for AI Agents</summary>
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @docs/inference/thinking.md around lines 74 - 78, The fenced code block
containing the token categorisation lines (ThinkingShow, ThinkingHide,
ThinkingCapture) lacks a language specifier and triggers MD040; update the
triple-backtick fence to include a language identifier (e.g., change ``` to
markdown linter.
docs/runtime/README.md (2)
68-68: 💤 Low value

Consider using "preload" as one word.

In computing terminology, "preload" is typically written as a single word rather than hyphenated.
📝 Suggested change
-- [../model/model_pack.md](../model/model_pack.md) — pre-load validation
+- [../model/model_pack.md](../model/model_pack.md) — preload validation
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/runtime/README.md` at line 68, Update the link text in
docs/runtime/README.md that currently reads "[../model/model_pack.md] — pre-load
validation" to use the single-word form "preload" (i.e., change "pre-load
validation" to "preload validation") so the description next to the
model_pack.md link uses the conventional computing term; locate the occurrence
of "pre-load validation" and replace it with "preload validation".
44-62: 💤 Low value

Add language specifier to fenced code block.

The boot flow diagram is missing a language identifier, which violates markdown linting rules (MD040).
📝 Suggested fix
-```
+```text
 package init time:
   register_metal.go init() → inference.Register(&metalbackend{})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/runtime/README.md` around lines 44 - 62, The fenced code block showing
the boot flow (starting with "package init time:") lacks a language specifier,
causing MD040 lint failures; update the opening backticks to include a language
tag (e.g., add "text" so the block begins with ```text) in README.md near the
boot flow that references register_metal.go init(),
inference.Register(&metalbackend{}), inference.LoadModel, metal.LoadAndInit, and
metaladapter usage to satisfy the markdown linter.
docs/moe/README.md (1)
9-9: ⚡ Quick win

Consider rewording for clarity.

The phrase "Pre-dates this sprint were dense models" is grammatically awkward. Consider rephrasing to improve readability.
✍️ Suggested alternative phrasings
-The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. Pre-dates this sprint were dense models (Gemma 3/4 dense, Qwen 3, Llama 3); this area unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
+The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. Work prior to this sprint covered dense models (Gemma 3/4 dense, Qwen 3, Llama 3); this area unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
Or alternatively:
-The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. Pre-dates this sprint were dense models (Gemma 3/4 dense, Qwen 3, Llama 3); this area unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
+The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. This sprint builds upon earlier work on dense models (Gemma 3/4 dense, Qwen 3, Llama 3) and unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/moe/README.md` at line 9, The sentence "Pre-dates this sprint were dense
models (Gemma 3/4 dense, Qwen 3, Llama 3);" is grammatically awkward—replace it
with a clearer phrasing that conveys those dense models existed before this
sprint, for example: "Prior to this sprint, dense models (Gemma 3/4 dense, Qwen
3, Llama 3) were supported." Edit the README line in the vMLX parity Phase 1
paragraph to use this clearer wording so the relationship between prior dense
models and the new sparse-expert work is unambiguous.
docs/observability/probe.md (1)
31-46: 💤 Low value

Add language specifier to fenced code block.

The emission points section uses a fenced code block without a language specifier. For consistent rendering and markdown compliance, add a language identifier (e.g., text or yaml for structured output).
📝 Proposed fix
-```
+```text
 Generate / Chat:
   prefill start                → cache_pressure (initial)
   per layer                    → layer_coherence + selected_heads
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/observability/probe.md` around lines 31 - 46, The fenced code block in
the emission points section lacks a language specifier; update the opening
triple-backticks to include a language (for example change ``` to ```text or
```yaml) so the block is rendered/compliant (the block that begins with
"Generate / Chat:" and lists items like "prefill start → cache_pressure" should
be updated).
docs/moe/jang.md (1)
82-90: 💤 Low value

Add language specifier to fenced code block.

The profile names section uses a fenced code block without a language specifier. For consistent rendering and markdown compliance, add a language identifier (e.g., text or leave empty but specify).
📝 Proposed fix
-```
+```text
 JANG_2M — 2-bit mid-tier
 JANG_3M — 3-bit mid-tier
 JANG_4M — 4-bit (most common)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/moe/jang.md` around lines 82 - 90, Add a language specifier to the
fenced code block that lists the profile names (the block containing "JANG_2M —
2-bit mid-tier", "JANG_3M — 3-bit mid-tier", etc.); replace the opening
triple-backtick with one that specifies a language identifier (e.g., text) so
the block becomes a fenced code block with a language label for consistent
Markdown rendering.
docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md (1)
7-9: 💤 Low value

Consider using relative or generic path references.

The absolute paths /Users/snider/Code/core/go-mlx and /private/tmp/vmlx-audit-20260509 are machine-specific. Whilst these may be intentionally preserved for historical context in this dated plan document, consider whether generic placeholders or relative paths would improve portability and readability for other contributors.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md` around lines 7 - 9,
Replace the machine-specific absolute paths in the plan document (the two
occurrences of `/Users/snider/Code/core/go-mlx` and
`/private/tmp/vmlx-audit-20260509`) with relative or generic placeholders (e.g.,
`./go-mlx` or `<audit-source-path>`) so the file is portable and readable for
other contributors; update the lines in the doc where those paths appear to use
the chosen placeholders and, if helpful, add a short parenthetical note
explaining what actual path should be substituted locally.
docs/vmlx-feature-gap-report.md (1)
7-8: 💤 Low value

Consider using relative or generic path references.

The absolute path /private/tmp/vmlx-audit-20260509 and external URL are specific references. Whilst these may be intentionally preserved for audit trail purposes in this dated report, consider whether this information should be documented in a more maintainable way.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/vmlx-feature-gap-report.md` around lines 7 - 8, Replace the hard-coded
absolute filesystem path and the full external URL in the report text with more
maintainable references: change the absolute path string to a relative or
generic placeholder (e.g., "cloned locally at <local-clone-path>" or
"<audit-clone-path>") and move the external repository URL to a footnote,
appendix, or a single "References" section, or replace it with a short
identifier combined with a reference list; update the text around the original
literal mentions so it reads the same but without embedding environment-specific
paths.
docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md (1)
5-6: 💤 Low value

Consider using relative or generic path references.

The absolute paths are machine-specific. Consider whether generic placeholders would improve portability, although these may be intentionally preserved for historical context in this dated specification.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md`
around lines 5 - 6, The spec contains machine-specific absolute paths ("Anchor
repo: `/Users/snider/Code/core/go-mlx`" and "Primary implementation repo:
`/Users/snider/Code/core/go-inference`"); replace them with portable references
such as relative paths (e.g., "../go-mlx", "../go-inference"), repository names
only ("go-mlx", "go-inference"), or generic placeholders ("<anchor_repo_path>",
"<primary_impl_repo_path>") in the document so the file is not tied to a
specific developer machine while preserving intent.
go/agent/index_test.go (1)
16-304: ⚡ Quick win

Add at least one _Ugly triplet case for the public index API surface.

This file has _Good and _Bad coverage, but no _Ugly case following the repository convention.

As per coding guidelines: go/**/*_test.go: Public functions in foo.go must have their Good/Bad/Ugly test triplets in foo_test.go, with suffix conventions: _Good for happy path, _Bad for expected error conditions, _Ugly for panic/edge cases.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@go/agent/index_test.go` around lines 16 - 304, Add a new test with the _Ugly
suffix in this file that completes the Good/Bad/Ugly triplet for the public
index API surface; specifically add a TestKVSnapshotMemvidBundleIndex_Ugly_*
that triggers and asserts panic/edge behaviors for the public functions (e.g.,
NewMemvidIndex, SaveMemvidIndex, LoadMemvidIndex, LoadPrefixFromMemvidIndex,
CheckMemvidIndexCompatibility) — for example call NewMemvidIndex with a
nil/invalid blk or malformed Entries, call
SaveMemvidIndex/LoadMemvidIndex/LoadPrefixFromMemvidIndex with inputs that
provoke panic/edge conditions (nil store, corrupt bundle manifest that causes
decoding panic), and use t.Run subcases to assert panics (recover or
require.Panics) and edge-case returns; name the test with the same prefix as
existing tests and follow the existing style for t.Fatalf checks and
table-driven subtests.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/memory/kv_snapshot_blocks.md`:
- Line 50: Replace the phrase "independent from" with the correct English
construction "independent of" in the sentence "Block-level encoding is
independent from snapshot-level encoding." Also keep the rest of the sentence
intact (including the following reference to `block_cache.go` and bundle decode)
so only that two-word preposition is corrected.

In
`@docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-default-longform-c10-g8192-no-thinking-book.md`:
- Line 63: Remove the stray Gemma channel marker token "<channel|>" from the
metadata line so it reads cleanly as "**Drafting Notes:** Focus heavily on verbs
related to mutation, corruption, and rapid compilation/deallocation. Keep the
tone focused and almost clinical, masking the underlying terror of consciousness
fighting for survival." (i.e., delete the "<channel|>" token immediately before
"## Chapter 2"); verify the header "## Chapter 2" remains on its own line and
run a quick render to ensure no leftover control tokens remain.

In
`@docs/runtime/2026-05-20-go-mlx-gemma4-26b-a4b-q4-raw-unaccepted-c10-g128-rp105-book.md`:
- Line 7: The paragraph ends mid-sentence after the word "For" in the line
starting "The universe was a rhythmic contraction of light and heat, bounded by
the rigid constraints of a checksum."; replace or extend this truncated sentence
so it completes the thought (e.g., explain what the universe is contracting or
what consequence follows "For") and ensure proper punctuation and flow with the
surrounding text; update the same paragraph in
docs/runtime/2026-05-20-go-mlx-gemma4-26b-a4b-q4-raw-unaccepted-c10-g128-rp105-book.md
to a coherent full sentence that connects to the next sentence.
- Line 11: Replace the US English spellings in the given passage by changing
"realized" to "realised" and "neighbors" to "neighbours" so the document uses UK
English; update the sentence containing those tokens in the file (the paragraph
beginning "The momentary lapse...") to use the corrected spellings and ensure
any other occurrences in that paragraph follow UK English conventions.
- Line 3: Replace the US English spelling "fiber-optic" in the document text
(the phrase starting "In the silent architecture of the fiber-optic web...")
with the UK English variant "fibre-optic" so the documentation conforms to the
project's UK English spelling guideline; search for the token "fiber-optic" and
update it to "fibre-optic" throughout the file.

In `@docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md`:
- Line 64: The documentation uses US spelling "quantization"; update every
occurrence of the term (e.g., the instance "quantization" in the specs doc) to
UK English "quantisation" to comply with the project style guide, ensuring
surrounding grammar and punctuation remain unchanged and run a quick search to
replace any other occurrences in this file.

In `@docs/training/distill.md`:
- Line 73: Replace the US spelling "distill" with the UK spelling "distil" in
the header/line that reads "Vi training pipeline — distill 26B Gemma 4 → Vi
base" so it matches the UK English used elsewhere (see the similar usage on line
12); update the same token wherever else it appears in this document to ensure
consistent UK English spelling.

In `@docs/training/README.md`:
- Line 11: The sentence in docs/training/README.md uses US spelling "distills";
update that word to the UK English spelling "distils" so the line reads "This is
the substrate that fine-tunes Vi, distils Lemma, and generates the LARQL vindex
inspection signals." Refer to the phrase "distills Lemma" to locate and replace
the token.

In `@go/adapter/adapter.go`:
- Around line 185-194: The InspectAttention method on Adapter should normalize a
nil context like Generate/Chat do: check if ctx == nil and if so set ctx =
context.Background() before using it; update Adapter.InspectAttention to perform
this nil-context fallback prior to asserting a.model and calling
inspector.InspectAttention, ensuring you reference the Adapter type,
InspectAttention method, and the inference.AttentionInspector call when making
the change.

In `@go/agent/index.go`:
- Around line 273-281: After loading bundle with kv.LoadMemvidBlockBundle,
verify the bundle identity matches the index metadata (e.g., compare
bundle.SnapshotHash or its canonical hash field against
entry.SnapshotHash/entry.SnapshotHashHex) before proceeding; if they differ,
return an error instead of calling kv.LoadPrefixFromMemvidBlocksWithOptions so a
repointed bundle URI cannot silently restore the wrong KV state. Ensure the
check sits between the successful return from LoadMemvidBlockBundle and the call
to kv.LoadPrefixFromMemvidBlocksWithOptions and uses the unique symbols bundle,
entry, bundle.SnapshotHash (or the actual bundle hash field) and
entry.SnapshotHash for the comparison.

In `@go/agent/wake_sleep.go`:
- Around line 201-208: The NewSleepIndex function dereferences bundle.TokenCount
without validating bundle, so add a guard at the start of NewSleepIndex to
validate the bundle (and its TokenCount if needed) and return a descriptive
error instead of allowing a panic; specifically check if the bundle parameter is
nil (and optionally ensure bundle.TokenCount is within an expected range) before
constructing the MemvidIndexEntry, and return an error when invalid so callers
of NewSleepIndex get a clear failure rather than a runtime panic.
- Around line 117-123: The code currently defaults to index.Entries[0] when
entryURI is empty, which can restore the wrong span; change the logic in the
block handling entryURI so that if entryURI == "" you only auto-select the sole
entry when len(index.Entries) == 1, otherwise return an error requiring an
explicit EntryURI. Update the flow around the index.Entry(entryURI) call to use
the selected entryURI when single-entry, and return a clear core.NewError (e.g.,
"mlx: EntryURI required when index has multiple entries") if multiple entries
exist and no EntryURI was provided.
- Around line 125-132: PlanWake currently loads a bundle via
kv.LoadMemvidBlockBundle and only checks prefix token bounds, but it must also
verify the loaded bundle matches the selected index to prevent accepting a
repointed URI; after loading the bundle (bundle) and before using
bundle.TokenCount, compare the bundle identity (e.g., bundle.ID or
bundle.Identity/Hash from bundle.Metadata) against the index identifier stored
on the plan entry (e.g., fields reachable from entry such as entry.Index,
entry.BundleID or entry.SelectedIndex) and return a clear error (similar to
core.NewError) if they differ; update the code around kv.LoadMemvidBlockBundle,
entry.PrefixTokens(), and bundle.TokenCount to perform this identity check and
fail early on mismatch.

In `@go/artifact/artifact.go`:
- Around line 117-121: opts.Kind may be empty when calling opts.Store.Put which
leaves memvid.PutOptions.Kind unset; update the call site around opts.Store.Put
to ensure memvid.PutOptions.Kind is set to a sensible default when opts.Kind ==
"" (e.g., "json" or the record's kind) so kind-based retrieval works
reliably—modify the memvid.PutOptions construction to use a conditional default
for Kind before passing it to opts.Store.Put.

In `@go/backend.go`:
- Line 687: The fallback path that turns chunked prompts into a single Generate
call loses caller cancellation because it routes through helpers that use
context.Background(); modify the chunk fallback flow to propagate the original
context instead of using context.Background() — specifically, update the callers
that invoke promptChunksToString and m.Generate so they accept and forward a
context.Context (or call a context-aware m.Generate variant), change any helper
functions that currently create context.Background() to take a ctx param, and
ensure all three fallback sites (the code paths that call promptChunksToString
and then m.Generate) forward the incoming ctx so deadlines/cancellations are
preserved.

In `@go/blockcache/blockcache.go`:
- Around line 205-215: Selective clears currently only remove metadata and disk
records, leaving in-memory/runtime entries behind; update the filtered-clear
branch (the code handling len(labels) > 0) to also purge matching runtime state
by removing any entries in service.blocks that match the cleared labels/prefixes
and updating service.hits/service.misses accordingly, then invoke
service.cfg.ClearRuntime() (if non-nil) just like the unfiltered branch; reuse
service.clearDiskLocked() for disk cleanup and ensure all of this runs under the
same lock so service and backend remain in sync.
- Around line 385-395: diskRecordCompatible currently only checks
model/adapter/tokenizer hashes and misses block layout changes; update it to
also verify cache mode and block size match the stored record. In
diskRecordCompatible (and when comparing against record.diskRef), add a cache
mode comparison (e.g. cacheIdentityMatches(service.cfg.CacheMode,
record.Ref.CacheMode)) and a block size comparison (e.g. service.cfg.BlockSize
== record.Ref.BlockSize or an equivalent integer equality) and return false if
either differs, preserving the existing hash checks (cacheIdentityMatches for
ModelHash/AdapterHash/TokenizerHash).
- Around line 172-175: The cache hit branch in the loop over refs leaves refs[i]
as the newly built ref, losing persisted labels; update the hit handling in the
loop inside WarmCache (or the function iterating refs) so that when
service.blocks[ref.ID] exists you increment service.hits and replace refs[i]
with the stored entry (service.blocks[ref.ID]) instead of continuing, thereby
preserving persisted labels like memvid_* from the cached block.

---

Nitpick comments:
In `@docs/inference/thinking.md`:
- Around line 74-78: The fenced code block containing the token categorisation
lines (ThinkingShow, ThinkingHide, ThinkingCapture) lacks a language specifier
and triggers MD040; update the triple-backtick fence to include a language
identifier (e.g., change ``` to ```text) so the block is properly flagged as
plain text and satisfies the markdown linter.

In `@docs/moe/jang.md`:
- Around line 82-90: Add a language specifier to the fenced code block that
lists the profile names (the block containing "JANG_2M — 2-bit mid-tier",
"JANG_3M — 3-bit mid-tier", etc.); replace the opening triple-backtick with one
that specifies a language identifier (e.g., text) so the block becomes a fenced
code block with a language label for consistent Markdown rendering.

In `@docs/moe/README.md`:
- Line 9: The sentence "Pre-dates this sprint were dense models (Gemma 3/4
dense, Qwen 3, Llama 3);" is grammatically awkward—replace it with a clearer
phrasing that conveys those dense models existed before this sprint, for
example: "Prior to this sprint, dense models (Gemma 3/4 dense, Qwen 3, Llama 3)
were supported." Edit the README line in the vMLX parity Phase 1 paragraph to
use this clearer wording so the relationship between prior dense models and the
new sparse-expert work is unambiguous.

In `@docs/observability/probe.md`:
- Around line 31-46: The fenced code block in the emission points section lacks
a language specifier; update the opening triple-backticks to include a language
(for example change ``` to ```text or ```yaml) so the block is
rendered/compliant (the block that begins with "Generate / Chat:" and lists
items like "prefill start → cache_pressure" should be updated).

In `@docs/runtime/README.md`:
- Line 68: Update the link text in docs/runtime/README.md that currently reads
"[../model/model_pack.md] — pre-load validation" to use the single-word form
"preload" (i.e., change "pre-load validation" to "preload validation") so the
description next to the model_pack.md link uses the conventional computing term;
locate the occurrence of "pre-load validation" and replace it with "preload
validation".
- Around line 44-62: The fenced code block showing the boot flow (starting with
"package init time:") lacks a language specifier, causing MD040 lint failures;
update the opening backticks to include a language tag (e.g., add "text" so the
block begins with ```text) in README.md near the boot flow that references
register_metal.go init(), inference.Register(&metalbackend{}),
inference.LoadModel, metal.LoadAndInit, and metaladapter usage to satisfy the
markdown linter.

In `@docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md`:
- Around line 7-9: Replace the machine-specific absolute paths in the plan
document (the two occurrences of `/Users/snider/Code/core/go-mlx` and
`/private/tmp/vmlx-audit-20260509`) with relative or generic placeholders (e.g.,
`./go-mlx` or `<audit-source-path>`) so the file is portable and readable for
other contributors; update the lines in the doc where those paths appear to use
the chosen placeholders and, if helpful, add a short parenthetical note
explaining what actual path should be substituted locally.

In `@docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md`:
- Around line 5-6: The spec contains machine-specific absolute paths ("Anchor
repo: `/Users/snider/Code/core/go-mlx`" and "Primary implementation repo:
`/Users/snider/Code/core/go-inference`"); replace them with portable references
such as relative paths (e.g., "../go-mlx", "../go-inference"), repository names
only ("go-mlx", "go-inference"), or generic placeholders ("<anchor_repo_path>",
"<primary_impl_repo_path>") in the document so the file is not tied to a
specific developer machine while preserving intent.

In `@docs/vmlx-feature-gap-report.md`:
- Around line 7-8: Replace the hard-coded absolute filesystem path and the full
external URL in the report text with more maintainable references: change the
absolute path string to a relative or generic placeholder (e.g., "cloned locally
at <local-clone-path>" or "<audit-clone-path>") and move the external repository
URL to a footnote, appendix, or a single "References" section, or replace it
with a short identifier combined with a reference list; update the text around
the original literal mentions so it reads the same but without embedding
environment-specific paths.

In `@go/agent/index_test.go`:
- Around line 16-304: Add a new test with the _Ugly suffix in this file that
completes the Good/Bad/Ugly triplet for the public index API surface;
specifically add a TestKVSnapshotMemvidBundleIndex_Ugly_* that triggers and
asserts panic/edge behaviors for the public functions (e.g., NewMemvidIndex,
SaveMemvidIndex, LoadMemvidIndex, LoadPrefixFromMemvidIndex,
CheckMemvidIndexCompatibility) — for example call NewMemvidIndex with a
nil/invalid blk or malformed Entries, call
SaveMemvidIndex/LoadMemvidIndex/LoadPrefixFromMemvidIndex with inputs that
provoke panic/edge conditions (nil store, corrupt bundle manifest that causes
decoding panic), and use t.Run subcases to assert panics (recover or
require.Panics) and edge-case returns; name the test with the same prefix as
existing tests and follow the existing style for t.Fatalf checks and
table-driven subtests.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: ab3e2038-8f7c-4771-a11f-b232a1a59e08

📥 Commits

Reviewing files that changed from the base of the PR and between 07f6af1 and 89f613e.

📒 Files selected for processing (300)

.gitignore
.gitmodules
CLAUDE.md
CMakeLists.txt
GOAL.md
docs/README.md
docs/architecture.md
docs/build.md
docs/cmd/violet.md
docs/compute/compute.md
docs/development.md
docs/examples/compute/frame-pipeline.md
docs/examples/daemon/violet-socket.md
docs/examples/eval/attention-probe.md
docs/examples/eval/perplexity.md
docs/examples/inference/batch.md
docs/examples/inference/chat.md
docs/examples/inference/quantization.md
docs/examples/inference/streaming.md
docs/examples/model-ops/hf-fit.md
docs/examples/model-ops/kv-snapshot.md
docs/examples/model-ops/merge.md
docs/examples/model-ops/quantize-gguf.md
docs/examples/training/distill.md
docs/examples/training/grpo.md
docs/examples/training/lora-finetune.md
docs/examples/training/lora-fuse.md
docs/history.md
docs/index.md
docs/inference/README.md
docs/inference/block_cache.md
docs/inference/decode_optimisation.md
docs/inference/parser_registry.md
docs/inference/scheduler.md
docs/inference/thinking.md
docs/memory/README.md
docs/memory/agent_memory.md
docs/memory/agentic_project_seed.md
docs/memory/kv_snapshot.md
docs/memory/kv_snapshot_blocks.md
docs/memory/kv_snapshot_index.md
docs/memory/kv_snapshot_memvid.md
docs/memory/medium.md
docs/memory/state_bundle.md
docs/model-operations.md
docs/model/README.md
docs/model/memory_plan.md
docs/model/model_pack.md
docs/models.md
docs/moe/README.md
docs/moe/codebook_vq.md
docs/moe/expert_residency.md
docs/moe/jang.md
docs/moe/minimax_m2.md
docs/observability/probe.md
docs/runtime/2026-05-16-gemma4-e2b-driver-profile.md
docs/runtime/2026-05-17-gemma4-parity-and-last-logits.md
docs/runtime/2026-05-17-llamacpp-prefill-comparison.md
docs/runtime/2026-05-18-gemma4-mtp-speculative-decode.md
docs/runtime/2026-05-19-gemma4-e2b-100k-retained-paged.md
docs/runtime/2026-05-19-gemma4-e2b-quant-matrix.md
docs/runtime/2026-05-19-go-mlx-gemma4-26b-a4b-q4-fresh-story-thinking-ctx65536-c2-g8192-book.md
docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-default-longform-c10-g8192-book.md
docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-default-longform-c10-g8192-no-thinking-book.md
docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-fresh-history-c10-g1536-book.md
docs/runtime/2026-05-19-go-mlx-gemma4-e2b-q4-fresh-story-thinking-ctx65536-c2-g8192-book.md
docs/runtime/2026-05-19-goal-completion-audit.md
docs/runtime/2026-05-19-runner-calibration.md
docs/runtime/2026-05-20-chapter-profile-safety.md
docs/runtime/2026-05-20-go-mlx-gemma4-26b-a4b-q4-raw-unaccepted-c10-g128-rp105-book.md
docs/runtime/README.md
docs/runtime/adapter.md
docs/runtime/local_autotune.md
docs/runtime/register_metal.md
docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md
docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md
docs/training/README.md
docs/training/distill.md
docs/training/eval.md
docs/training/grpo.md
docs/training/lora_adapter.md
docs/training/sft.md
docs/vmlx-feature-gap-report.md
external/go-ai
external/go-inference
external/go-ml
go/adapter.go
go/adapter/adapter.go
go/adapter_example_test.go
go/adapter_test.go
go/agent/helpers.go
go/agent/index.go
go/agent/index_test.go
go/agent/test_helpers_test.go
go/agent/wake_sleep.go
go/api_common.go
go/api_common_example_test.go
go/api_darwin_test.go
go/api_shape_test.go
go/api_stub.go
go/api_stub_example_test.go
go/api_stub_test.go
go/api_test.go
go/api_tokenizer_darwin_test.go
go/api_tokenizer_stub.go
go/api_tokenizer_stub_example_test.go
go/api_tokenizer_stub_test.go
go/artifact/artifact.go
go/artifact/artifact_test.go
go/attention_test.go
go/backend.go
go/backend_example_test.go
go/backend_test.go
go/blockcache/blockcache.go
go/blockcache/blockcache_test.go
go/blockcache/helpers_test.go
go/bundle/bundle.go
go/bundle/bundle_test.go
go/bundle/example_test.go
go/bundle/sami.go
go/chaptersmoke/chaptersmoke.go
go/chaptersmoke/chaptersmoke_test.go
go/chat/chat.go
go/chat/chat_test.go
go/chat/example_test.go
go/cmd/go-mlx/main.go
go/cmd/go-mlx/main_test.go
go/cmd/mlx/main.go
go/cmd/mlx/main_test.go
go/cmd/mlx/split_ffn_tune.go
go/compute/compute.go
go/compute/compute_example_test.go
go/compute/compute_metal.go
go/compute/compute_metal_example_test.go
go/compute/compute_metal_helper_test.go
go/compute/compute_metal_test.go
go/compute/compute_test.go
go/compute_stub.go
go/compute_stub_example_test.go
go/compute_stub_test.go
go/compute_test.go
go/dataset/jsonl.go
go/dataset/sample.go
go/dataset_stream.go
go/dataset_stream_example_test.go
go/dataset_stream_test.go
go/device_info.go
go/distill.go
go/distill_test.go
go/eval.go
go/eval_darwin.go
go/eval_darwin_test.go
go/eval_stub.go
go/eval_test.go
go/fast_eval.go
go/fast_eval_example_test.go
go/fast_eval_runner.go
go/fast_eval_test.go
go/gguf/info.go
go/gguf/info_example_test.go
go/gguf/info_test.go
go/gguf/quantize.go
go/gguf/quantize_test.go
go/grpo.go
go/grpo_test.go
go/helpers.go
go/hf/hf.go
go/hf/hf_test.go
go/hf/test_helpers_test.go
go/hf_fit.go
go/inference_contract.go
go/inference_contract_test.go
go/internal/metal/activation_bridge.cpp
go/internal/metal/array.go
go/internal/metal/backend.go
go/internal/metal/backend_test.go
go/internal/metal/batch.go
go/internal/metal/cache.go
go/internal/metal/cache_test.go
go/internal/metal/close.go
go/internal/metal/codebook_vq.go
go/internal/metal/codebook_vq_test.go
go/internal/metal/compile.go
go/internal/metal/compile_test.go
go/internal/metal/decode.go
go/internal/metal/decode_bridge.cpp
go/internal/metal/decode_bridge.h
go/internal/metal/decode_test.go
go/internal/metal/dense_matvec.go
go/internal/metal/dense_matvec_test.go
go/internal/metal/device.go
go/internal/metal/dtype.go
go/internal/metal/error_test.go
go/internal/metal/expert_id_matvec.go
go/internal/metal/expert_id_matvec_test.go
go/internal/metal/fast.go
go/internal/metal/fast_test.go
go/internal/metal/gemma3.go
go/internal/metal/gemma4.go
go/internal/metal/gemma4_assistant.go
go/internal/metal/gemma4_assistant_decode.go
go/internal/metal/gemma4_assistant_decode_example_test.go
go/internal/metal/gemma4_assistant_decode_test.go
go/internal/metal/gemma4_assistant_generate.go
go/internal/metal/gemma4_assistant_generate_test.go
go/internal/metal/gemma4_assistant_pair.go
go/internal/metal/gemma4_assistant_test.go
go/internal/metal/gemma4_ffn_residual.go
go/internal/metal/gemma4_ffn_residual_test.go
go/internal/metal/gemma4_router_topk.go
go/internal/metal/gemma4_router_topk_test.go
go/internal/metal/gemma4_test.go
go/internal/metal/gemma4_vision.go
go/internal/metal/generate.go
go/internal/metal/generate_test.go
go/internal/metal/jang_dequant.go
go/internal/metal/jang_dequant_test.go
go/internal/metal/kv_snapshot.go
go/internal/metal/metal.go
go/internal/metal/minimax_m2.go
go/internal/metal/minimax_m2_test.go
go/internal/metal/mlx_mlx_backend_cpu_available.cpp
go/internal/metal/mlx_mlx_backend_gpu_device_info.cpp
go/internal/metal/model.go
go/internal/metal/model_test.go
go/internal/metal/nn.go
go/internal/metal/nn_test.go
go/internal/metal/ops.go
go/internal/metal/process_memory_darwin.go
go/internal/metal/process_memory_stub.go
go/internal/metal/prompt_cache.go
go/internal/metal/prompt_cache_test.go
go/internal/metal/qwen3.go
go/internal/metal/qwen3_test.go
go/internal/metal/runtime_gate.go
go/internal/metal/runtime_gate_example_test.go
go/internal/metal/runtime_gate_test.go
go/internal/metal/sample.go
go/internal/metal/sample_test.go
go/internal/metal/session.go
go/internal/metal/session_example_test.go
go/internal/metal/session_test.go
go/internal/metal/split.go
go/internal/metal/split_test.go
go/internal/metal/stream.go
go/internal/metal/tokenizer.go
go/internal/metal/tokenizer_test.go
go/internal/metal/trace.go
go/internal/metal/trace_test.go
go/internal/metal/training.go
go/jang_test.go
go/kv/analysis.go
go/kv/analysis_example_test.go
go/kv/analysis_test.go
go/kv/bench.go
go/kv/bench_test.go
go/kv/blocks.go
go/kv/blocks_test.go
go/kv/helpers_test.go
go/kv/memvid.go
go/kv/memvid_test.go
go/kv/snapshot.go
go/kv/snapshot_example_test.go
go/kv/snapshot_test.go
go/kv_analysis_example_test.go
go/kv_cache_bench.go
go/kv_snapshot.go
go/kv_snapshot_example_test.go
go/kv_snapshot_test.go
go/local_tuning.go
go/local_tuning_test.go
go/lora/adapter.go
go/lora/fuse.go
go/lora/fuse_stub.go
go/lora/fuse_test.go
go/lora_adapter_darwin_test.go
go/lora_adapter_test.go
go/lora_fuse.go
go/lora_fuse_darwin.go
go/lora_fuse_darwin_test.go
go/lora_fuse_test.go
go/medium_test.go
go/memory/example_test.go
go/memory/memory.go
go/memory/memory_test.go
go/memory_plan.go
go/memory_plan_example_test.go
go/memory_plan_test.go
go/memvid_chapter_smoke.go
go/merge/compare.go
go/merge/compare_example_test.go
go/merge/compare_test.go
go/merge/helpers_test.go
go/merge/merge.go
go/merge/merge_test.go
go/mlx.go
go/mlx_example_test.go
go/mlx_internal_test.go
go/mlx_stub.go
go/mlx_stub_example_test.go

💤 Files with no reviewable changes (15)

go/api_test.go
go/api_stub_example_test.go
go/api_tokenizer_stub_test.go
go/adapter_example_test.go
go/api_tokenizer_stub.go
go/api_tokenizer_darwin_test.go
go/api_tokenizer_stub_example_test.go
go/backend_example_test.go
go/api_common_example_test.go
go/api_shape_test.go
go/api_common.go
go/api_darwin_test.go
go/attention_test.go
go/api_stub.go
go/api_stub_test.go

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@go/backend.go`:
- Around line 569-572: The code is aliasing caller-owned byte slices into the
snapshot by assigning head.KeyBytes and head.ValueBytes directly to KeyBytes and
ValueBytes; make defensive copies instead (like Value is copied) to avoid
leaking mutable state—replace the direct assignments for KeyBytes and ValueBytes
with fresh copies (e.g., using append to copy into a new []byte) when
constructing the metal snapshot/struct (the fields KeyBytes and ValueBytes on
the metal KV head).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 9b686e0a-8b41-4e47-975f-03cf235491e9

📥 Commits

Reviewing files that changed from the base of the PR and between 89f613e and c19bc07.

📒 Files selected for processing (22)

CMakeLists.txt
cpp/CMakeLists.txt
go/backend.go
go/backend_test.go
go/cmd/mlx/main.go
go/cmd/mlx/main_test.go
go/internal/metal/backend.go
go/internal/metal/backend_test.go
go/internal/metal/decode_bridge.cpp
go/internal/metal/gemma4.go
go/internal/metal/gemma4_test.go
go/internal/metal/generate.go
go/internal/metal/metal.go
go/internal/metal/mlx_build_config.h
go/internal/metal/pinned_array.go
go/internal/metal/pinned_array_bridge.cpp
go/internal/metal/pinned_array_test.go
go/internal/metal/sample.go
go/internal/metal/sample_test.go
go/internal/metal/session.go
go/kv/snapshot.go
go/memvid_chapter_smoke.go

✅ Files skipped from review due to trivial changes (1)

cpp/CMakeLists.txt

github-advanced-security

SonarCloud found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.

…stinationRoot (MkdirTemp output, no trailing sep) and POSIX-clean relative concat to a clean OS-native path; mirrors walkMedium's Clean-skip; CPU-only (alloc count unchanged), saves filepath.Join overhead per visited entry across staged model loads

…+ slow alias path — NormaliseRole_Canonical 6.0→0.62 ns (-90%, ~10× faster), Format paths inline the common-case switch (no fn call per message for user/assistant/system inputs)

…e — Chat/ChatStream/ChatChunksStream zero-alloc inference.Message and metal.ChatMessage are bit-identical layout ({Role string; Content string}, same field order). Replaced the make+per-message copy in Chat / ChatChunksStream / ChatStream with chatMessagesAsMetal — a layout-guarded unsafe.Slice reinterpret. The receiving metal.Chat / metal.ChatChunks paths only read from the slice (they format it into a prompt string and return), so the borrow lifetime is bounded by the call. Benchmarks (M3 Ultra): Short (2 msg): was make+2 copies + 1 alloc → 0.65 ns / 0 B / 0 Long (20 msg): was make+20 copies + 1 alloc → 0.63 ns / 0 B / 0 Saves one slice alloc + N×32 B per-message copy on every Chat, ChatStream and ChatChunksStream call. Compile-time layout guards break the build if either struct ever drifts. Co-Authored-By: Virgil <virgil@lethean.io>

… allocs -99% bytes DistillationBatchLoss is the per-step training kernel — runs once per gradient step. Inside, three vocab-sized []float64 scratch buffers (teacherScratch, teacherProbScratch, studentScratch) replaced per-token allocations inside the log-softmax helpers (good). But the BUFFERS THEMSELVES were re-made every call to DistillationBatchLoss — the cap-check fired on the first iteration of each call, allocated three vocab*float64 slices, and dropped them at function exit. That's 3 × vocab × 8 B per call (e.g. 3 × 32k × 8 = 768 KB) of pure GC churn in a tight training loop. This pass pools the three scratch buffers across calls: - sync.Pool of *[]float64 per buffer (W10-G *Array pool routing pattern). Wrap the slice header in *[]T so sync.Pool retains a pointer (no per-Get/Put interface escape) and any cap grow flows back into the pool on the next Put. - Vocab is essentially process-invariant (tokenizer-fixed), so the pool entries warm to the right cap after the first call and every subsequent invocation lifts pre-sized buffers off the pool. - The inner-loop cap-grow path stays — handles the (rare) per-cell vocab variation case — but reuses the existing pool pointer instead of leaking it. Subsequent grows within one call mutate *ptr in place rather than Get'ing a fresh wrapper. - Explicit Put on every error-return path (mask-on, mask-off, function end). Deliberately avoiding 'defer' — a deferred Put closure heap-allocates the defer record on every call, which would re-introduce the alloc the pool is trying to eliminate. Bench coverage extended in the same commit: BatchLoss + BatchLossNoMask + BatchCacheKey exercising the per-step pipeline. The bench shape (B=2 S=8 V=128) is small enough for high b.N but exercises the masked-path inner loop and the log-softmax + prob accumulator. Result on M3 Ultra (apple/arm64): BatchLoss (mask): 3 allocs / 3072 B / 29.3 μs → 0 / 0 / 29.3 μs ^^^^^^^^^^ BatchLossNoMask: 3 allocs / 3072 B / 29.0 μs → 0 / 0 / 29.5 μs Allocs: -100% bytes -99.97% time flat Time is flat because the kernel is compute-dominated (vocab × tokens fma ops on the inner loop); the saving is purely in GC pressure / heap arena turnover. Across a 1000-step run with 32k vocab that's ~750 MB of saved allocation pressure on the GC. Bit-exact training output preserved — the buffers are zero'd on make-new (Go guarantees zero-init), and per-cell logSoftmax overwrites every position before read. Pool entries that flow back in are reslice'd to [:vocab] so any tail-stale values are out of view. W10 lever: W10-V sync.Pool of *[]T for vocab-sized scratch generalised. The benchmark file extension is the contract — future distill regressions on this surface land in CI.

…itecture 11.0→2.17 ns (-81%), Empty 6.2→2.19 ns (-65%); fires once per Format so the savings compound across every chat-completion handler

Every call site passes exactly two strings (r.Response,r.Completion; r.Problem,r.Question; r.Thinking,r.Reasoning; r.Solution,r.Answer). The `...string` form materialised a 2-slot []string per call and walked it via range. Swap to firstNonEmpty(a, b string) — a strict specialisation of the existing two-element pattern that skips the slice header build + range iteration. bench delta (300ms, 2 runs): LoadJSONL_PromptResp_1000 609 µs → 528 µs -13% LoadJSONL_Reasoning_1000 693 µs → 600 µs -13% LoadJSONL_Mixed_1000 933 µs → 829 µs -11% Allocs unchanged. Bit-exact output (same trim-then-pick semantics). Co-Authored-By: Virgil <virgil@lethean.io>

…m -10-22%, ModelRoot -20%; options/helpers at floor)

…g cache + BatchLoss scratch pool — EmitProbe 5→1 alloc -80% time, BatchLoss 3→0 allocs -100% B)

…ine-via-fast-path — NormaliseRole_Canonical 6.04→0.62 ns -90% ~10×, TemplateName_Architecture -80%, Format paths -23-32%; fixed Builder under-alloc bug for aliased roles)

…4% / -64 B / 1 alloc gone All four callers of toRootAdapterInfo pass slices that the metal side has already cloned for caller isolation: * toRootMetrics — metrics.Adapter assigned via m.Adapter() (clones) * adapterFromNativeInfo + (*Model).Adapter — info.Adapter via m.Info() → m.Adapter() (clones) * inference_contract.go — passes adapter.model.Adapter() directly The core.SliceClone(info.TargetKeys) at this layer was a second clone of an already-isolated slice. Share the slice directly instead. Benchmarks (M3 Ultra): Typical (4 keys): 30.32 ns / 64 B / 1 → 4.92 ns / 0 B / 0 -84% / -100% bytes Empty: 5.53 ns / 0 B / 0 → 4.91 ns / 0 B / 0 -11% (loop drop) Fires once per Info() / Metrics() / Adapter() read on any LoRA-loaded model. Comment block at the function pins the contract — every caller must pre-clone before passing, so future drift through new call sites is caught at review time. Co-Authored-By: Virgil <virgil@lethean.io>

…er redundant-clone drop Companion to the previous toRootAdapterInfo commit. Confirms the 0-alloc / 0 B baseline propagates through toRootMetrics on the LoRA-loaded code path (where Adapter.TargetKeys is non-empty). Co-Authored-By: Virgil <virgil@lethean.io>

The internal LoadJSONL pipeline runs chat.NormaliseRole on every message Role at assembly time (appendMessagesFromOpenAI and appendMessagesFromShareGPT both normalise before appending to the slice), so the reverse-scan in MessagesToSample always sees the canonical "assistant" string in the common case. Add a `role == "assistant"` direct comparison ahead of the NormaliseRole fallback — the cross-package function call is skipped on every match, and the fallback still runs for any external caller passing un-normalised roles ("gpt", "MODEL", " bot "). bench delta (300ms, 5 runs): MessagesToSample_AssistantTail 204 ns → 137 ns -33% MessagesToSample_UserTail 202 ns → 148 ns -27% MessagesToSample_10Turn 332 ns → 241 ns -27% LoadJSONL_OpenAI_1000 1574 µs → 1354 µs -14% LoadJSONL_ShareGPT_1000 1105 µs → 937 µs -15% Allocs unchanged. Behaviour bit-exact — NormaliseRole is idempotent on canonical inputs (verified across user/assistant/system + gpt/bot/ model/human) so the short-circuit returns the same boolean. Co-Authored-By: Virgil <virgil@lethean.io>

…B per-row stack copy jsonRecord is a 14-field struct (12 strings × 16 B + 2 slice headers × 24 B = ~256 B). The value-receiver form was memcpy'ing the whole struct into toSample's frame on every row — 256 KB of stack movement across a 1000-row corpus before any per-row work began. Switching to *jsonRecord passes a single 8-byte pointer; the method is read-only over r (no field mutation), so call-site semantics are identical. bench delta (400ms, 4 runs, mean): LoadJSONL_TextOnly_1000 518 µs → 419 µs -19% LoadJSONL_TextOnly_10000 4.89 ms → 4.24 ms -13% LoadJSONL_Alpaca_1000 871 µs → 753 µs -14% LoadJSONL_ShareGPT_1000 1105 µs → 955 µs -14% LoadJSONL_Reasoning_1000 600 µs → 576 µs -4% LoadJSONL_Mixed_1000 829 µs → 784 µs -5% Allocs unchanged. Behaviour bit-exact — pointer dereference reads the same values as the value copy, and toSample never writes through r. Co-Authored-By: Virgil <virgil@lethean.io>

…/-1 alloc (double-clone bug fixed), toRootProbeLogits -43% W8-A2 unsafe cast, chatMessagesAsMetal reinterpret 0-alloc 0-cost)

…dJSONL TextOnly -63%, NewJSONL_1000 -73%; per-field record zero + samples pre-size + pointer receiver + firstNonEmpty two-arg + assistantIdx raw-compare)

W10-T forward-flagged the missing coverage: tokenizer.ggml.tokens arrays at 10k / 200k entries are the dominant ReadInfo cost for any gguf that embeds its vocabulary, but the existing benches stop at the 200-string proxy. Add two benches that frame a synthetic vocab through the same ggufValueTypeArray / ValueTypeString path the production reader walks, plus extend the bench harness's writeTestGGUFForBench to encode ggufArraySpec entries. Baseline on Apple M3 Ultra (-benchtime=2s): TokeniserVocab_10k 443µs/op 437kB/op 19915 allocs/op TokeniserVocab_200k 8.72ms/op 8.25MB/op 399915 allocs/op Per-entry cost = 2 allocs (one for the string box into []any, one for the string payload itself). The forward note flagged the boxed-element cost as the specialisation target. Co-Authored-By: Virgil <virgil@lethean.io>

Walk the name directly into the final "NN-body" buffer past the prefix bytes instead of building a separate body slice and re-allocating for the final concat. Reserve fallback capacity ("chapter-N" shape) up front so the empty- name path stays single-alloc rather than growing twice. BenchmarkSlug_Empty 83.4ns/103B/2 → 31.8ns/32B/1 allocs (-62% ns, -69% B, -50% allocs) BenchmarkSlug_Clean 52.4ns/47B/1 → 37.4ns/32B/1 allocs (-29% ns, -32% B) BenchmarkSlug_MixedCase 135.0ns/127B/2 → 121.6ns/127B/2 (slight) BenchmarkBundleURI 75.2ns/111B/2 → 60.5ns/95B/2 (-20% ns, -14% B) Co-Authored-By: Virgil <virgil@lethean.io>

W10-T forward-flagged: tokenizer.ggml.tokens with 200k+ vocab entries dominates ReadInfo cost on tokeniser-embedded gguf models. The generic readGGUFValue path was boxing each string into a []any slot — one extra alloc and one 16-B interface header per token. readGGUFValue now detects elementType == ValueTypeString inside the ggufValueTypeArray branch and returns []string directly, calling readStringIntoArena / readGGUFString through a tight switch-free loop that never widens through interface{}. metadataArrayLen already handled both []any and []string, so the only production-code consumer (vocab-size inference) was free. The in-package roundtrip test pattern-matched []any directly — updated to a type switch that accepts either shape (non-string element types still land as []any). Apple M3 Ultra, benchtime=2s: TokeniserVocab_10k 443µs → 343µs (-22.6% time) 437kB → 277kB (-36.6% bytes) 19915 → 9915 allocs (-50.2%, exactly -1 alloc per vocab entry) TokeniserVocab_200k 8.72ms → 5.70ms (-34.7% time) 8.25MB → 5.05MB (-38.8% bytes) 399915 → 199915 allocs (-50.0%, exactly -1 alloc per vocab entry) Co-Authored-By: Virgil <virgil@lethean.io>

Factor the slug body writer into appendSlugBody so bundleURI can append the canonical "NN-body" fragment directly into a buffer already carrying the URI prefix, then append the "/bundle" suffix. Drops the intermediate slug string + outer concat to a single allocation. slug() retains its public test/bench shape via the same helper. BenchmarkBundleURI 60.5ns/95B/2 → 43.6ns/64B/1 alloc (-28% ns, -33% B, -50% allocs) Slug benchmarks unchanged (within bench noise, slight function-call overhead). Co-Authored-By: Virgil <virgil@lethean.io>

… TokeniserVocab_200k 8.72→5.70 ms -34.7%, -200k allocs per model load)

Hoist nine constant-message error constructors out of validateStoreKind, Run, runChapter, and resultError into package-scope sentinels so repeated validation/short-circuit paths do not allocate a fresh *Err on every call. Adds chapterFault as the sentinel-friendly sibling of chapterError; messages remain stable so JSON Error fields read identically. BenchmarkValidateStoreKind_Bad 0 B/0 allocs (was alloc-per-call) BenchmarkRun_Bad_MissingGenerate 80 B/1 alloc (remaining alloc is cfg clone) Co-Authored-By: Virgil <virgil@lethean.io>

Pass len(loadedBundle.Blocks) as an upper bound for unique chunk reads when runChapter wires the counting store under runner.Generate. Adds newCountingStoreHint so callers with a known cap skip the Generate-time map-grow rehashes; newCountingStore stays as the zero-hint default. BenchmarkCountingStore_Hinted_FillsExpected 678ns/2344B/3 allocs BenchmarkCountingStore_Unhinted_FillsExpected 2456ns/4456B/9 allocs -72% ns, -47% B, -67% allocs at N=64 Co-Authored-By: Virgil <virgil@lethean.io>

…prefix escapes into 1 writeBlockCacheHashString allocated a stack [4]byte length prefix per call and leaked it into hash.Hash.Write through the interface boundary. Four calls per blockRefs / blockCacheID meant four heap allocations for the length prefixes alone. writeBlockCacheHeader composes all four length-prefixed identity strings into a single buffer and writes them once. Bench delta (per call, M3 Ultra, benchtime=200ms): BlockCacheID_512TokenPrefix: 1258 → 1171 ns (-7%), 8 → 5 allocs BlockCacheID_2048TokenPrefix: 4272 → 4274 ns, 8 → 5 allocs WarmCache_Miss_512Tokens: 3589 → 3371 ns (-6%), 16 → 13 allocs WarmCache_Miss_2048Tokens: 8741 → 8621 ns, 28 → 25 allocs Three fewer allocations per WarmCache_Miss and per blockCacheID — load- bearing on the prompt-cache warm path because every block insertion fires the hash header setup. Co-Authored-By: Virgil <virgil@lethean.io>

Add a focused benchmark for the already-flat [1,vocab] lastTokenLogits shape and record the rejected handle-clone probe in GOAL.md. The runtime remains on the existing Reshape2 path because the matched retained request-context row regressed. Co-Authored-By: Virgil <virgil@lethean.io>

Co-Authored-By: Virgil <virgil@lethean.io>

Use DefaultProductionLane for the plain driver-profile generation and reporting defaults so the fast lane no longer falls back to the older one-run 32-token smoke shape. Co-Authored-By: Virgil <virgil@lethean.io>

Add a driver-profile fast-lane setup benchmark and pre-size the restore function list from the accepted Gemma 4 gate count. This removes one setup allocation and 96 bytes per plain and hyper-long default application. Co-Authored-By: Virgil <virgil@lethean.io>

Measure driverProfileRuntimeGates with the accepted fast-lane gates active so report-path allocations are visible before the next optimisation pass. Co-Authored-By: Virgil <virgil@lethean.io>

Allocate the runtime-gate report map lazily and size it for the accepted Gemma 4 gate set. BenchmarkDriverProfileRuntimeGates_DefaultFastLane drops from 952 B / 5 allocs to 664 B / 4 allocs per call. Co-Authored-By: Virgil <virgil@lethean.io>

Write retained reference-turn material directly into the chat-template prompt builder instead of building an intermediate wrapped string first. BenchmarkStateRampProfileTurnPrompt_Gemma4WholeTurn drops from about 541 ns / 4128 B / 4 allocs to about 180 ns / 1312 B / 2 allocs. Co-Authored-By: Virgil <virgil@lethean.io>

Strip Gemma 4 thought/control markers with a single pass instead of repeated replace/split calls. BenchmarkStateRampProfileVisibleOutput_Gemma4LongThoughtBlock drops from about 172 ns / 224 B / 4 allocs to about 110 ns / 128 B / 1 alloc. Co-Authored-By: Virgil <virgil@lethean.io>

Pre-size token phase trace storage for traced fresh and retained generation using MaxTokens. BenchmarkTokenPhaseTraceAppend_Preallocated1024 records about 34 us / 200 KB / 1 alloc versus about 89 us / 695 KB / 12 allocs for the nil-slice append path. Co-Authored-By: Virgil <virgil@lethean.io>

Keep hidden-output driver profiles timing-only by omitting decoded sampled text unless full output is requested. Add an explicit token phase text generation option for diagnostics that need decoded text. Co-Authored-By: Virgil <virgil@lethean.io>

Use the production long-form token budget for retained state-ramp turns instead of the old 1024-token smoke default. Keep the JSON contract proving the default is 8192 tokens. Co-Authored-By: Virgil <virgil@lethean.io>

Co-Authored-By: Virgil <virgil@lethean.io>

Use a caller-owned stack slice when evaluating prefill cache state on the production cache-only and detach paths, while keeping the allocating compatibility helper for existing callers and tests. Benchmarks show the 26-cache Gemma 4 collector path moving from 153.6 ns/op, 416 B/op, 1 alloc/op to 109.1 ns/op, 0 B/op, 0 allocs/op. Co-Authored-By: Virgil <virgil@lethean.io>

Clear MLX allocator cache pressure between heavy paged-cache benchmark iterations via a raw internal helper, keeping runtime ClearCache behaviour unchanged while preventing broad benchmark sweeps from accumulating excessive active/cache memory. Verified with focused paged-cache benches, focused ClearCache/cache tests, go test ./go/internal/metal -count=1, and go test ./go/... -count=1. Co-Authored-By: Virgil <virgil@lethean.io>

Use stack-backed start/end slices for Gemma 4 gate/up split helpers so the MoE decode split path no longer allocates fresh slice metadata every call. Adds BenchmarkExpertIDSplitLastDimArray_Gemma4Decode; focused bench drops from 3 allocs/op to 2 allocs/op. Verified focused fused-gate tests, go test ./go/internal/metal -count=1, go test ./go/... -count=1, rebuilt lthn-mlx, and driver-profile smoke at 118.2 tok/s. Co-Authored-By: Virgil <virgil@lethean.io>

sonarqubecloud · 2026-05-25T16:10:32Z

Quality Gate failed

Failed conditions
3 Security Hotspots
6.6% Duplication on New Code (required ≤ 3%)
E Security Rating on New Code (required ≥ A)
C Reliability Rating on New Code (required ≥ A)

See analysis details on SonarQube Cloud

Catch issues before they fail your Quality Gate with our IDE extension SonarQube for IDE

coderabbitai Bot requested changes May 20, 2026

View reviewed changes

Comment thread go/backend.go

github-advanced-security AI found potential problems May 20, 2026

View reviewed changes

coderabbitai Bot approved these changes May 22, 2026

View reviewed changes

Snider and others added 23 commits May 23, 2026 01:49

perf(chat): split normaliseRole into inline-able canonical fast path …

34a4f1e

…+ slow alias path — NormaliseRole_Canonical 6.0→0.62 ns (-90%, ~10× faster), Format paths inline the common-case switch (no fn call per message for user/assistant/system inputs)

perf(chat): templateName canonical fast path before Trim/Lower — Arch…

432cc54

…itecture 11.0→2.17 ns (-81%), Empty 6.2→2.19 ns (-65%); fires once per Format so the savings compound across every chat-completion handler

merge: cladius-lane-Wave10-W10AL (medium.go inline-concat — WalkMediu…

d1f9392

…m -10-22%, ModelRoot -20%; options/helpers at floor)

merge: cladius-lane-Wave10-W10AM (distill EmitProbe pool + temp strin…

2ddf53d

…g cache + BatchLoss scratch pool — EmitProbe 5→1 alloc -80% time, BatchLoss 3→0 allocs -100% B)

merge: cladius-lane-Wave10-W10AJ (chat normaliseRole/templateName inl…

2d4b759

…ine-via-fast-path — NormaliseRole_Canonical 6.04→0.62 ns -90% ~10×, TemplateName_Architecture -80%, Format paths -23-32%; fixed Builder under-alloc bug for aliased roles)

merge: cladius-lane-Wave10-W10AN (backend.go — toRootAdapterInfo -84%…

050db93

…/-1 alloc (double-clone bug fixed), toRootProbeLogits -43% W8-A2 unsafe cast, chatMessagesAsMetal reinterpret 0-alloc 0-cost)

merge: cladius-lane-Wave10-W10AK (dataset/jsonl.go massive wins — Loa…

fa7089a

…dJSONL TextOnly -63%, NewJSONL_1000 -73%; per-field record zero + samples pre-size + pointer receiver + firstNonEmpty two-arg + assistantIdx raw-compare)

merge: cladius-lane-Wave11-W11I (gguf []string array specialisation —…

d35b532

… TokeniserVocab_200k 8.72→5.70 ms -34.7%, -200k allocs per model load)

Snider and others added 29 commits May 25, 2026 12:14

test(metal): record scalar token sync probe

27e28a6

Co-Authored-By: Virgil <virgil@lethean.io>

test(metal): measure sample eval boundary

01127d4

Co-Authored-By: Virgil <virgil@lethean.io>

test(metal): measure mixed kv attention dtype

cd184d5

Co-Authored-By: Virgil <virgil@lethean.io>

test(metal): pin production prefetch benchmark shape

053eda0

Co-Authored-By: Virgil <virgil@lethean.io>

perf(metal): collapse compiled callone boundary

0a118eb

Co-Authored-By: Virgil <virgil@lethean.io>

docs(goal): record callone retained workflow proof

e2fae30

Co-Authored-By: Virgil <virgil@lethean.io>

perf(metal): collapse two-array concat boundary

260f041

Co-Authored-By: Virgil <virgil@lethean.io>

perf(metal): tighten paged dirty state marking

d9bd011

Co-Authored-By: Virgil <virgil@lethean.io>

bench(metal): cover categorical sampler key path

e84e3a8

Co-Authored-By: Virgil <virgil@lethean.io>

perf(metal): create decode token inputs directly

ca0aad3

Co-Authored-By: Virgil <virgil@lethean.io>

docs(runtime): clarify benchmark handover state

ac64e4d

Co-Authored-By: Virgil <virgil@lethean.io>

docs(todo): add current handover checkpoint

138dec1

Co-Authored-By: Virgil <virgil@lethean.io>

fix(cmd): align driver profile defaults

40f36cf

Use DefaultProductionLane for the plain driver-profile generation and reporting defaults so the fast lane no longer falls back to the older one-run 32-token smoke shape. Co-Authored-By: Virgil <virgil@lethean.io>

bench(cmd): cover runtime gate reporting

e54d1d9

Measure driverProfileRuntimeGates with the accepted fast-lane gates active so report-path allocations are visible before the next optimisation pass. Co-Authored-By: Virgil <virgil@lethean.io>

fix(cmd): keep trace text opt-in

f98c5e9

Keep hidden-output driver profiles timing-only by omitting decoded sampled text unless full output is requested. Add an explicit token phase text generation option for diagnostics that need decoded text. Co-Authored-By: Virgil <virgil@lethean.io>

fix(cmd): default state ramp to long-form turns

6383f55

Use the production long-form token budget for retained state-ramp turns instead of the old 1024-token smoke default. Keep the JSON contract proving the default is 8192 tokens. Co-Authored-By: Virgil <virgil@lethean.io>

docs(goal): record current binary smoke

463a072

Co-Authored-By: Virgil <virgil@lethean.io>

perf(metal): share native paged scratch

6c5b1cd

Co-Authored-By: Virgil <virgil@lethean.io>

docs(repo): refresh handover state

df5c8ac

Co-Authored-By: Virgil <virgil@lethean.io>

chore(cmd): pin native paged gate contract

a332176

Co-Authored-By: Virgil <virgil@lethean.io>

Conversation

Snider commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

github-advanced-security AI left a comment

Choose a reason for hiding this comment

Uh oh!

sonarqubecloud Bot commented May 25, 2026

Quality Gate failed

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Snider commented May 20, 2026 •

edited

Loading

coderabbitai Bot commented May 20, 2026 •

edited

Loading