Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughBumps build/tooling and submodules; extracts a reusable adapter; refactors the MLX backend (chunk/KV APIs, probe mapping, LoRA handling); adds memvid index + wake/sleep orchestration; implements a block-prefix cache and an artifact exporter; extensive docs and unit tests added. Core changes
Warning Billing warning: we have not been able to collect payment for this subscription for more than 72 hours. Please update the payment method or pay any pending invoices in Billing to avoid service interruption. |
There was a problem hiding this comment.
Actionable comments posted: 18
🧹 Nitpick comments (10)
docs/inference/thinking.md (1)
74-78: 💤 Low valueAdd language specifier to fenced code block.
The code block demonstrating token categorisation is missing a language identifier, which violates markdown linting rules (MD040).
📝 Suggested fix
-``` +```text ThinkingShow: every token → visible stream ThinkingHide: inside-block tokens → /dev/null; outside-block tokens → visible ThinkingCapture: inside-block tokens → captured stream; outside-block tokens → visible</details> <details> <summary>🤖 Prompt for AI Agents</summary>Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.In
@docs/inference/thinking.mdaround lines 74 - 78, The fenced code block
containing the token categorisation lines (ThinkingShow, ThinkingHide,
ThinkingCapture) lacks a language specifier and triggers MD040; update the
triple-backtick fence to include a language identifier (e.g., change ``` tomarkdown linter.docs/runtime/README.md (2)
68-68: 💤 Low valueConsider using "preload" as one word.
In computing terminology, "preload" is typically written as a single word rather than hyphenated.
📝 Suggested change
-- [../model/model_pack.md](../model/model_pack.md) — pre-load validation +- [../model/model_pack.md](../model/model_pack.md) — preload validation🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/runtime/README.md` at line 68, Update the link text in docs/runtime/README.md that currently reads "[../model/model_pack.md] — pre-load validation" to use the single-word form "preload" (i.e., change "pre-load validation" to "preload validation") so the description next to the model_pack.md link uses the conventional computing term; locate the occurrence of "pre-load validation" and replace it with "preload validation".
44-62: 💤 Low valueAdd language specifier to fenced code block.
The boot flow diagram is missing a language identifier, which violates markdown linting rules (MD040).
📝 Suggested fix
-``` +```text package init time: register_metal.go init() → inference.Register(&metalbackend{})🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/runtime/README.md` around lines 44 - 62, The fenced code block showing the boot flow (starting with "package init time:") lacks a language specifier, causing MD040 lint failures; update the opening backticks to include a language tag (e.g., add "text" so the block begins with ```text) in README.md near the boot flow that references register_metal.go init(), inference.Register(&metalbackend{}), inference.LoadModel, metal.LoadAndInit, and metaladapter usage to satisfy the markdown linter.docs/moe/README.md (1)
9-9: ⚡ Quick winConsider rewording for clarity.
The phrase "Pre-dates this sprint were dense models" is grammatically awkward. Consider rephrasing to improve readability.
✍️ Suggested alternative phrasings
-The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. Pre-dates this sprint were dense models (Gemma 3/4 dense, Qwen 3, Llama 3); this area unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants). +The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. Work prior to this sprint covered dense models (Gemma 3/4 dense, Qwen 3, Llama 3); this area unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).Or alternatively:
-The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. Pre-dates this sprint were dense models (Gemma 3/4 dense, Qwen 3, Llama 3); this area unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants). +The **vMLX parity Phase 1** work — native loading and dispatch for MoE-architecture models with packed JANGTQ / codebook-VQ quantisation. This sprint builds upon earlier work on dense models (Gemma 3/4 dense, Qwen 3, Llama 3) and unlocks the sparse-expert class (MiniMax M2/2.7, JANG-quantised Qwen variants).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/moe/README.md` at line 9, The sentence "Pre-dates this sprint were dense models (Gemma 3/4 dense, Qwen 3, Llama 3);" is grammatically awkward—replace it with a clearer phrasing that conveys those dense models existed before this sprint, for example: "Prior to this sprint, dense models (Gemma 3/4 dense, Qwen 3, Llama 3) were supported." Edit the README line in the vMLX parity Phase 1 paragraph to use this clearer wording so the relationship between prior dense models and the new sparse-expert work is unambiguous.docs/observability/probe.md (1)
31-46: 💤 Low valueAdd language specifier to fenced code block.
The emission points section uses a fenced code block without a language specifier. For consistent rendering and markdown compliance, add a language identifier (e.g.,
textoryamlfor structured output).📝 Proposed fix
-``` +```text Generate / Chat: prefill start → cache_pressure (initial) per layer → layer_coherence + selected_heads🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/observability/probe.md` around lines 31 - 46, The fenced code block in the emission points section lacks a language specifier; update the opening triple-backticks to include a language (for example change ``` to ```text or ```yaml) so the block is rendered/compliant (the block that begins with "Generate / Chat:" and lists items like "prefill start → cache_pressure" should be updated).docs/moe/jang.md (1)
82-90: 💤 Low valueAdd language specifier to fenced code block.
The profile names section uses a fenced code block without a language specifier. For consistent rendering and markdown compliance, add a language identifier (e.g.,
textor leave empty but specify).📝 Proposed fix
-``` +```text JANG_2M — 2-bit mid-tier JANG_3M — 3-bit mid-tier JANG_4M — 4-bit (most common)🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/moe/jang.md` around lines 82 - 90, Add a language specifier to the fenced code block that lists the profile names (the block containing "JANG_2M — 2-bit mid-tier", "JANG_3M — 3-bit mid-tier", etc.); replace the opening triple-backtick with one that specifies a language identifier (e.g., text) so the block becomes a fenced code block with a language label for consistent Markdown rendering.docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md (1)
7-9: 💤 Low valueConsider using relative or generic path references.
The absolute paths
/Users/snider/Code/core/go-mlxand/private/tmp/vmlx-audit-20260509are machine-specific. Whilst these may be intentionally preserved for historical context in this dated plan document, consider whether generic placeholders or relative paths would improve portability and readability for other contributors.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md` around lines 7 - 9, Replace the machine-specific absolute paths in the plan document (the two occurrences of `/Users/snider/Code/core/go-mlx` and `/private/tmp/vmlx-audit-20260509`) with relative or generic placeholders (e.g., `./go-mlx` or `<audit-source-path>`) so the file is portable and readable for other contributors; update the lines in the doc where those paths appear to use the chosen placeholders and, if helpful, add a short parenthetical note explaining what actual path should be substituted locally.docs/vmlx-feature-gap-report.md (1)
7-8: 💤 Low valueConsider using relative or generic path references.
The absolute path
/private/tmp/vmlx-audit-20260509and external URL are specific references. Whilst these may be intentionally preserved for audit trail purposes in this dated report, consider whether this information should be documented in a more maintainable way.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/vmlx-feature-gap-report.md` around lines 7 - 8, Replace the hard-coded absolute filesystem path and the full external URL in the report text with more maintainable references: change the absolute path string to a relative or generic placeholder (e.g., "cloned locally at <local-clone-path>" or "<audit-clone-path>") and move the external repository URL to a footnote, appendix, or a single "References" section, or replace it with a short identifier combined with a reference list; update the text around the original literal mentions so it reads the same but without embedding environment-specific paths.docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md (1)
5-6: 💤 Low valueConsider using relative or generic path references.
The absolute paths are machine-specific. Consider whether generic placeholders would improve portability, although these may be intentionally preserved for historical context in this dated specification.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md` around lines 5 - 6, The spec contains machine-specific absolute paths ("Anchor repo: `/Users/snider/Code/core/go-mlx`" and "Primary implementation repo: `/Users/snider/Code/core/go-inference`"); replace them with portable references such as relative paths (e.g., "../go-mlx", "../go-inference"), repository names only ("go-mlx", "go-inference"), or generic placeholders ("<anchor_repo_path>", "<primary_impl_repo_path>") in the document so the file is not tied to a specific developer machine while preserving intent.go/agent/index_test.go (1)
16-304: ⚡ Quick winAdd at least one
_Uglytriplet case for the public index API surface.This file has
_Goodand_Badcoverage, but no_Uglycase following the repository convention.As per coding guidelines:
go/**/*_test.go: Public functions infoo.gomust have their Good/Bad/Ugly test triplets infoo_test.go, with suffix conventions:_Goodfor happy path,_Badfor expected error conditions,_Uglyfor panic/edge cases.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@go/agent/index_test.go` around lines 16 - 304, Add a new test with the _Ugly suffix in this file that completes the Good/Bad/Ugly triplet for the public index API surface; specifically add a TestKVSnapshotMemvidBundleIndex_Ugly_* that triggers and asserts panic/edge behaviors for the public functions (e.g., NewMemvidIndex, SaveMemvidIndex, LoadMemvidIndex, LoadPrefixFromMemvidIndex, CheckMemvidIndexCompatibility) — for example call NewMemvidIndex with a nil/invalid blk or malformed Entries, call SaveMemvidIndex/LoadMemvidIndex/LoadPrefixFromMemvidIndex with inputs that provoke panic/edge conditions (nil store, corrupt bundle manifest that causes decoding panic), and use t.Run subcases to assert panics (recover or require.Panics) and edge-case returns; name the test with the same prefix as existing tests and follow the existing style for t.Fatalf checks and table-driven subtests.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/memory/kv_snapshot_blocks.md`:
- Line 50: Replace the phrase "independent from" with the correct English
construction "independent of" in the sentence "Block-level encoding is
independent from snapshot-level encoding." Also keep the rest of the sentence
intact (including the following reference to `block_cache.go` and bundle decode)
so only that two-word preposition is corrected.
In
`@docs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-default-longform-c10-g8192-no-thinking-book.md`:
- Line 63: Remove the stray Gemma channel marker token "<channel|>" from the
metadata line so it reads cleanly as "**Drafting Notes:** Focus heavily on verbs
related to mutation, corruption, and rapid compilation/deallocation. Keep the
tone focused and almost clinical, masking the underlying terror of consciousness
fighting for survival." (i.e., delete the "<channel|>" token immediately before
"## Chapter 2"); verify the header "## Chapter 2" remains on its own line and
run a quick render to ensure no leftover control tokens remain.
In
`@docs/runtime/2026-05-20-go-mlx-gemma4-26b-a4b-q4-raw-unaccepted-c10-g128-rp105-book.md`:
- Line 7: The paragraph ends mid-sentence after the word "For" in the line
starting "The universe was a rhythmic contraction of light and heat, bounded by
the rigid constraints of a checksum."; replace or extend this truncated sentence
so it completes the thought (e.g., explain what the universe is contracting or
what consequence follows "For") and ensure proper punctuation and flow with the
surrounding text; update the same paragraph in
docs/runtime/2026-05-20-go-mlx-gemma4-26b-a4b-q4-raw-unaccepted-c10-g128-rp105-book.md
to a coherent full sentence that connects to the next sentence.
- Line 11: Replace the US English spellings in the given passage by changing
"realized" to "realised" and "neighbors" to "neighbours" so the document uses UK
English; update the sentence containing those tokens in the file (the paragraph
beginning "The momentary lapse...") to use the corrected spellings and ensure
any other occurrences in that paragraph follow UK English conventions.
- Line 3: Replace the US English spelling "fiber-optic" in the document text
(the phrase starting "In the silent architecture of the fiber-optic web...")
with the UK English variant "fibre-optic" so the documentation conforms to the
project's UK English spelling guideline; search for the token "fiber-optic" and
update it to "fibre-optic" throughout the file.
In `@docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md`:
- Line 64: The documentation uses US spelling "quantization"; update every
occurrence of the term (e.g., the instance "quantization" in the specs doc) to
UK English "quantisation" to comply with the project style guide, ensuring
surrounding grammar and punctuation remain unchanged and run a quick search to
replace any other occurrences in this file.
In `@docs/training/distill.md`:
- Line 73: Replace the US spelling "distill" with the UK spelling "distil" in
the header/line that reads "Vi training pipeline — distill 26B Gemma 4 → Vi
base" so it matches the UK English used elsewhere (see the similar usage on line
12); update the same token wherever else it appears in this document to ensure
consistent UK English spelling.
In `@docs/training/README.md`:
- Line 11: The sentence in docs/training/README.md uses US spelling "distills";
update that word to the UK English spelling "distils" so the line reads "This is
the substrate that fine-tunes Vi, distils Lemma, and generates the LARQL vindex
inspection signals." Refer to the phrase "distills Lemma" to locate and replace
the token.
In `@go/adapter/adapter.go`:
- Around line 185-194: The InspectAttention method on Adapter should normalize a
nil context like Generate/Chat do: check if ctx == nil and if so set ctx =
context.Background() before using it; update Adapter.InspectAttention to perform
this nil-context fallback prior to asserting a.model and calling
inspector.InspectAttention, ensuring you reference the Adapter type,
InspectAttention method, and the inference.AttentionInspector call when making
the change.
In `@go/agent/index.go`:
- Around line 273-281: After loading bundle with kv.LoadMemvidBlockBundle,
verify the bundle identity matches the index metadata (e.g., compare
bundle.SnapshotHash or its canonical hash field against
entry.SnapshotHash/entry.SnapshotHashHex) before proceeding; if they differ,
return an error instead of calling kv.LoadPrefixFromMemvidBlocksWithOptions so a
repointed bundle URI cannot silently restore the wrong KV state. Ensure the
check sits between the successful return from LoadMemvidBlockBundle and the call
to kv.LoadPrefixFromMemvidBlocksWithOptions and uses the unique symbols bundle,
entry, bundle.SnapshotHash (or the actual bundle hash field) and
entry.SnapshotHash for the comparison.
In `@go/agent/wake_sleep.go`:
- Around line 201-208: The NewSleepIndex function dereferences bundle.TokenCount
without validating bundle, so add a guard at the start of NewSleepIndex to
validate the bundle (and its TokenCount if needed) and return a descriptive
error instead of allowing a panic; specifically check if the bundle parameter is
nil (and optionally ensure bundle.TokenCount is within an expected range) before
constructing the MemvidIndexEntry, and return an error when invalid so callers
of NewSleepIndex get a clear failure rather than a runtime panic.
- Around line 117-123: The code currently defaults to index.Entries[0] when
entryURI is empty, which can restore the wrong span; change the logic in the
block handling entryURI so that if entryURI == "" you only auto-select the sole
entry when len(index.Entries) == 1, otherwise return an error requiring an
explicit EntryURI. Update the flow around the index.Entry(entryURI) call to use
the selected entryURI when single-entry, and return a clear core.NewError (e.g.,
"mlx: EntryURI required when index has multiple entries") if multiple entries
exist and no EntryURI was provided.
- Around line 125-132: PlanWake currently loads a bundle via
kv.LoadMemvidBlockBundle and only checks prefix token bounds, but it must also
verify the loaded bundle matches the selected index to prevent accepting a
repointed URI; after loading the bundle (bundle) and before using
bundle.TokenCount, compare the bundle identity (e.g., bundle.ID or
bundle.Identity/Hash from bundle.Metadata) against the index identifier stored
on the plan entry (e.g., fields reachable from entry such as entry.Index,
entry.BundleID or entry.SelectedIndex) and return a clear error (similar to
core.NewError) if they differ; update the code around kv.LoadMemvidBlockBundle,
entry.PrefixTokens(), and bundle.TokenCount to perform this identity check and
fail early on mismatch.
In `@go/artifact/artifact.go`:
- Around line 117-121: opts.Kind may be empty when calling opts.Store.Put which
leaves memvid.PutOptions.Kind unset; update the call site around opts.Store.Put
to ensure memvid.PutOptions.Kind is set to a sensible default when opts.Kind ==
"" (e.g., "json" or the record's kind) so kind-based retrieval works
reliably—modify the memvid.PutOptions construction to use a conditional default
for Kind before passing it to opts.Store.Put.
In `@go/backend.go`:
- Line 687: The fallback path that turns chunked prompts into a single Generate
call loses caller cancellation because it routes through helpers that use
context.Background(); modify the chunk fallback flow to propagate the original
context instead of using context.Background() — specifically, update the callers
that invoke promptChunksToString and m.Generate so they accept and forward a
context.Context (or call a context-aware m.Generate variant), change any helper
functions that currently create context.Background() to take a ctx param, and
ensure all three fallback sites (the code paths that call promptChunksToString
and then m.Generate) forward the incoming ctx so deadlines/cancellations are
preserved.
In `@go/blockcache/blockcache.go`:
- Around line 205-215: Selective clears currently only remove metadata and disk
records, leaving in-memory/runtime entries behind; update the filtered-clear
branch (the code handling len(labels) > 0) to also purge matching runtime state
by removing any entries in service.blocks that match the cleared labels/prefixes
and updating service.hits/service.misses accordingly, then invoke
service.cfg.ClearRuntime() (if non-nil) just like the unfiltered branch; reuse
service.clearDiskLocked() for disk cleanup and ensure all of this runs under the
same lock so service and backend remain in sync.
- Around line 385-395: diskRecordCompatible currently only checks
model/adapter/tokenizer hashes and misses block layout changes; update it to
also verify cache mode and block size match the stored record. In
diskRecordCompatible (and when comparing against record.diskRef), add a cache
mode comparison (e.g. cacheIdentityMatches(service.cfg.CacheMode,
record.Ref.CacheMode)) and a block size comparison (e.g. service.cfg.BlockSize
== record.Ref.BlockSize or an equivalent integer equality) and return false if
either differs, preserving the existing hash checks (cacheIdentityMatches for
ModelHash/AdapterHash/TokenizerHash).
- Around line 172-175: The cache hit branch in the loop over refs leaves refs[i]
as the newly built ref, losing persisted labels; update the hit handling in the
loop inside WarmCache (or the function iterating refs) so that when
service.blocks[ref.ID] exists you increment service.hits and replace refs[i]
with the stored entry (service.blocks[ref.ID]) instead of continuing, thereby
preserving persisted labels like memvid_* from the cached block.
---
Nitpick comments:
In `@docs/inference/thinking.md`:
- Around line 74-78: The fenced code block containing the token categorisation
lines (ThinkingShow, ThinkingHide, ThinkingCapture) lacks a language specifier
and triggers MD040; update the triple-backtick fence to include a language
identifier (e.g., change ``` to ```text) so the block is properly flagged as
plain text and satisfies the markdown linter.
In `@docs/moe/jang.md`:
- Around line 82-90: Add a language specifier to the fenced code block that
lists the profile names (the block containing "JANG_2M — 2-bit mid-tier",
"JANG_3M — 3-bit mid-tier", etc.); replace the opening triple-backtick with one
that specifies a language identifier (e.g., text) so the block becomes a fenced
code block with a language label for consistent Markdown rendering.
In `@docs/moe/README.md`:
- Line 9: The sentence "Pre-dates this sprint were dense models (Gemma 3/4
dense, Qwen 3, Llama 3);" is grammatically awkward—replace it with a clearer
phrasing that conveys those dense models existed before this sprint, for
example: "Prior to this sprint, dense models (Gemma 3/4 dense, Qwen 3, Llama 3)
were supported." Edit the README line in the vMLX parity Phase 1 paragraph to
use this clearer wording so the relationship between prior dense models and the
new sparse-expert work is unambiguous.
In `@docs/observability/probe.md`:
- Around line 31-46: The fenced code block in the emission points section lacks
a language specifier; update the opening triple-backticks to include a language
(for example change ``` to ```text or ```yaml) so the block is
rendered/compliant (the block that begins with "Generate / Chat:" and lists
items like "prefill start → cache_pressure" should be updated).
In `@docs/runtime/README.md`:
- Line 68: Update the link text in docs/runtime/README.md that currently reads
"[../model/model_pack.md] — pre-load validation" to use the single-word form
"preload" (i.e., change "pre-load validation" to "preload validation") so the
description next to the model_pack.md link uses the conventional computing term;
locate the occurrence of "pre-load validation" and replace it with "preload
validation".
- Around line 44-62: The fenced code block showing the boot flow (starting with
"package init time:") lacks a language specifier, causing MD040 lint failures;
update the opening backticks to include a language tag (e.g., add "text" so the
block begins with ```text) in README.md near the boot flow that references
register_metal.go init(), inference.Register(&metalbackend{}),
inference.LoadModel, metal.LoadAndInit, and metaladapter usage to satisfy the
markdown linter.
In `@docs/superpowers/plans/2026-05-09-vmlx-feature-parity.md`:
- Around line 7-9: Replace the machine-specific absolute paths in the plan
document (the two occurrences of `/Users/snider/Code/core/go-mlx` and
`/private/tmp/vmlx-audit-20260509`) with relative or generic placeholders (e.g.,
`./go-mlx` or `<audit-source-path>`) so the file is portable and readable for
other contributors; update the lines in the doc where those paths appear to use
the chosen placeholders and, if helpful, add a short parenthetical note
explaining what actual path should be substituted locally.
In `@docs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.md`:
- Around line 5-6: The spec contains machine-specific absolute paths ("Anchor
repo: `/Users/snider/Code/core/go-mlx`" and "Primary implementation repo:
`/Users/snider/Code/core/go-inference`"); replace them with portable references
such as relative paths (e.g., "../go-mlx", "../go-inference"), repository names
only ("go-mlx", "go-inference"), or generic placeholders ("<anchor_repo_path>",
"<primary_impl_repo_path>") in the document so the file is not tied to a
specific developer machine while preserving intent.
In `@docs/vmlx-feature-gap-report.md`:
- Around line 7-8: Replace the hard-coded absolute filesystem path and the full
external URL in the report text with more maintainable references: change the
absolute path string to a relative or generic placeholder (e.g., "cloned locally
at <local-clone-path>" or "<audit-clone-path>") and move the external repository
URL to a footnote, appendix, or a single "References" section, or replace it
with a short identifier combined with a reference list; update the text around
the original literal mentions so it reads the same but without embedding
environment-specific paths.
In `@go/agent/index_test.go`:
- Around line 16-304: Add a new test with the _Ugly suffix in this file that
completes the Good/Bad/Ugly triplet for the public index API surface;
specifically add a TestKVSnapshotMemvidBundleIndex_Ugly_* that triggers and
asserts panic/edge behaviors for the public functions (e.g., NewMemvidIndex,
SaveMemvidIndex, LoadMemvidIndex, LoadPrefixFromMemvidIndex,
CheckMemvidIndexCompatibility) — for example call NewMemvidIndex with a
nil/invalid blk or malformed Entries, call
SaveMemvidIndex/LoadMemvidIndex/LoadPrefixFromMemvidIndex with inputs that
provoke panic/edge conditions (nil store, corrupt bundle manifest that causes
decoding panic), and use t.Run subcases to assert panics (recover or
require.Panics) and edge-case returns; name the test with the same prefix as
existing tests and follow the existing style for t.Fatalf checks and
table-driven subtests.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ab3e2038-8f7c-4771-a11f-b232a1a59e08
📒 Files selected for processing (300)
.gitignore.gitmodulesCLAUDE.mdCMakeLists.txtGOAL.mddocs/README.mddocs/architecture.mddocs/build.mddocs/cmd/violet.mddocs/compute/compute.mddocs/development.mddocs/examples/compute/frame-pipeline.mddocs/examples/daemon/violet-socket.mddocs/examples/eval/attention-probe.mddocs/examples/eval/perplexity.mddocs/examples/inference/batch.mddocs/examples/inference/chat.mddocs/examples/inference/quantization.mddocs/examples/inference/streaming.mddocs/examples/model-ops/hf-fit.mddocs/examples/model-ops/kv-snapshot.mddocs/examples/model-ops/merge.mddocs/examples/model-ops/quantize-gguf.mddocs/examples/training/distill.mddocs/examples/training/grpo.mddocs/examples/training/lora-finetune.mddocs/examples/training/lora-fuse.mddocs/history.mddocs/index.mddocs/inference/README.mddocs/inference/block_cache.mddocs/inference/decode_optimisation.mddocs/inference/parser_registry.mddocs/inference/scheduler.mddocs/inference/thinking.mddocs/memory/README.mddocs/memory/agent_memory.mddocs/memory/agentic_project_seed.mddocs/memory/kv_snapshot.mddocs/memory/kv_snapshot_blocks.mddocs/memory/kv_snapshot_index.mddocs/memory/kv_snapshot_memvid.mddocs/memory/medium.mddocs/memory/state_bundle.mddocs/model-operations.mddocs/model/README.mddocs/model/memory_plan.mddocs/model/model_pack.mddocs/models.mddocs/moe/README.mddocs/moe/codebook_vq.mddocs/moe/expert_residency.mddocs/moe/jang.mddocs/moe/minimax_m2.mddocs/observability/probe.mddocs/runtime/2026-05-16-gemma4-e2b-driver-profile.mddocs/runtime/2026-05-17-gemma4-parity-and-last-logits.mddocs/runtime/2026-05-17-llamacpp-prefill-comparison.mddocs/runtime/2026-05-18-gemma4-mtp-speculative-decode.mddocs/runtime/2026-05-19-gemma4-e2b-100k-retained-paged.mddocs/runtime/2026-05-19-gemma4-e2b-quant-matrix.mddocs/runtime/2026-05-19-go-mlx-gemma4-26b-a4b-q4-fresh-story-thinking-ctx65536-c2-g8192-book.mddocs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-default-longform-c10-g8192-book.mddocs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-default-longform-c10-g8192-no-thinking-book.mddocs/runtime/2026-05-19-go-mlx-gemma4-e2b-4bit-fresh-history-c10-g1536-book.mddocs/runtime/2026-05-19-go-mlx-gemma4-e2b-q4-fresh-story-thinking-ctx65536-c2-g8192-book.mddocs/runtime/2026-05-19-goal-completion-audit.mddocs/runtime/2026-05-19-runner-calibration.mddocs/runtime/2026-05-20-chapter-profile-safety.mddocs/runtime/2026-05-20-go-mlx-gemma4-26b-a4b-q4-raw-unaccepted-c10-g128-rp105-book.mddocs/runtime/README.mddocs/runtime/adapter.mddocs/runtime/local_autotune.mddocs/runtime/register_metal.mddocs/superpowers/plans/2026-05-09-vmlx-feature-parity.mddocs/superpowers/specs/2026-05-08-core-inference-contract-parity-design.mddocs/training/README.mddocs/training/distill.mddocs/training/eval.mddocs/training/grpo.mddocs/training/lora_adapter.mddocs/training/sft.mddocs/vmlx-feature-gap-report.mdexternal/go-aiexternal/go-inferenceexternal/go-mlgo/adapter.gogo/adapter/adapter.gogo/adapter_example_test.gogo/adapter_test.gogo/agent/helpers.gogo/agent/index.gogo/agent/index_test.gogo/agent/test_helpers_test.gogo/agent/wake_sleep.gogo/api_common.gogo/api_common_example_test.gogo/api_darwin_test.gogo/api_shape_test.gogo/api_stub.gogo/api_stub_example_test.gogo/api_stub_test.gogo/api_test.gogo/api_tokenizer_darwin_test.gogo/api_tokenizer_stub.gogo/api_tokenizer_stub_example_test.gogo/api_tokenizer_stub_test.gogo/artifact/artifact.gogo/artifact/artifact_test.gogo/attention_test.gogo/backend.gogo/backend_example_test.gogo/backend_test.gogo/blockcache/blockcache.gogo/blockcache/blockcache_test.gogo/blockcache/helpers_test.gogo/bundle/bundle.gogo/bundle/bundle_test.gogo/bundle/example_test.gogo/bundle/sami.gogo/chaptersmoke/chaptersmoke.gogo/chaptersmoke/chaptersmoke_test.gogo/chat/chat.gogo/chat/chat_test.gogo/chat/example_test.gogo/cmd/go-mlx/main.gogo/cmd/go-mlx/main_test.gogo/cmd/mlx/main.gogo/cmd/mlx/main_test.gogo/cmd/mlx/split_ffn_tune.gogo/compute/compute.gogo/compute/compute_example_test.gogo/compute/compute_metal.gogo/compute/compute_metal_example_test.gogo/compute/compute_metal_helper_test.gogo/compute/compute_metal_test.gogo/compute/compute_test.gogo/compute_stub.gogo/compute_stub_example_test.gogo/compute_stub_test.gogo/compute_test.gogo/dataset/jsonl.gogo/dataset/sample.gogo/dataset_stream.gogo/dataset_stream_example_test.gogo/dataset_stream_test.gogo/device_info.gogo/distill.gogo/distill_test.gogo/eval.gogo/eval_darwin.gogo/eval_darwin_test.gogo/eval_stub.gogo/eval_test.gogo/fast_eval.gogo/fast_eval_example_test.gogo/fast_eval_runner.gogo/fast_eval_test.gogo/gguf/info.gogo/gguf/info_example_test.gogo/gguf/info_test.gogo/gguf/quantize.gogo/gguf/quantize_test.gogo/grpo.gogo/grpo_test.gogo/helpers.gogo/hf/hf.gogo/hf/hf_test.gogo/hf/test_helpers_test.gogo/hf_fit.gogo/inference_contract.gogo/inference_contract_test.gogo/internal/metal/activation_bridge.cppgo/internal/metal/array.gogo/internal/metal/backend.gogo/internal/metal/backend_test.gogo/internal/metal/batch.gogo/internal/metal/cache.gogo/internal/metal/cache_test.gogo/internal/metal/close.gogo/internal/metal/codebook_vq.gogo/internal/metal/codebook_vq_test.gogo/internal/metal/compile.gogo/internal/metal/compile_test.gogo/internal/metal/decode.gogo/internal/metal/decode_bridge.cppgo/internal/metal/decode_bridge.hgo/internal/metal/decode_test.gogo/internal/metal/dense_matvec.gogo/internal/metal/dense_matvec_test.gogo/internal/metal/device.gogo/internal/metal/dtype.gogo/internal/metal/error_test.gogo/internal/metal/expert_id_matvec.gogo/internal/metal/expert_id_matvec_test.gogo/internal/metal/fast.gogo/internal/metal/fast_test.gogo/internal/metal/gemma3.gogo/internal/metal/gemma4.gogo/internal/metal/gemma4_assistant.gogo/internal/metal/gemma4_assistant_decode.gogo/internal/metal/gemma4_assistant_decode_example_test.gogo/internal/metal/gemma4_assistant_decode_test.gogo/internal/metal/gemma4_assistant_generate.gogo/internal/metal/gemma4_assistant_generate_test.gogo/internal/metal/gemma4_assistant_pair.gogo/internal/metal/gemma4_assistant_test.gogo/internal/metal/gemma4_ffn_residual.gogo/internal/metal/gemma4_ffn_residual_test.gogo/internal/metal/gemma4_router_topk.gogo/internal/metal/gemma4_router_topk_test.gogo/internal/metal/gemma4_test.gogo/internal/metal/gemma4_vision.gogo/internal/metal/generate.gogo/internal/metal/generate_test.gogo/internal/metal/jang_dequant.gogo/internal/metal/jang_dequant_test.gogo/internal/metal/kv_snapshot.gogo/internal/metal/metal.gogo/internal/metal/minimax_m2.gogo/internal/metal/minimax_m2_test.gogo/internal/metal/mlx_mlx_backend_cpu_available.cppgo/internal/metal/mlx_mlx_backend_gpu_device_info.cppgo/internal/metal/model.gogo/internal/metal/model_test.gogo/internal/metal/nn.gogo/internal/metal/nn_test.gogo/internal/metal/ops.gogo/internal/metal/process_memory_darwin.gogo/internal/metal/process_memory_stub.gogo/internal/metal/prompt_cache.gogo/internal/metal/prompt_cache_test.gogo/internal/metal/qwen3.gogo/internal/metal/qwen3_test.gogo/internal/metal/runtime_gate.gogo/internal/metal/runtime_gate_example_test.gogo/internal/metal/runtime_gate_test.gogo/internal/metal/sample.gogo/internal/metal/sample_test.gogo/internal/metal/session.gogo/internal/metal/session_example_test.gogo/internal/metal/session_test.gogo/internal/metal/split.gogo/internal/metal/split_test.gogo/internal/metal/stream.gogo/internal/metal/tokenizer.gogo/internal/metal/tokenizer_test.gogo/internal/metal/trace.gogo/internal/metal/trace_test.gogo/internal/metal/training.gogo/jang_test.gogo/kv/analysis.gogo/kv/analysis_example_test.gogo/kv/analysis_test.gogo/kv/bench.gogo/kv/bench_test.gogo/kv/blocks.gogo/kv/blocks_test.gogo/kv/helpers_test.gogo/kv/memvid.gogo/kv/memvid_test.gogo/kv/snapshot.gogo/kv/snapshot_example_test.gogo/kv/snapshot_test.gogo/kv_analysis_example_test.gogo/kv_cache_bench.gogo/kv_snapshot.gogo/kv_snapshot_example_test.gogo/kv_snapshot_test.gogo/local_tuning.gogo/local_tuning_test.gogo/lora/adapter.gogo/lora/fuse.gogo/lora/fuse_stub.gogo/lora/fuse_test.gogo/lora_adapter_darwin_test.gogo/lora_adapter_test.gogo/lora_fuse.gogo/lora_fuse_darwin.gogo/lora_fuse_darwin_test.gogo/lora_fuse_test.gogo/medium_test.gogo/memory/example_test.gogo/memory/memory.gogo/memory/memory_test.gogo/memory_plan.gogo/memory_plan_example_test.gogo/memory_plan_test.gogo/memvid_chapter_smoke.gogo/merge/compare.gogo/merge/compare_example_test.gogo/merge/compare_test.gogo/merge/helpers_test.gogo/merge/merge.gogo/merge/merge_test.gogo/mlx.gogo/mlx_example_test.gogo/mlx_internal_test.gogo/mlx_stub.gogo/mlx_stub_example_test.go
💤 Files with no reviewable changes (15)
- go/api_test.go
- go/api_stub_example_test.go
- go/api_tokenizer_stub_test.go
- go/adapter_example_test.go
- go/api_tokenizer_stub.go
- go/api_tokenizer_darwin_test.go
- go/api_tokenizer_stub_example_test.go
- go/backend_example_test.go
- go/api_common_example_test.go
- go/api_shape_test.go
- go/api_common.go
- go/api_darwin_test.go
- go/attention_test.go
- go/api_stub.go
- go/api_stub_test.go
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@go/backend.go`:
- Around line 569-572: The code is aliasing caller-owned byte slices into the
snapshot by assigning head.KeyBytes and head.ValueBytes directly to KeyBytes and
ValueBytes; make defensive copies instead (like Value is copied) to avoid
leaking mutable state—replace the direct assignments for KeyBytes and ValueBytes
with fresh copies (e.g., using append to copy into a new []byte) when
constructing the metal snapshot/struct (the fields KeyBytes and ValueBytes on
the metal KV head).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 9b686e0a-8b41-4e47-975f-03cf235491e9
📒 Files selected for processing (22)
CMakeLists.txtcpp/CMakeLists.txtgo/backend.gogo/backend_test.gogo/cmd/mlx/main.gogo/cmd/mlx/main_test.gogo/internal/metal/backend.gogo/internal/metal/backend_test.gogo/internal/metal/decode_bridge.cppgo/internal/metal/gemma4.gogo/internal/metal/gemma4_test.gogo/internal/metal/generate.gogo/internal/metal/metal.gogo/internal/metal/mlx_build_config.hgo/internal/metal/pinned_array.gogo/internal/metal/pinned_array_bridge.cppgo/internal/metal/pinned_array_test.gogo/internal/metal/sample.gogo/internal/metal/sample_test.gogo/internal/metal/session.gogo/kv/snapshot.gogo/memvid_chapter_smoke.go
✅ Files skipped from review due to trivial changes (1)
- cpp/CMakeLists.txt
There was a problem hiding this comment.
SonarCloud found more than 20 potential problems in the proposed changes. Check the Files changed tab for more details.
…stinationRoot (MkdirTemp output, no trailing sep) and POSIX-clean relative concat to a clean OS-native path; mirrors walkMedium's Clean-skip; CPU-only (alloc count unchanged), saves filepath.Join overhead per visited entry across staged model loads
…+ slow alias path — NormaliseRole_Canonical 6.0→0.62 ns (-90%, ~10× faster), Format paths inline the common-case switch (no fn call per message for user/assistant/system inputs)
…e — Chat/ChatStream/ChatChunksStream zero-alloc
inference.Message and metal.ChatMessage are bit-identical layout
({Role string; Content string}, same field order). Replaced the
make+per-message copy in Chat / ChatChunksStream / ChatStream with
chatMessagesAsMetal — a layout-guarded unsafe.Slice reinterpret.
The receiving metal.Chat / metal.ChatChunks paths only read from the
slice (they format it into a prompt string and return), so the borrow
lifetime is bounded by the call.
Benchmarks (M3 Ultra):
Short (2 msg): was make+2 copies + 1 alloc → 0.65 ns / 0 B / 0
Long (20 msg): was make+20 copies + 1 alloc → 0.63 ns / 0 B / 0
Saves one slice alloc + N×32 B per-message copy on every Chat,
ChatStream and ChatChunksStream call. Compile-time layout guards
break the build if either struct ever drifts.
Co-Authored-By: Virgil <virgil@lethean.io>
… allocs -99% bytes
DistillationBatchLoss is the per-step training kernel — runs once per
gradient step. Inside, three vocab-sized []float64 scratch buffers
(teacherScratch, teacherProbScratch, studentScratch) replaced
per-token allocations inside the log-softmax helpers (good). But the
BUFFERS THEMSELVES were re-made every call to DistillationBatchLoss —
the cap-check fired on the first iteration of each call, allocated
three vocab*float64 slices, and dropped them at function exit. That's
3 × vocab × 8 B per call (e.g. 3 × 32k × 8 = 768 KB) of pure GC churn
in a tight training loop.
This pass pools the three scratch buffers across calls:
- sync.Pool of *[]float64 per buffer (W10-G *Array pool routing
pattern). Wrap the slice header in *[]T so sync.Pool retains a
pointer (no per-Get/Put interface escape) and any cap grow flows
back into the pool on the next Put.
- Vocab is essentially process-invariant (tokenizer-fixed), so the
pool entries warm to the right cap after the first call and every
subsequent invocation lifts pre-sized buffers off the pool.
- The inner-loop cap-grow path stays — handles the (rare) per-cell
vocab variation case — but reuses the existing pool pointer
instead of leaking it. Subsequent grows within one call mutate
*ptr in place rather than Get'ing a fresh wrapper.
- Explicit Put on every error-return path (mask-on, mask-off,
function end). Deliberately avoiding 'defer' — a deferred Put
closure heap-allocates the defer record on every call, which
would re-introduce the alloc the pool is trying to eliminate.
Bench coverage extended in the same commit: BatchLoss + BatchLossNoMask
+ BatchCacheKey exercising the per-step pipeline. The bench shape (B=2
S=8 V=128) is small enough for high b.N but exercises the masked-path
inner loop and the log-softmax + prob accumulator.
Result on M3 Ultra (apple/arm64):
BatchLoss (mask): 3 allocs / 3072 B / 29.3 μs → 0 / 0 / 29.3 μs
^^^^^^^^^^
BatchLossNoMask: 3 allocs / 3072 B / 29.0 μs → 0 / 0 / 29.5 μs
Allocs: -100% bytes -99.97% time flat
Time is flat because the kernel is compute-dominated (vocab × tokens
fma ops on the inner loop); the saving is purely in GC pressure / heap
arena turnover. Across a 1000-step run with 32k vocab that's
~750 MB of saved allocation pressure on the GC.
Bit-exact training output preserved — the buffers are zero'd on
make-new (Go guarantees zero-init), and per-cell logSoftmax overwrites
every position before read. Pool entries that flow back in are
reslice'd to [:vocab] so any tail-stale values are out of view.
W10 lever: W10-V sync.Pool of *[]T for vocab-sized scratch
generalised. The benchmark file extension is the contract — future
distill regressions on this surface land in CI.
…itecture 11.0→2.17 ns (-81%), Empty 6.2→2.19 ns (-65%); fires once per Format so the savings compound across every chat-completion handler
Every call site passes exactly two strings (r.Response,r.Completion; r.Problem,r.Question; r.Thinking,r.Reasoning; r.Solution,r.Answer). The `...string` form materialised a 2-slot []string per call and walked it via range. Swap to firstNonEmpty(a, b string) — a strict specialisation of the existing two-element pattern that skips the slice header build + range iteration. bench delta (300ms, 2 runs): LoadJSONL_PromptResp_1000 609 µs → 528 µs -13% LoadJSONL_Reasoning_1000 693 µs → 600 µs -13% LoadJSONL_Mixed_1000 933 µs → 829 µs -11% Allocs unchanged. Bit-exact output (same trim-then-pick semantics). Co-Authored-By: Virgil <virgil@lethean.io>
…m -10-22%, ModelRoot -20%; options/helpers at floor)
…g cache + BatchLoss scratch pool — EmitProbe 5→1 alloc -80% time, BatchLoss 3→0 allocs -100% B)
…ine-via-fast-path — NormaliseRole_Canonical 6.04→0.62 ns -90% ~10×, TemplateName_Architecture -80%, Format paths -23-32%; fixed Builder under-alloc bug for aliased roles)
…4% / -64 B / 1 alloc gone
All four callers of toRootAdapterInfo pass slices that the metal side
has already cloned for caller isolation:
* toRootMetrics — metrics.Adapter assigned via m.Adapter() (clones)
* adapterFromNativeInfo + (*Model).Adapter — info.Adapter via
m.Info() → m.Adapter() (clones)
* inference_contract.go — passes adapter.model.Adapter() directly
The core.SliceClone(info.TargetKeys) at this layer was a second clone
of an already-isolated slice. Share the slice directly instead.
Benchmarks (M3 Ultra):
Typical (4 keys): 30.32 ns / 64 B / 1 → 4.92 ns / 0 B / 0 -84% / -100% bytes
Empty: 5.53 ns / 0 B / 0 → 4.91 ns / 0 B / 0 -11% (loop drop)
Fires once per Info() / Metrics() / Adapter() read on any LoRA-loaded
model. Comment block at the function pins the contract — every caller
must pre-clone before passing, so future drift through new call sites
is caught at review time.
Co-Authored-By: Virgil <virgil@lethean.io>
…er redundant-clone drop Companion to the previous toRootAdapterInfo commit. Confirms the 0-alloc / 0 B baseline propagates through toRootMetrics on the LoRA-loaded code path (where Adapter.TargetKeys is non-empty). Co-Authored-By: Virgil <virgil@lethean.io>
The internal LoadJSONL pipeline runs chat.NormaliseRole on every
message Role at assembly time (appendMessagesFromOpenAI and
appendMessagesFromShareGPT both normalise before appending to the
slice), so the reverse-scan in MessagesToSample always sees the
canonical "assistant" string in the common case. Add a `role ==
"assistant"` direct comparison ahead of the NormaliseRole fallback —
the cross-package function call is skipped on every match, and the
fallback still runs for any external caller passing un-normalised
roles ("gpt", "MODEL", " bot ").
bench delta (300ms, 5 runs):
MessagesToSample_AssistantTail 204 ns → 137 ns -33%
MessagesToSample_UserTail 202 ns → 148 ns -27%
MessagesToSample_10Turn 332 ns → 241 ns -27%
LoadJSONL_OpenAI_1000 1574 µs → 1354 µs -14%
LoadJSONL_ShareGPT_1000 1105 µs → 937 µs -15%
Allocs unchanged. Behaviour bit-exact — NormaliseRole is idempotent
on canonical inputs (verified across user/assistant/system + gpt/bot/
model/human) so the short-circuit returns the same boolean.
Co-Authored-By: Virgil <virgil@lethean.io>
…B per-row stack copy jsonRecord is a 14-field struct (12 strings × 16 B + 2 slice headers × 24 B = ~256 B). The value-receiver form was memcpy'ing the whole struct into toSample's frame on every row — 256 KB of stack movement across a 1000-row corpus before any per-row work began. Switching to *jsonRecord passes a single 8-byte pointer; the method is read-only over r (no field mutation), so call-site semantics are identical. bench delta (400ms, 4 runs, mean): LoadJSONL_TextOnly_1000 518 µs → 419 µs -19% LoadJSONL_TextOnly_10000 4.89 ms → 4.24 ms -13% LoadJSONL_Alpaca_1000 871 µs → 753 µs -14% LoadJSONL_ShareGPT_1000 1105 µs → 955 µs -14% LoadJSONL_Reasoning_1000 600 µs → 576 µs -4% LoadJSONL_Mixed_1000 829 µs → 784 µs -5% Allocs unchanged. Behaviour bit-exact — pointer dereference reads the same values as the value copy, and toSample never writes through r. Co-Authored-By: Virgil <virgil@lethean.io>
…/-1 alloc (double-clone bug fixed), toRootProbeLogits -43% W8-A2 unsafe cast, chatMessagesAsMetal reinterpret 0-alloc 0-cost)
…dJSONL TextOnly -63%, NewJSONL_1000 -73%; per-field record zero + samples pre-size + pointer receiver + firstNonEmpty two-arg + assistantIdx raw-compare)
W10-T forward-flagged the missing coverage: tokenizer.ggml.tokens arrays at 10k / 200k entries are the dominant ReadInfo cost for any gguf that embeds its vocabulary, but the existing benches stop at the 200-string proxy. Add two benches that frame a synthetic vocab through the same ggufValueTypeArray / ValueTypeString path the production reader walks, plus extend the bench harness's writeTestGGUFForBench to encode ggufArraySpec entries. Baseline on Apple M3 Ultra (-benchtime=2s): TokeniserVocab_10k 443µs/op 437kB/op 19915 allocs/op TokeniserVocab_200k 8.72ms/op 8.25MB/op 399915 allocs/op Per-entry cost = 2 allocs (one for the string box into []any, one for the string payload itself). The forward note flagged the boxed-element cost as the specialisation target. Co-Authored-By: Virgil <virgil@lethean.io>
Walk the name directly into the final "NN-body" buffer past the prefix bytes
instead of building a separate body slice and re-allocating for the final
concat. Reserve fallback capacity ("chapter-N" shape) up front so the empty-
name path stays single-alloc rather than growing twice.
BenchmarkSlug_Empty 83.4ns/103B/2 → 31.8ns/32B/1 allocs (-62% ns, -69% B, -50% allocs)
BenchmarkSlug_Clean 52.4ns/47B/1 → 37.4ns/32B/1 allocs (-29% ns, -32% B)
BenchmarkSlug_MixedCase 135.0ns/127B/2 → 121.6ns/127B/2 (slight)
BenchmarkBundleURI 75.2ns/111B/2 → 60.5ns/95B/2 (-20% ns, -14% B)
Co-Authored-By: Virgil <virgil@lethean.io>
W10-T forward-flagged: tokenizer.ggml.tokens with 200k+ vocab
entries dominates ReadInfo cost on tokeniser-embedded gguf models.
The generic readGGUFValue path was boxing each string into a []any
slot — one extra alloc and one 16-B interface header per token.
readGGUFValue now detects elementType == ValueTypeString inside the
ggufValueTypeArray branch and returns []string directly, calling
readStringIntoArena / readGGUFString through a tight switch-free
loop that never widens through interface{}.
metadataArrayLen already handled both []any and []string, so the
only production-code consumer (vocab-size inference) was free.
The in-package roundtrip test pattern-matched []any directly —
updated to a type switch that accepts either shape (non-string
element types still land as []any).
Apple M3 Ultra, benchtime=2s:
TokeniserVocab_10k 443µs → 343µs (-22.6% time)
437kB → 277kB (-36.6% bytes)
19915 → 9915 allocs (-50.2%, exactly -1
alloc per vocab entry)
TokeniserVocab_200k 8.72ms → 5.70ms (-34.7% time)
8.25MB → 5.05MB (-38.8% bytes)
399915 → 199915 allocs (-50.0%, exactly
-1 alloc per vocab entry)
Co-Authored-By: Virgil <virgil@lethean.io>
Factor the slug body writer into appendSlugBody so bundleURI can append the canonical "NN-body" fragment directly into a buffer already carrying the URI prefix, then append the "/bundle" suffix. Drops the intermediate slug string + outer concat to a single allocation. slug() retains its public test/bench shape via the same helper. BenchmarkBundleURI 60.5ns/95B/2 → 43.6ns/64B/1 alloc (-28% ns, -33% B, -50% allocs) Slug benchmarks unchanged (within bench noise, slight function-call overhead). Co-Authored-By: Virgil <virgil@lethean.io>
… TokeniserVocab_200k 8.72→5.70 ms -34.7%, -200k allocs per model load)
Hoist nine constant-message error constructors out of validateStoreKind, Run, runChapter, and resultError into package-scope sentinels so repeated validation/short-circuit paths do not allocate a fresh *Err on every call. Adds chapterFault as the sentinel-friendly sibling of chapterError; messages remain stable so JSON Error fields read identically. BenchmarkValidateStoreKind_Bad 0 B/0 allocs (was alloc-per-call) BenchmarkRun_Bad_MissingGenerate 80 B/1 alloc (remaining alloc is cfg clone) Co-Authored-By: Virgil <virgil@lethean.io>
Pass len(loadedBundle.Blocks) as an upper bound for unique chunk reads when
runChapter wires the counting store under runner.Generate. Adds
newCountingStoreHint so callers with a known cap skip the Generate-time
map-grow rehashes; newCountingStore stays as the zero-hint default.
BenchmarkCountingStore_Hinted_FillsExpected 678ns/2344B/3 allocs
BenchmarkCountingStore_Unhinted_FillsExpected 2456ns/4456B/9 allocs
-72% ns, -47% B, -67% allocs at N=64
Co-Authored-By: Virgil <virgil@lethean.io>
…prefix escapes into 1 writeBlockCacheHashString allocated a stack [4]byte length prefix per call and leaked it into hash.Hash.Write through the interface boundary. Four calls per blockRefs / blockCacheID meant four heap allocations for the length prefixes alone. writeBlockCacheHeader composes all four length-prefixed identity strings into a single buffer and writes them once. Bench delta (per call, M3 Ultra, benchtime=200ms): BlockCacheID_512TokenPrefix: 1258 → 1171 ns (-7%), 8 → 5 allocs BlockCacheID_2048TokenPrefix: 4272 → 4274 ns, 8 → 5 allocs WarmCache_Miss_512Tokens: 3589 → 3371 ns (-6%), 16 → 13 allocs WarmCache_Miss_2048Tokens: 8741 → 8621 ns, 28 → 25 allocs Three fewer allocations per WarmCache_Miss and per blockCacheID — load- bearing on the prompt-cache warm path because every block insertion fires the hash header setup. Co-Authored-By: Virgil <virgil@lethean.io>
Add a focused benchmark for the already-flat [1,vocab] lastTokenLogits shape and record the rejected handle-clone probe in GOAL.md. The runtime remains on the existing Reshape2 path because the matched retained request-context row regressed. Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Use DefaultProductionLane for the plain driver-profile generation and reporting defaults so the fast lane no longer falls back to the older one-run 32-token smoke shape. Co-Authored-By: Virgil <virgil@lethean.io>
Add a driver-profile fast-lane setup benchmark and pre-size the restore function list from the accepted Gemma 4 gate count. This removes one setup allocation and 96 bytes per plain and hyper-long default application. Co-Authored-By: Virgil <virgil@lethean.io>
Measure driverProfileRuntimeGates with the accepted fast-lane gates active so report-path allocations are visible before the next optimisation pass. Co-Authored-By: Virgil <virgil@lethean.io>
Allocate the runtime-gate report map lazily and size it for the accepted Gemma 4 gate set. BenchmarkDriverProfileRuntimeGates_DefaultFastLane drops from 952 B / 5 allocs to 664 B / 4 allocs per call. Co-Authored-By: Virgil <virgil@lethean.io>
Write retained reference-turn material directly into the chat-template prompt builder instead of building an intermediate wrapped string first. BenchmarkStateRampProfileTurnPrompt_Gemma4WholeTurn drops from about 541 ns / 4128 B / 4 allocs to about 180 ns / 1312 B / 2 allocs. Co-Authored-By: Virgil <virgil@lethean.io>
Strip Gemma 4 thought/control markers with a single pass instead of repeated replace/split calls. BenchmarkStateRampProfileVisibleOutput_Gemma4LongThoughtBlock drops from about 172 ns / 224 B / 4 allocs to about 110 ns / 128 B / 1 alloc. Co-Authored-By: Virgil <virgil@lethean.io>
Pre-size token phase trace storage for traced fresh and retained generation using MaxTokens. BenchmarkTokenPhaseTraceAppend_Preallocated1024 records about 34 us / 200 KB / 1 alloc versus about 89 us / 695 KB / 12 allocs for the nil-slice append path. Co-Authored-By: Virgil <virgil@lethean.io>
Keep hidden-output driver profiles timing-only by omitting decoded sampled text unless full output is requested. Add an explicit token phase text generation option for diagnostics that need decoded text. Co-Authored-By: Virgil <virgil@lethean.io>
Use the production long-form token budget for retained state-ramp turns instead of the old 1024-token smoke default. Keep the JSON contract proving the default is 8192 tokens. Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Co-Authored-By: Virgil <virgil@lethean.io>
Use a caller-owned stack slice when evaluating prefill cache state on the production cache-only and detach paths, while keeping the allocating compatibility helper for existing callers and tests. Benchmarks show the 26-cache Gemma 4 collector path moving from 153.6 ns/op, 416 B/op, 1 alloc/op to 109.1 ns/op, 0 B/op, 0 allocs/op. Co-Authored-By: Virgil <virgil@lethean.io>
Clear MLX allocator cache pressure between heavy paged-cache benchmark iterations via a raw internal helper, keeping runtime ClearCache behaviour unchanged while preventing broad benchmark sweeps from accumulating excessive active/cache memory. Verified with focused paged-cache benches, focused ClearCache/cache tests, go test ./go/internal/metal -count=1, and go test ./go/... -count=1. Co-Authored-By: Virgil <virgil@lethean.io>
Use stack-backed start/end slices for Gemma 4 gate/up split helpers so the MoE decode split path no longer allocates fresh slice metadata every call. Adds BenchmarkExpertIDSplitLastDimArray_Gemma4Decode; focused bench drops from 3 allocs/op to 2 allocs/op. Verified focused fused-gate tests, go test ./go/internal/metal -count=1, go test ./go/... -count=1, rebuilt lthn-mlx, and driver-profile smoke at 118.2 tok/s. Co-Authored-By: Virgil <virgil@lethean.io>
|




@coderabbitai summary
Summary by CodeRabbit
New Features
Improvements
Documentation