Releases · Anbeeld/beellama.cpp

22 May 17:15

Anbeeld

v0.2.0

285933e

v0.2.0 Latest

Latest

Added compatibility with upstream DFlash PR drafter GGUFs that use general.architecture = dflash. Bee now keeps this separate from the older dflash-draft schema, understands upstream metadata keys such as dflash.block_size and dflash.target_layer_ids, reads upstream tensor names, and keeps existing Bee/buun draft GGUF naming intact.
Tightened DFlash draft model discovery and converter behavior. Bee now prefers exact sibling DFlash draft directories, supports nested dflash_config metadata, scopes Gemma4 tokenizer handling correctly, and logs clearer DFlash metadata warnings and summaries during conversion.
Hardened recurrent memory, prompt-cache restore, and unified-KV scheduling. Recurrent resize now repairs its metadata after shrink/expand, the server shrinks recurrent state before prompt-cache save/load when it is safe, backup-sequence cleanup is tracked correctly, and non-parent tasks defer unified-KV admission so large pending prompts do not over-commit shared cells.
Added richer DFlash diagnostics, profiling, and validation. GGML_DFLASH_PROFILE now exposes categorized summary/replay/copy/prefill/verify/trace logging, routine decode timing is hidden behind debug logging instead of always printing, the profit controller now logs when it disables speculative depth, drafter/target contract and input validation are stricter, and Bee also exposes targeted debug envs such as GGML_DFLASH_DEBUG, GGML_DFLASH_INPUT_DEBUG, GGML_DFLASH_CUDA_DEBUG, GGML_DFLASH_FORCE_CPU_CROSS, GGML_DFLASH_VERBOSE_CONTRACT, and GGML_DFLASH_CRASH_TRACE.
Improved DFlash CUDA ordering and split-buffer correctness. Hidden capture, recurrent replay, backup copies, K/V projection-cache updates, and DFlash stream waits now use explicit ordering helpers and safer backend ownership checks instead of broader synchronization or wrong-buffer access.
Added DFlash drafter K/V projection caching for the cross-attention window. Bee now keeps ring-backed drafter K/V state for recent target hidden-state windows, supports chronological D2D append/interleave on CUDA, excludes the unsafe parts from graph capture when needed, and falls back more safely on placements that cannot use the fast GPU path.
Reworked DFlash prefill capture and flush handling. Prefill capture now uses per-slot and per-view plans, GPU staging buffers, source-aware CPU/GPU ring validity, suffix-span tracking across internal ubatches, graph-reuse keys for source/destination offsets, callback suppression for irrelevant ubatches, and fail-closed behavior for partial or mismatched captures.
Hardened target hidden-state capture across Qwen3.5, Qwen3.5-MoE, Gemma4-ISWA, hidden-only contexts, GPU tape, and multi-slot GPU cross data. Capture layer assignment, token-count derivation, callback routing, and GPU multi-slot cross collection now have explicit correctness checks.
Reduced greedy DFlash verification overhead and made verifier control stricter. Eligible verify batches can use reduced top-k logits without raw-logit readback, Bee keeps seed-row alignment correct, the flat verify horizon is capped, server-side depth control is authoritative, and the reduced path falls back when grammar, sampler, or reasoning state requires full logits.
Hardened DFlash reasoning, draft, and suffix handling. Reasoning-end forcing now goes through the normal full-logits path when needed, invalid reduced-logits drafts are rejected instead of crashing or looping, empty drafts fall back safely, accepted-prefix full-KV commits respect the drafter window, explicit --spec-draft-ctx-size overrides are tracked correctly, Bee keeps the DFlash auto--cd 256 default path when no draft ctx is passed, and the drafter stays aligned with the live accepted suffix.
Improved Gemma 4 support substantially. Bee added Gemma4-ISWA DFlash target plumbing and profiling callbacks, ported the cleaner upstream Gemma4 graph and loader path back onto Bee hooks, restored Bee precision behavior where needed, synced SWA max-position authority and 512-dim FlashAttention selection with upstream, and fixed Gemma multimodal image decode and dynamic resize bounds.
Extended CUDA kernel coverage and backend hardening. Bee now keeps 512-wide quantized FlashAttention instances for standard and TurboQuant/TCQ KV combinations, syncs upstream Hadamard rotation plumbing, propagates CUDA driver links correctly, and hardens op-table / Gated DeltaNet integration alongside long-context GPU ring stability fixes.
Reduced peak memory in the perplexity tool and fixed streaming perplexity / KLD cache handling. Streaming perplexity now writes bounded chunks, checks stream errors, avoids retaining unbounded logits for long-context KL runs, and keeps the logits-cache format versioning compatible with the legacy magic.
Completed the malformed tool-call guard path for non-stream responses. Final OpenAI-compatible responses now quarantine malformed raw tool-looking text the same way streamed tool-parsing responses already did.

Windows:

Assets 6

13 May 16:54

Anbeeld

v0.1.2

633cd34

v0.1.2

Fixed the adaptive profit controller's no-spec baseline path. Profit mode now seeds baseline samples before positive-depth warmup, can shut DFlash fully off when the measured baseline wins, and no longer makes speculative decisions from draft-only telemetry.
Fixed profit-controller reset handling across context-bucket and configuration changes so cleared baseline telemetry cannot leave the controller in a stale active or off state.
Added low-frequency profit-controller baseline reprobes with --spec-dm-profit-baseline-interval / LLAMA_ARG_SPEC_DM_PROFIT_BASELINE_INTERVAL so runs can refresh target-only timing as context grows. The default interval is 1024 active speculative cycles; reprobes resume the previous active draft depth and avoid off-probe counter starvation.
Hardened active-reasoning EOS handling. When an end-of-generation token appears while reasoning output is still active, the sampler now forces the reasoning-end sequence through the normal full-logits path; reduced DFlash verification rejects that case instead of accepting an unsafe reduced candidate set.
Hardened DFlash on split CUDA / multi-GPU placement. GPU cross-ring setup, hidden capture, CUDA graph capture, K/V projection cache updates, recurrent replay, conv replay, and async tensor get/set paths now check buffer/backend ownership and fall back to safer CPU or owning-buffer paths instead of reading or writing recurrent state through the wrong CUDA backend.
Added clearer diagnostics and regression coverage for multi-GPU DFlash fallback decisions, CUDA graph buffer visibility, wrong-device async tensor access, active-reasoning reduced-sampling rejection, adaptive DM defaults, and profit-controller baseline behavior.
Fixed ROCm 7 build: added cudaPointerAttributes / cudaMemoryType shim aliases to hip.h, extended CUDART_VERSION >= 10000 guards with || defined(GGML_USE_HIP) so the .type field path is taken on HIP, and removed the WIN32 guard around TurboQuant flash-attention instance compilation so Linux ROCm builds include the turbo KV-cache kernels (acerspyro#11).
Known limitation: the current multi-GPU DFlash path is a correctness fallback, not a performant split-GPU implementation. On split target placement it can be slower than non-speculative decoding because recurrent replay and hidden capture avoid unsafe single-backend GPU fast paths. A performant implementation still needs per-device replay graphs or a scheduler that follows ggml's split-buffer ownership model.

Windows:

Assets 6

11 May 02:50

Anbeeld

v0.1.1

547676e

v0.1.1

Improved agentic tool-call reliability with lazy grammars. DFlash now remains enabled before a lazy grammar trigger, but stops speculating once grammar-constrained output or reasoning-budget forcing requires normal token-by-token sampling.
Fixed DFlash accept bookkeeping at grammar and tool-call boundaries. The server now distinguishes accepted draft tokens from bonus-token-shaped results, updates DFlash hidden-state rows with the root plus accepted draft tokens, and uses the same keep count for rollback.
Added a DFlash suppression guard for raw tool-call markers. When a tool marker appears while lazy grammar is enabled, the server suppresses DFlash for the rest of that response without steering sampler state; fenced code and embedded marker-like strings are excluded from the guard.
Made partial OpenAI-compatible tool-call streaming safer. The server can stream a stable tool name/id early so clients can show a pending tool call, while withholding partial arguments until the parser sees a complete call.
Quarantined malformed raw tool-call text in tool-parsing streams. Unfinished or malformed tool-looking text no longer leaks into visible assistant content or hidden reasoning deltas before the parser can classify it.
Accepted direct tag-style function starts for Qwen-style tool calls. Lazy grammar triggers now include structural function markers such as <function=, and the tag parser can parse valid direct function calls without the outer <tool_call> wrapper.
Added regression coverage for Kimi and Qwen tool-call streaming, malformed raw marker quarantine, fenced-code false positives, direct Qwen function calls, lazy grammar triggers, and DFlash speculative boundary plumbing.
Fixed small build issues found after 0.1.0: the DFlash callback setup now uses an explicit callback type for GCC 15, and tests/server code include the required standard headers for INT_MAX and FLT_MAX.

Windows:

Assets 6

09 May 16:31

Anbeeld

v0.1.0

a1e392e

v0.1.0

DFlash speculative decoding: --spec-type dflash drives a DFlash draft GGUF alongside the target model. The target captures hidden states into a per-layer 4096-slot ring buffer, the drafter cross-attends to the most recent --spec-dflash-cross-ctx hidden-state tokens and proposes drafts for target verification.
TurboQuant / TCQ KV-cache compression: Five cache types (turbo2, turbo3, turbo4, turbo2_tcq, turbo3_tcq) spanning from 4x to 7.5x compression, with higher-bit options being practically lossless in many cases. Set independently with --cache-type-k and --cache-type-v.
Adaptive draft-max control: The server adjusts the active draft horizon at runtime instead of using a fixed --spec-draft-n-max. The default profit controller compares speculative throughput against a no-spec baseline; the fringe alternative maps acceptance-rate bands to draft depth. Use --no-spec-dm-adaptive for a static horizon.
Full multimodal support: When --mmproj is active, the server keeps flat DFlash available for text generation. The model can be fully offloaded to CPU with no problems to reduce VRAM pressure.
Reasoning-loop protection: The server detects repeated hidden reasoning output and intervenes. Default mode is force-close with --reasoning-loop-window and --reasoning-loop-max-period tuning available.
Sampled DFlash verification: --spec-draft-temp enables rejection-sampling drafter behavior. Activates when both draft and target temperature exceed zero. Draft log probabilities must be available for rejection sampling to produce correct output.
DDTree branch verification: optional --spec-branch-budget adds branch nodes beyond the main draft path with GPU parent_ids, tree masks, and recurrent tree kernels. Disabled automatically when the target model spans more than one GPU. This one is very much work in progress!
Request-level speculative overrides: Draft-max and branch budget can be overridden per-request through JSON fields without restarting the server.
CopySpec model-free speculation: --spec-type copyspec provides rolling-hash suffix matching over previous tokens without a draft model. Results must be benchmarked per workload.

Windows:

Assets 6

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

Uh oh!

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

Uh oh!

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

Uh oh!

Choose a tag to compare

Sorry, something went wrong.

Sorry, something went wrong.

Uh oh!

No results found

Uh oh!

Uh oh!

Releases: Anbeeld/beellama.cpp

v0.2.0

Uh oh!

v0.1.2

Uh oh!

v0.1.1

Uh oh!

v0.1.0

Uh oh!