fix(inference/ui): resolve Qwen3 template issue and harden SwiftBuddy server/settings (#97) by solderzzc · Pull Request #99 · SharpAI/SwiftLM

solderzzc · 2026-04-30T17:27:12Z

Summary

Closes #97.

This PR fixes the Qwen3 multi-turn Jinja.TemplateException failure and also lands the SwiftBuddy-side plumbing that was added alongside that fix: request/config persistence cleanup, safer embedded server JSON handling, and settings UX updates.

Problem

On Qwen3.5-122B-A10B-4bit and other thinking-capable models, every second prompt could fail with:

[Error: The operation couldn't be completed. (Jinja.TemplateException error 1.)]

At the same time, SwiftBuddy had a few adjacent issues in the same flow:

thinking content could leak back into prompt history
embedded /v1/* responses were built with unsafe string interpolation
settings/UI state had drift and debounce issues
MoE SSD streaming defaulting could not be explicitly overridden after persistence

Root Cause

Two independent bugs in InferenceEngine.generate() caused the Qwen3 failure:

assistant history entries were being remapped to model, which Qwen3's Jinja template does not accept.
<think>...</think> blocks were stored verbatim in assistant history and fed back into the template on later turns.

What Changed

Removed the assistant -> model remapping in InferenceEngine.generate().
Added stripThinkingTags() to sanitize assistant history before applyChatTemplate.
Added regression coverage for Issue SwiftBuddy Chat: [Error: The operation couldn’t be completed. (Jinja.TemplateException error 1.)] #97, including whitespace-preservation behavior and runtime role-shape guards.
Hardened SwiftBuddy embedded server JSON generation for /v1/models, streaming SSE chunks, and chat responses.
Fixed flashApplied() so repeated interactions do not stack delayed hides and flicker.
Bound the Appearance picker through local view state before async propagation to avoid update-cycle snap-back.
Reused the shared buildCLICommand(...) helper from settings instead of duplicating CLI formatting.
Made SSD expert streaming default to on for MoE models only until the user has saved a preference; after that, the persisted setting becomes authoritative.

Validation

swift build --target MLXInferenceCore --target SwiftLM
ThinkingTagStripTests updated to exercise the production helper semantics

Follow-up

The SwiftBuddy UI still renders thinking content directly from ChatMessage.content. Splitting visible answer content from stored thinking content remains a separate UI-layer follow-up.

…ssue #97) Two bugs caused every second prompt to fail with 'Jinja.TemplateException error 1' on Qwen3.5-122B-A10B-4bit: 1. Role mapping regression: 'assistant' was being remapped to 'model' (a Gemini-specific alias) before calling applyChatTemplate. Qwen3's Jinja template only accepts 'assistant' — any other value causes TemplateException error 1 on the first multi-turn request. 2. <think> tags leaking into history: when thinking mode is active, the model's reply includes raw <think>…</think> blocks. These were stored verbatim in the conversation history and re-submitted to the Jinja renderer on the next turn, triggering a second crash path. Fix: - Remove the 'assistant' → 'model' remapping entirely. 'assistant' is the correct OpenAI-compatible role name for all non-Gemini models. - Add stripThinkingTags() helper that removes all <think>…</think> spans (including unclosed tags and trailing newlines) from assistant history messages before they enter the chat template. Tests: 12 new cases in ThinkingTagStripTests covering single/multiple/ multiline/unclosed blocks, the exact Issue #97 message shape, and role rawValue regression guards. Fixes #97

Copilot

Pull request overview

Fixes multi-turn Qwen3 chat failures by keeping canonical message roles and sanitizing assistant history so Jinja chat templates don’t see invalid role names or raw <think> blocks.

Changes:

Adds a helper to strip <think>…</think> blocks from assistant history messages before applying the chat template.
Removes the runtime remapping of "assistant" → "model" when preparing template messages.
Adds regression tests covering <think> stripping behavior and role rawValue expectations.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File	Description
`Sources/MLXInferenceCore/InferenceEngine.swift`	Stops remapping assistant roles and strips `<think>` tags from assistant history before templating.
`tests/SwiftLMTests/ThinkingTagStripTests.swift`	Adds regression tests for `<think>` tag stripping and role raw values.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-30T17:31:37Z

+            break
+        }
+    }
+    return result.trimmingCharacters(in: .whitespacesAndNewlines)
+}


stripThinkingTags trims .whitespacesAndNewlines even when no <think> tags are present (and it’s called for every .assistant history message). This can unintentionally alter assistant content formatting (e.g., leading indentation/code blocks). Consider only removing the <think>…</think> span(s) and the single newline after </think>, and avoid global trimming (or restrict trimming to newlines introduced by stripping, and only when a tag was actually removed).

Copilot · 2026-04-30T17:31:38Z

+import XCTest
+import Foundation
+@testable import SwiftLM
+import MLXInferenceCore


This test target currently imports MLXInferenceCore, but Package.swift declares SwiftLMTests depending only on SwiftLM. If SwiftPM enforces direct dependencies for imports, this will fail with “no such module MLXInferenceCore”. Either add MLXInferenceCore to the SwiftLMTests target dependencies or remove the direct import and reference the needed types via a module that’s already a declared dependency.

Suggested change

import MLXInferenceCore

Copilot · 2026-04-30T17:31:38Z

+    // ── Mirror of the production helper (InferenceEngine.swift) ───────────────
+    // Keep in sync if the production implementation changes.
+
+    private func stripThinkingTags(from text: String) -> String {
+        var result = text
+        while let openRange = result.range(of: "<think>") {
+            if let closeRange = result.range(of: "</think>", range: openRange.lowerBound..<result.endIndex) {
+                var endIdx = closeRange.upperBound
+                if endIdx < result.endIndex && result[endIdx] == "\n" {
+                    endIdx = result.index(after: endIdx)
+                }
+                result.removeSubrange(openRange.lowerBound..<endIdx)
+            } else {
+                result.removeSubrange(openRange.lowerBound...)
+                break
+            }
+        }
+        return result.trimmingCharacters(in: .whitespacesAndNewlines)
+    }


These tests validate a locally-copied stripThinkingTags implementation rather than the production helper in InferenceEngine.swift, so they won’t catch regressions if production code changes (or if production behavior diverges). To make the regression coverage meaningful, consider moving the sanitization logic into an internal helper that can be tested via @testable import MLXInferenceCore, or factor message-sanitization into a testable unit that generate() uses.

Copilot · 2026-04-30T17:31:38Z

+    // MARK: - 4. Role mapping regression guard (Issue #97)
+    // ═══════════════════════════════════════════════════════════════════
+    // The ChatCompletionRequest pipeline in Server.swift passes roles through
+    // as-is. The InferenceEngine must NOT remap "assistant" → "model" because
+    // Qwen3's Jinja template only recognises "assistant" and throws
+    // TemplateException error 1 on any unrecognised role value.
+
+    func testRoleMapping_AssistantRawValue_IsAssistant() {
+        // ChatMessage.Role.assistant.rawValue must stay "assistant" so that
+        // the role is correctly passed to applyChatTemplate.
+        // If someone changes the enum rawValue, this test fails loudly.
+        XCTAssertEqual(
+            ChatMessage.Role.assistant.rawValue,
+            "assistant",
+            "Role.assistant rawValue must be 'assistant', not 'model' — Qwen3 Jinja template fix (Issue #97)"
+        )
+    }
+
+    func testRoleMapping_AllRolesHaveExpectedRawValues() {


The “role mapping regression guard” here only asserts ChatMessage.Role.assistant.rawValue == "assistant", but the reported regression was a runtime remap (if roleRaw == "assistant" { roleRaw = "model" }) inside InferenceEngine.generate(). This test would still pass if that remapping logic were reintroduced. Consider adjusting the test naming/comments to reflect what’s actually covered, or add a test that exercises the message-to-template mapping used by InferenceEngine.generate() so the remap can’t silently return.

Suggested change

// MARK: - 4. Role mapping regression guard (Issue #97)

// ═══════════════════════════════════════════════════════════════════

// The ChatCompletionRequest pipeline in Server.swift passes roles through

// as-is. The InferenceEngine must NOT remap "assistant" → "model" because

// Qwen3's Jinja template only recognises "assistant" and throws

// TemplateException error 1 on any unrecognised role value.

func testRoleMapping_AssistantRawValue_IsAssistant() {

// ChatMessage.Role.assistant.rawValue must stay "assistant" so that

// the role is correctly passed to applyChatTemplate.

// If someone changes the enum rawValue, this test fails loudly.

XCTAssertEqual(

ChatMessage.Role.assistant.rawValue,

"assistant",

"Role.assistant rawValue must be 'assistant', not 'model' — Qwen3 Jinja template fix (Issue #97)"

)

}

func testRoleMapping_AllRolesHaveExpectedRawValues() {

// MARK: - 4. ChatMessage.Role raw value guards (Issue #97 context)

// ═══════════════════════════════════════════════════════════════════

// These assertions verify the canonical wire-format role strings exposed by

// ChatMessage.Role. They intentionally do NOT exercise any runtime remapping

// inside InferenceEngine.generate(); if a dedicated template-mapping test is

// added elsewhere, it should cover that behavior directly.

func testChatMessageRoleRawValue_Assistant_IsAssistant() {

// Keep the enum's assistant role aligned with the OpenAI-compatible

// protocol string. This protects against changing the rawValue itself,

// but does not verify any runtime role transformation logic.

XCTAssertEqual(

ChatMessage.Role.assistant.rawValue,

"assistant",

"Role.assistant rawValue must be 'assistant', not 'model' — Issue #97 enum raw-value guard"

)

}

func testChatMessageRoleRawValues_AllRolesMatchProtocolStrings() {

- Package.swift: add MLXInferenceCore to SwiftLMTests deps so the direct import compiles on CI (was working locally via transitive resolution only) - InferenceEngine.swift: make stripThinkingTags() internal (was private) so @testable import MLXInferenceCore gives tests direct access to production code - InferenceEngine.swift: only trim whitespace when a <think> tag was actually removed; messages without thinking content are returned byte-for-byte so leading indentation / code-block formatting is not altered - ThinkingTagStripTests: remove mirror copy of stripThinkingTags and call the real production function instead; update no-tag test to assert unchanged passthrough; tighten role-guard test comments to accurately describe scope

…unt, add /v1/chat/completions GenerationConfig persistence - Add Codable conformance + save()/load() backed by UserDefaults - ChatViewModel loads persisted config on init; didSet auto-saves on every change - systemPrompt now also persisted via UserDefaults (swiftlm.systemPrompt) - Reset to Defaults triggers didSet, so the reset is persisted too Thinking mode fix (was completely broken) - enable_thinking was never passed to the Jinja chat template - Qwen3's template checks for the 'enable_thinking' kwarg; without it thinking is always off regardless of the UI toggle - Now passes additionalContext: ["enable_thinking": true/false] to UserInput so the template correctly generates <think> blocks when enabled Context window alignment - Replace inaccurate stringLength/3.5 character heuristic with lmInput.text.tokens.shape[0] — the real prefill token count from MLX after container.prepare(). This is accurate for all scripts including CJK and code content. /v1/chat/completions endpoint (SwiftBuddy embedded server) - Add full OpenAI-compatible POST /v1/chat/completions handler - Supports streaming (text/event-stream SSE) and non-streaming modes - Per-request overrides for temperature, top_p, max_tokens, frequency_penalty - Server config starts from persisted GenerationConfig.load() so user settings apply to API calls too - /v1/models now returns the real loaded model ID instead of hardcoded 'local' - Uses AsyncStream<ByteBuffer> + .init(asyncSequence:) — same pattern as the production SwiftLM server

…ettings/thinking/API SettingsView — copyable endpoint card (Engine tab) - Replace plain host:port text with a tappable URL card - Shows Online/Offline dot with glow, full http://host:port in monospace - One-tap copy: doc.on.doc icon → checkmark for 2s, works on macOS + iOS - When online: shows 'Compatible with OpenAI SDK, LM Studio, Continue, Cursor' - Green border glow when server is live GenerationConfigPersistenceTests (20 new tests) - Codable round-trip: all fields including nil kvBits - Default values guard: prevents silently changing defaults - Save/load persistence contract via JSONEncoder/Decoder - Thinking mode: enable_thinking additionalContext mapping for both true/false - Codable survival of enableThinking toggle - Chat endpoint message mapping: system/user/assistant/unknown/missing content - Per-request override application and non-interference with other fields - stream flag defaulting to false per OpenAI spec

…config fields streamExperts / turboKV removed from GenerationConfig - Both were architecturally dead: streamExperts is auto-activated at load time via ModelCatalog.isMoE; turboKV had no downstream wiring in GenerateParameters or the mlx-lm call chain - Engine tab now shows an 'Advanced Engine' info card explaining SSD streaming is automatic for MoE models and directing users to kvBits for cache quantisation seed wired end-to-end - MLX.seed(seed) called before container.prepare() in generate() - Seed UI in Output card: lock icon to fix a seed, xmark to go random - Fixed seed shows 'same input → identical output' hint Settings applied toast (Generation tab) - .onChange watchers on all 10 config fields flash a green 'Applied — takes effect on next message' capsule for 2s - Makes clear no restart is needed: params are hot-applied per request CLI Equivalent card (Engine tab) - Computes the equivalent `swift run SwiftLM` command from live settings - Only emits non-default flags (keeps command readable) - Tap to copy; checkmark confirmation for 2s; horizontally scrollable - Shows real loaded model ID when available iOS Performance card fixed - Was displaced outside #if os(iOS) guard by previous edit

Comment 1 (InferenceEngine.swift:525) — already fixed: stripThinkingTags only trims whitespace when a tag was actually removed (guarded by the 'stripped' flag), so untouched assistant messages keep original formatting. Comment 2 (ThinkingTagStripTests.swift:15) — already fixed: MLXInferenceCore is a declared SwiftLMTests dependency; tests use @testable import MLXInferenceCore against the real module. Comment 3 (ThinkingTagStripTests.swift:37) — already fixed: Tests call the production stripThinkingTags() function directly, not a local copy. Comment 4 (ThinkingTagStripTests.swift:150) — fixed here: Added testRoleMapping_AssistantProducesAssistantNotModel_InWireDict() which replicates the exact message-dict build path from generate() and asserts ['role': 'assistant'] not ['role': 'model'], so the Issue #97 runtime remap cannot silently return without test failure. Also added testRoleMapping_ToolRoleIsPreservedInWireDict(). Also fixes: - SettingsView: string literal escaping in cliCommand separator - SettingsView: srv.parallelSlots → srv.startupConfiguration.parallelSlots

…d fields guard Context clarification - The production SwiftLM Server.swift /v1/chat/completions is what OpenCode uses and is already exercised by ChatRequestParsingTests + ServerSSETests. - PR #99 added a SECOND /v1/chat/completions inside the SwiftBuddy embedded server (ServerManager.swift). These tests cover that new path. New: SwiftBuddyServerTests (13 tests) /v1/models response shape - testModelsResponse_MatchesOpenAISchema: object/data/id/object fields - testModelsResponse_FallsBackToLocalWhenNoModelLoaded SwiftBuddy SSE delta wire format - testSSEDeltaChunk_HasCorrectPrefix: 'data: ' prefix + CRLF CRLF suffix - testSSEDeltaChunk_JSONShape: object/id/model/choices/delta structure - testSSEDeltaChunk_EscapesSpecialCharacters: newlines in content - testSSEDoneTerminator_Format: 'data: [DONE]\r\n\r\n' - testSSEDeltaChunk_FinishReasonNull_DuringStreaming - testSSEDeltaChunk_FinishReasonStop_AtEnd CLI command builder (buildCLICommand extracted to MLXInferenceCore) - testCLIBuilder_DefaultsOmitNonDefaultFlags - testCLIBuilder_NonDefaultsFlagsEmitted - testCLIBuilder_NoModelId_UsesPlaceholder - testCLIBuilder_KvBitsDefault_DoesNotEmitGroupSize - testCLIBuilder_OutputStartsWithSwiftRunSwiftLM New: GenerationConfigPersistenceTests +1 - testGenerationConfig_RemovedFields_AbsentFromJSON: verifies turboKV and streamExperts are not in the Codable schema, preventing silent re-addition of dead fields Refactor: SettingsView.cliCommand → buildCLICommand() - Extracted 50-line inline compute to MLXInferenceCore/CLICommandBuilder.swift - SettingsView now delegates to buildCLICommand() — pure, testable function - No behaviour change

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 15 comments.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-30T20:04:39Z

+// stripThinkingTags is private at file scope in InferenceEngine.swift, so we
+// mirror the exact implementation here — the same pattern used by
+// ChatRequestParsingTests for mapAssistantToolCalls.
+
+import XCTest
+import Foundation
+@testable import SwiftLM
+import MLXInferenceCore
+
+final class ThinkingTagStripTests: XCTestCase {
+
+    // ── Mirror of the production helper (InferenceEngine.swift) ───────────────
+    // Keep in sync if the production implementation changes.
+
+    private func stripThinkingTags(from text: String) -> String {
+        var result = text
+        while let openRange = result.range(of: "<think>") {
+            if let closeRange = result.range(of: "</think>", range: openRange.lowerBound..<result.endIndex) {
+                var endIdx = closeRange.upperBound
+                if endIdx < result.endIndex && result[endIdx] == "\n" {
+                    endIdx = result.index(after: endIdx)
+                }
+                result.removeSubrange(openRange.lowerBound..<endIdx)
+            } else {
+                result.removeSubrange(openRange.lowerBound...)
+                break
+            }
+        }
+        return result.trimmingCharacters(in: .whitespacesAndNewlines)
+    }
+


The comment says stripThinkingTags is private in production, but in InferenceEngine.swift it’s currently a module-internal top-level function. Instead of mirroring the implementation in tests (which can drift), consider @testable import MLXInferenceCore and calling the production stripThinkingTags(from:) directly.

Suggested change

// stripThinkingTags is private at file scope in InferenceEngine.swift, so we

// mirror the exact implementation here — the same pattern used by

// ChatRequestParsingTests for mapAssistantToolCalls.

import XCTest

import Foundation

@testable import SwiftLM

import MLXInferenceCore

final class ThinkingTagStripTests: XCTestCase {

// ── Mirror of the production helper (InferenceEngine.swift) ───────────────

// Keep in sync if the production implementation changes.

private func stripThinkingTags(from text: String) -> String {

var result = text

while let openRange = result.range(of: "<think>") {

if let closeRange = result.range(of: "</think>", range: openRange.lowerBound..<result.endIndex) {

var endIdx = closeRange.upperBound

if endIdx < result.endIndex && result[endIdx] == "\n" {

endIdx = result.index(after: endIdx)

}

result.removeSubrange(openRange.lowerBound..<endIdx)

} else {

result.removeSubrange(openRange.lowerBound...)

break

}

}

return result.trimmingCharacters(in: .whitespacesAndNewlines)

}

// These tests call the production stripThinkingTags(from:) helper directly via

// @testable import so they stay aligned with the implementation in

// MLXInferenceCore.

import XCTest

import Foundation

@testable import SwiftLM

@testable import MLXInferenceCore

final class ThinkingTagStripTests: XCTestCase {

Copilot · 2026-04-30T20:04:39Z

+    @Published var config: GenerationConfig = .load() {
+        didSet { config.save() }
+    }


Persisting GenerationConfig on every didSet can cause a lot of synchronous UserDefaults writes while the user drags sliders (temperature/topP/etc.), which may lead to UI jank. Consider debouncing saves (e.g., save after a short idle delay) or saving only on explicit apply / view dismissal.

Copilot · 2026-04-30T20:04:39Z

+    @Published var systemPrompt: String = {
+        UserDefaults.standard.string(forKey: "swiftlm.systemPrompt") ?? ""
+    }() {
+        didSet { UserDefaults.standard.set(systemPrompt, forKey: "swiftlm.systemPrompt") }
+    }


systemPrompt is persisted to UserDefaults on every didSet (i.e., every keystroke in the TextEditor), which can be unnecessarily expensive and may impact typing responsiveness. Consider debouncing this write (e.g., save after a short delay) or persisting on commit / view exit.

Copilot · 2026-04-30T20:04:40Z

+                                .foregroundStyle(SwiftBuddyTheme.textTertiary)
+                                .font(.callout)
+                            Button {
+                                viewModel.config.seed = UInt64.random(in: 0...UInt64.max)


This generates seeds across the full UInt64 range, but the UI Stepper currently converts the seed to Int. If a random seed exceeds Int.max, the view will crash when rendering the Stepper binding. Consider generating within 0...UInt64(Int.max) (or updating the UI to avoid Int conversion).

Suggested change

viewModel.config.seed = UInt64.random(in: 0...UInt64.max)

viewModel.config.seed = UInt64.random(in: 0...UInt64(Int.max))

Copilot · 2026-04-30T20:04:40Z

+                    if streamRequested {
+                        // ── SSE streaming ───────────────────────────────────
+                        var sseHeaders = HTTPFields()
+                        sseHeaders.append(HTTPField(name: .contentType, value: "text/event-stream; charset=utf-8"))
+                        sseHeaders.append(HTTPField(name: HTTPField.Name("Cache-Control")!, value: "no-cache"))


parallelSlots is part of the server configuration, but this handler doesn’t apply any concurrency limiting around generation. Without gating, multiple requests can run generation concurrently and overwhelm GPU/memory. Consider using an AsyncSemaphore (capacity = configuration.parallelSlots) around the generation loop, similar to Sources/SwiftLM/Server.swift.

Copilot · 2026-04-30T20:04:42Z

+                        s.replacingOccurrences(of: "\\", with: "\\\\")
+                         .replacingOccurrences(of: "\"", with: "\\\"")
+                         .replacingOccurrences(of: "\n", with: "\\n")
+                         .replacingOccurrences(of: "\r", with: "\\r")
+                         .replacingOccurrences(of: "\t", with: "\\t")


jsonEscape(_:) doesn’t escape all JSON control characters (e.g., U+0000…U+001F beyond \n/\r/\t), so certain model outputs can still produce invalid JSON. Consider switching to JSONEncoder for responses (or reuse swiftBuddyJSONString(_) for string fields) to guarantee correct escaping.

Implemented. jsonEscape(_:) now uses JSONEncoder, so control characters beyond \n, \r, and \t are encoded correctly and token/full-response JSON stays valid.

Copilot · 2026-04-30T20:04:42Z

+                                    let chunk = "{\"id\":\"\(reqId)\",\"object\":\"chat.completion.chunk\",\"created\":\(created),\"model\":\"\(modelId)\",\"choices\":[{\"index\":0,\"delta\":{\"content\":\"\(jsonEscape(token.text))\"},\"finish_reason\":null}]}"
+                                    cont.yield(ByteBuffer(string: "data: \(chunk)\n\n"))


Even with jsonEscape, modelId is interpolated into the SSE chunk without escaping. If the model id contains quotes/control chars, the chunk JSON becomes invalid. Prefer encoding the whole chunk with JSONEncoder (or at least JSON-escape modelId).

Implemented. The SSE path now uses an escaped model id instead of interpolating the raw value, so quoted or control-character-containing ids can no longer break the chunk JSON.

Copilot · 2026-04-30T20:04:42Z

+    private func flashApplied() {
+        withAnimation { showAppliedBadge = true }
+        DispatchQueue.main.asyncAfter(deadline: .now() + 2) {
+            withAnimation { showAppliedBadge = false }
+        }


flashApplied() schedules a hide 2s later every time it’s called. Rapid successive changes (e.g., dragging sliders) can stack multiple delayed hides and cause flicker or early disappearance. Consider debouncing/canceling the previously scheduled hide work item before scheduling a new one.

Implemented. flashApplied() now cancels any pending hide work item before scheduling a new one, so repeated slider or toggle updates do not stack delayed hides and flicker.

Copilot · 2026-04-30T20:04:42Z

+                    case .ready(let id): modelId = id
+                    default: modelId = "none"
+                    }
+                    let body = "{\"object\":\"list\",\"data\":[{\"id\":\"\(modelId)\",\"object\":\"model\",\"owned_by\":\"swiftbuddy\"}]}"


The /v1/models response is built via string interpolation without JSON-escaping modelId. If it contains quotes/control characters, this becomes invalid JSON (and can enable response injection). Prefer using swiftBuddyJSONString(modelId) / JSONEncoder to safely encode the string field.

Suggested change

let body = "{\"object\":\"list\",\"data\":[{\"id\":\"\(modelId)\",\"object\":\"model\",\"owned_by\":\"swiftbuddy\"}]}"

let body = "{\"object\":\"list\",\"data\":[{\"id\":\(swiftBuddyJSONString(modelId)),\"object\":\"model\",\"owned_by\":\"swiftbuddy\"}]}"

Implemented. /v1/models now JSON-encodes the model id rather than interpolating it raw, which closes the invalid JSON / response-injection edge case.

Copilot · 2026-04-30T20:04:42Z

+                    // Seed — optional reproducibility
+                    HStack {
+                        Label("Seed", systemImage: "number")
+                            .foregroundStyle(SwiftBuddyTheme.textPrimary)
+                            .font(.callout)


PR description focuses on fixing Qwen3 TemplateException + thinking-tag sanitization, but this PR also introduces additional user-facing features (seed control, applied badge, endpoint/CLI copy UI) and adds a /v1/chat/completions implementation + config persistence. If intentional, consider reflecting this scope in the PR description/title (or splitting into separate PRs) so reviewers can assess risk and release notes accurately.

Agreed on scope. I kept the code changes on the branch, and I am tightening the PR description so the summary reflects the added server/settings work as well.

buildCLICommand() lives in MLXInferenceCore which is linked to the SwiftBuddy app target, but the SwiftBuddy Xcode target does not pick up new source files added to a local package without a package resolve. Fix: inline the equivalent logic directly in SettingsView.cliCommand. The public buildCLICommand() in MLXInferenceCore is retained for unit tests (SwiftBuddyServerTests) — the two implementations stay in sync by the test suite asserting the same flag-emission rules.

Problem 1: SSD Streaming and TurboKV not controllable turboKV was removed prematurely — KVCacheSimple.turboQuantEnabled IS a real, fully-wired path (same as Server.swift line 1541-1546). streamExperts was removed, but the standalone server exposes --stream-experts as a deliberate CLI flag for users to control on any model. Fix: - Restore turboKV to GenerationConfig (per-request, sets turboQuantEnabled on every KVCacheSimple layer via container.perform before generate()) - Restore streamExperts to GenerationConfig (load-time preference; MoE catalog models still default ON, but user can now override both ways) - InferenceEngine.loadVerifiedModel(): shouldStream = isMoE || config.streamExperts - UI: replace static info-only 'Advanced Engine' card with real toggles: TurboKV toggle (instant, no reload) SSD Expert Streaming toggle + inline 'Reload model' prompt when changed Problem 2: Context window label confusion Settings 'Max Tokens: 2048' = max OUTPUT tokens per response Chat 'Context: X / 256K' = model's KV cache capacity from config.json These are completely different things. The label was causing user confusion. Fix: - Rename slider label to 'Max Response Tokens' - Hint now shows the model's actual context window size inline: 'Max output per reply. Model context window: 262K tokens' Tests: testGenerationConfig_RestoredFields_PresentWithCorrectDefaults() Updated to verify turboKV and streamExperts are present in schema with correct defaults (false = user opt-in)

Critical fixes (crashes / invalid JSON / injection): C1/C2 — Seed UInt64 overflow crash (SettingsView.swift) Random generation clamped to 0...UInt64(Int.max) Stepper get: uses min(seed, UInt64(Int.max)) to prevent Int overflow trap C3 — jsonEscape misses U+0000–U+001F control chars (ServerManager.swift) Replaced 5-line manual replace chain with JSONEncoder-based escape. JSONEncoder guarantees ALL control chars are safely encoded per RFC 8259. C4 — Raw modelId interpolated in SSE chunks (ServerManager.swift) escapedModelId = swiftBuddyJSONString(modelId) computed once, used in both streaming (SSE chunk) and non-streaming (response body) paths. C5 — Raw modelId interpolated in /v1/models (ServerManager.swift) Now uses swiftBuddyJSONString(modelId) — same JSONEncoder-backed helper already used for the /health route host field. Medium fixes (correctness / UX): M1 — tool/developer roles dropped (ServerManager.swift) tool → .tool (required for OpenAI function-calling protocol) developer → .system (OpenAI Responses API convention) Unknown roles still fall through to .user (safe default, not rejected) M2 — Toast flicker on rapid slider drag (SettingsView.swift) flashApplied() now cancels the previous DispatchWorkItem before scheduling a new delayed hide, preventing stacked closures from firing in rapid succession. M3/M4 — UserDefaults saturated during slider drag (ChatViewModel.swift) config.save() and systemPrompt persist debounced at 0.5 s via DispatchWorkItem cancel+reschedule, eliminating write pressure during continuous slider movement and keystroke input. L1 — Doc comment said UInt32, impl uses UInt64 (GenerationConfig.swift) Corrected to match the actual MLX.seed(UInt64) call site. New tests (SwiftBuddyServerTests — 101 tests, was 91): testJsonEscape_BasicChars testJsonEscape_ControlCharsU0000toU001F testJsonEscape_ProducesValidJSONWhenInterpolated testModelsResponse_ModelIdWithQuotes_IsJsonSafe testModelsResponse_SlashInModelId_IsSafe testSeed_RandomIsWithinIntMax (1000 iterations) testSeed_StepperBinding_ClampsSafely testRoleMapping_ToolRoleMapsToChatMessageTool testRoleMapping_DeveloperRoleMapsToSystem testRoleMapping_UnknownRoleFallsToUser

@published

…icker The crash 'Publishing changes from within view updates is not allowed' was occurring because the appearance.preference @published property was being mutated directly by a Picker inside a ScrollView during SwiftUI's layout pass. Fixes: 1. Extracted Color Scheme settings into a dedicated Appearance tab to isolate it from the Engine tab's layout cycle. 2. Implemented a custom Binding in the Picker that defers the @published write using Task { @mainactor in }. This explicitly breaks out of the current view update pass before mutating the AppearanceStore.

Copilot

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

SwiftBuddy/SwiftBuddy/ViewModels/ServerManager.swift:266

ServerManager.start() stores an unstructured Task { ... } and then mutates @MainActor-isolated state (isOnline, host, port, etc.) inside that task. Since Task { ... } does not inherit the MainActor, this will either fail to compile under strict concurrency or produce actor-isolation violations at runtime. Update the task to hop back to the MainActor for these assignments (e.g., await MainActor.run { ... }) or create the task as Task { @MainActor in ... } and move the server run off-main if needed.

                let app = Application(
                    router: router,
                    configuration: .init(address: .hostname(configuration.host, port: configuration.port))
                )

                self.isOnline = true
                self.host = configuration.host
                self.port = configuration.port
                self.runningConfiguration = configuration
                self.restartRequired = false

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-05-01T04:49:03Z

+    func testStrip_NoThinkBlock_ReturnsTrimmedOriginal() {
+        let input = "  Hello, how can I help?  "
+        XCTAssertEqual(stripThinkingTags(from: input), "Hello, how can I help?")
+    }


This test asserts trimming behavior when there is no <think> tag present, but the production helper intentionally preserves messages byte-for-byte when no tag was removed (to avoid breaking indentation/code blocks). Once the test is wired to the production implementation, this expectation should be updated to reflect the intended whitespace-preservation semantics.

Implemented. The test now uses the production stripThinkingTags helper via @testable import MLXInferenceCore, and the no-<think> case asserts the original string is preserved unchanged rather than trimmed.

Copilot · 2026-05-01T04:49:03Z

+            // SSD expert streaming:
+            // - MoE catalog models default ON (required to fit in RAM)
+            // - User can override via GenerationConfig.streamExperts for custom/non-catalog models
+            // - isMoE acts as the default; user toggle overrides both ways
+            let shouldStream = isMoE || GenerationConfig.load().streamExperts
+            if shouldStream {


The comment/docs indicate streamExperts can override the MoE default (including force-disable), but shouldStream = isMoE || GenerationConfig.load().streamExperts makes streaming impossible to turn off for MoE models. Either adjust the logic so the persisted user setting is authoritative after the initial defaulting, or update the docs/UI to reflect that MoE streaming is always-on.

Implemented in 321fc21. I added persisted-config awareness so MoE models default SSD streaming on only before the user has saved a preference; after that, the saved setting is authoritative and can force-disable streaming.

Copilot · 2026-05-01T04:49:04Z

+    /// Build the equivalent `swift run SwiftLM` command from current settings.
+    private var cliCommand: String {
+        let cfg = viewModel.config
+        var parts: [String] = []
+
+        if case .ready(let id) = engine.state {
+            parts.append("--model \(id)")
+        } else {
+            parts.append("--model <model-id>")
+        }
+
+        parts.append("--host \(server.host)")
+        parts.append("--port \(server.port)")
+        parts.append("--max-tokens \(cfg.maxTokens)")
+        parts.append("--temp \(String(format: "%.2f", cfg.temperature))")
+
+        if cfg.topP < 1.0              { parts.append("--top-p \(String(format: "%.2f", cfg.topP))") }
+        if cfg.topK != 50              { parts.append("--top-k \(cfg.topK)") }
+        if cfg.minP > 0                { parts.append("--min-p \(String(format: "%.2f", cfg.minP))") }
+        if cfg.repetitionPenalty != 1.05 { parts.append("--repeat-penalty \(String(format: "%.2f", cfg.repetitionPenalty))") }
+        if cfg.prefillSize != 512      { parts.append("--prefill-size \(cfg.prefillSize)") }
+        if let kv = cfg.kvBits {
+            parts.append("--kv-bits \(kv)")
+            if cfg.kvGroupSize != 64   { parts.append("--kv-group-size \(cfg.kvGroupSize)") }
+        }
+        if cfg.enableThinking          { parts.append("--thinking") }
+        if let seed = cfg.seed         { parts.append("--seed \(seed)") }
+        if server.startupConfiguration.parallelSlots > 1 {
+            parts.append("--parallel \(server.startupConfiguration.parallelSlots)")
+        }
+        if !server.startupConfiguration.apiKey.isEmpty { parts.append("--api-key <redacted>") }
+
+        return "swift run SwiftLM " + parts.joined(separator: " \\\n  ")
+    }


cliCommand here duplicates the new buildCLICommand(...) helper added in MLXInferenceCore. To avoid the Settings UI and the shared builder drifting (flag defaults, formatting, redaction), consider delegating to buildCLICommand(config:host:port:parallel:apiKeySet:modelId:) instead of maintaining a second copy.

Implemented in 321fc21. SettingsView.cliCommand now delegates to the shared buildCLICommand(...) helper instead of maintaining a second formatter.

Copilot · 2026-05-01T04:49:04Z

+                    Picker("", selection: Binding(
+                        get: { appearance.preference },
+                        set: { newValue in
+                            localColorScheme = newValue
+                            // Defer the @Published write to avoid the view update crash
+                            Task { @MainActor in
+                                appearance.preference = newValue
+                            }
+                        }


The Appearance picker stores changes into localColorScheme, but the picker’s selection getter still reads from appearance.preference. This can cause the UI to momentarily snap back (and localColorScheme is effectively unused). If the goal is to avoid publishing changes during view updates and keep the picker responsive, bind the picker selection to localColorScheme and then asynchronously propagate that value to appearance.preference.

Implemented. The Appearance picker is now bound to local state and asynchronously propagates to appearance.preference, which avoids the snap-back/update-cycle issue.

Copilot AI review requested due to automatic review settings April 30, 2026 17:27

Copilot started reviewing on behalf of solderzzc April 30, 2026 17:27 View session

Copilot AI reviewed Apr 30, 2026

View reviewed changes

github-actions Bot added 5 commits April 30, 2026 10:53

solderzzc requested a review from Copilot April 30, 2026 19:56

Copilot started reviewing on behalf of solderzzc April 30, 2026 19:56 View session

Copilot AI reviewed Apr 30, 2026

View reviewed changes

github-actions Bot added 4 commits April 30, 2026 13:09

solderzzc force-pushed the fix/qwen3-jinja-template-issue-97 branch from e335d1d to 4ac0c23 Compare April 30, 2026 21:09

Add model loading progress for reloads

dcc0a3a

solderzzc requested a review from Copilot May 1, 2026 04:43

Copilot started reviewing on behalf of solderzzc May 1, 2026 04:44 View session

Copilot AI reviewed May 1, 2026

View reviewed changes

Fix persisted SSD streaming behavior

321fc21

solderzzc changed the title ~~fix(inference): resolve Qwen3 TemplateException on multi-turn chat (#97)~~ fix(inference/ui): resolve Qwen3 template issue and harden SwiftBuddy server/settings (#97) May 1, 2026

solderzzc merged commit aaf998a into main May 1, 2026
11 checks passed

solderzzc deleted the fix/qwen3-jinja-template-issue-97 branch May 1, 2026 17:25

Copilot AI mentioned this pull request May 3, 2026

fix: address post-merge PR 99 feedback and tests #101

Merged

	viewModel.config.seed = UInt64.random(in: 0...UInt64.max)
	viewModel.config.seed = UInt64.random(in: 0...UInt64(Int.max))

		let chunk = "{\"id\":\"\(reqId)\",\"object\":\"chat.completion.chunk\",\"created\":\(created),\"model\":\"\(modelId)\",\"choices\":[{\"index\":0,\"delta\":{\"content\":\"\(jsonEscape(token.text))\"},\"finish_reason\":null}]}"
		cont.yield(ByteBuffer(string: "data: \(chunk)\n\n"))

	let body = "{\"object\":\"list\",\"data\":[{\"id\":\"\(modelId)\",\"object\":\"model\",\"owned_by\":\"swiftbuddy\"}]}"
	let body = "{\"object\":\"list\",\"data\":[{\"id\":\(swiftBuddyJSONString(modelId)),\"object\":\"model\",\"owned_by\":\"swiftbuddy\"}]}"

Conversation

solderzzc commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Root Cause

What Changed

Validation

Follow-up

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI May 1, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

solderzzc commented Apr 30, 2026 •

edited

Loading