Conversation
…ssue #97) Two bugs caused every second prompt to fail with 'Jinja.TemplateException error 1' on Qwen3.5-122B-A10B-4bit: 1. Role mapping regression: 'assistant' was being remapped to 'model' (a Gemini-specific alias) before calling applyChatTemplate. Qwen3's Jinja template only accepts 'assistant' — any other value causes TemplateException error 1 on the first multi-turn request. 2. <think> tags leaking into history: when thinking mode is active, the model's reply includes raw <think>…</think> blocks. These were stored verbatim in the conversation history and re-submitted to the Jinja renderer on the next turn, triggering a second crash path. Fix: - Remove the 'assistant' → 'model' remapping entirely. 'assistant' is the correct OpenAI-compatible role name for all non-Gemini models. - Add stripThinkingTags() helper that removes all <think>…</think> spans (including unclosed tags and trailing newlines) from assistant history messages before they enter the chat template. Tests: 12 new cases in ThinkingTagStripTests covering single/multiple/ multiline/unclosed blocks, the exact Issue #97 message shape, and role rawValue regression guards. Fixes #97
There was a problem hiding this comment.
Pull request overview
Fixes multi-turn Qwen3 chat failures by keeping canonical message roles and sanitizing assistant history so Jinja chat templates don’t see invalid role names or raw <think> blocks.
Changes:
- Adds a helper to strip
<think>…</think>blocks from assistant history messages before applying the chat template. - Removes the runtime remapping of
"assistant"→"model"when preparing template messages. - Adds regression tests covering
<think>stripping behavior and role rawValue expectations.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
Sources/MLXInferenceCore/InferenceEngine.swift |
Stops remapping assistant roles and strips <think> tags from assistant history before templating. |
tests/SwiftLMTests/ThinkingTagStripTests.swift |
Adds regression tests for <think> tag stripping and role raw values. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| break | ||
| } | ||
| } | ||
| return result.trimmingCharacters(in: .whitespacesAndNewlines) | ||
| } |
There was a problem hiding this comment.
stripThinkingTags trims .whitespacesAndNewlines even when no <think> tags are present (and it’s called for every .assistant history message). This can unintentionally alter assistant content formatting (e.g., leading indentation/code blocks). Consider only removing the <think>…</think> span(s) and the single newline after </think>, and avoid global trimming (or restrict trimming to newlines introduced by stripping, and only when a tag was actually removed).
| import XCTest | ||
| import Foundation | ||
| @testable import SwiftLM | ||
| import MLXInferenceCore |
There was a problem hiding this comment.
This test target currently imports MLXInferenceCore, but Package.swift declares SwiftLMTests depending only on SwiftLM. If SwiftPM enforces direct dependencies for imports, this will fail with “no such module MLXInferenceCore”. Either add MLXInferenceCore to the SwiftLMTests target dependencies or remove the direct import and reference the needed types via a module that’s already a declared dependency.
| import MLXInferenceCore |
| // ── Mirror of the production helper (InferenceEngine.swift) ─────────────── | ||
| // Keep in sync if the production implementation changes. | ||
|
|
||
| private func stripThinkingTags(from text: String) -> String { | ||
| var result = text | ||
| while let openRange = result.range(of: "<think>") { | ||
| if let closeRange = result.range(of: "</think>", range: openRange.lowerBound..<result.endIndex) { | ||
| var endIdx = closeRange.upperBound | ||
| if endIdx < result.endIndex && result[endIdx] == "\n" { | ||
| endIdx = result.index(after: endIdx) | ||
| } | ||
| result.removeSubrange(openRange.lowerBound..<endIdx) | ||
| } else { | ||
| result.removeSubrange(openRange.lowerBound...) | ||
| break | ||
| } | ||
| } | ||
| return result.trimmingCharacters(in: .whitespacesAndNewlines) | ||
| } |
There was a problem hiding this comment.
These tests validate a locally-copied stripThinkingTags implementation rather than the production helper in InferenceEngine.swift, so they won’t catch regressions if production code changes (or if production behavior diverges). To make the regression coverage meaningful, consider moving the sanitization logic into an internal helper that can be tested via @testable import MLXInferenceCore, or factor message-sanitization into a testable unit that generate() uses.
| // MARK: - 4. Role mapping regression guard (Issue #97) | ||
| // ═══════════════════════════════════════════════════════════════════ | ||
| // The ChatCompletionRequest pipeline in Server.swift passes roles through | ||
| // as-is. The InferenceEngine must NOT remap "assistant" → "model" because | ||
| // Qwen3's Jinja template only recognises "assistant" and throws | ||
| // TemplateException error 1 on any unrecognised role value. | ||
|
|
||
| func testRoleMapping_AssistantRawValue_IsAssistant() { | ||
| // ChatMessage.Role.assistant.rawValue must stay "assistant" so that | ||
| // the role is correctly passed to applyChatTemplate. | ||
| // If someone changes the enum rawValue, this test fails loudly. | ||
| XCTAssertEqual( | ||
| ChatMessage.Role.assistant.rawValue, | ||
| "assistant", | ||
| "Role.assistant rawValue must be 'assistant', not 'model' — Qwen3 Jinja template fix (Issue #97)" | ||
| ) | ||
| } | ||
|
|
||
| func testRoleMapping_AllRolesHaveExpectedRawValues() { |
There was a problem hiding this comment.
The “role mapping regression guard” here only asserts ChatMessage.Role.assistant.rawValue == "assistant", but the reported regression was a runtime remap (if roleRaw == "assistant" { roleRaw = "model" }) inside InferenceEngine.generate(). This test would still pass if that remapping logic were reintroduced. Consider adjusting the test naming/comments to reflect what’s actually covered, or add a test that exercises the message-to-template mapping used by InferenceEngine.generate() so the remap can’t silently return.
| // MARK: - 4. Role mapping regression guard (Issue #97) | |
| // ═══════════════════════════════════════════════════════════════════ | |
| // The ChatCompletionRequest pipeline in Server.swift passes roles through | |
| // as-is. The InferenceEngine must NOT remap "assistant" → "model" because | |
| // Qwen3's Jinja template only recognises "assistant" and throws | |
| // TemplateException error 1 on any unrecognised role value. | |
| func testRoleMapping_AssistantRawValue_IsAssistant() { | |
| // ChatMessage.Role.assistant.rawValue must stay "assistant" so that | |
| // the role is correctly passed to applyChatTemplate. | |
| // If someone changes the enum rawValue, this test fails loudly. | |
| XCTAssertEqual( | |
| ChatMessage.Role.assistant.rawValue, | |
| "assistant", | |
| "Role.assistant rawValue must be 'assistant', not 'model' — Qwen3 Jinja template fix (Issue #97)" | |
| ) | |
| } | |
| func testRoleMapping_AllRolesHaveExpectedRawValues() { | |
| // MARK: - 4. ChatMessage.Role raw value guards (Issue #97 context) | |
| // ═══════════════════════════════════════════════════════════════════ | |
| // These assertions verify the canonical wire-format role strings exposed by | |
| // ChatMessage.Role. They intentionally do NOT exercise any runtime remapping | |
| // inside InferenceEngine.generate(); if a dedicated template-mapping test is | |
| // added elsewhere, it should cover that behavior directly. | |
| func testChatMessageRoleRawValue_Assistant_IsAssistant() { | |
| // Keep the enum's assistant role aligned with the OpenAI-compatible | |
| // protocol string. This protects against changing the rawValue itself, | |
| // but does not verify any runtime role transformation logic. | |
| XCTAssertEqual( | |
| ChatMessage.Role.assistant.rawValue, | |
| "assistant", | |
| "Role.assistant rawValue must be 'assistant', not 'model' — Issue #97 enum raw-value guard" | |
| ) | |
| } | |
| func testChatMessageRoleRawValues_AllRolesMatchProtocolStrings() { |
- Package.swift: add MLXInferenceCore to SwiftLMTests deps so the direct import compiles on CI (was working locally via transitive resolution only) - InferenceEngine.swift: make stripThinkingTags() internal (was private) so @testable import MLXInferenceCore gives tests direct access to production code - InferenceEngine.swift: only trim whitespace when a <think> tag was actually removed; messages without thinking content are returned byte-for-byte so leading indentation / code-block formatting is not altered - ThinkingTagStripTests: remove mirror copy of stripThinkingTags and call the real production function instead; update no-tag test to assert unchanged passthrough; tighten role-guard test comments to accurately describe scope
…unt, add /v1/chat/completions GenerationConfig persistence - Add Codable conformance + save()/load() backed by UserDefaults - ChatViewModel loads persisted config on init; didSet auto-saves on every change - systemPrompt now also persisted via UserDefaults (swiftlm.systemPrompt) - Reset to Defaults triggers didSet, so the reset is persisted too Thinking mode fix (was completely broken) - enable_thinking was never passed to the Jinja chat template - Qwen3's template checks for the 'enable_thinking' kwarg; without it thinking is always off regardless of the UI toggle - Now passes additionalContext: ["enable_thinking": true/false] to UserInput so the template correctly generates <think> blocks when enabled Context window alignment - Replace inaccurate stringLength/3.5 character heuristic with lmInput.text.tokens.shape[0] — the real prefill token count from MLX after container.prepare(). This is accurate for all scripts including CJK and code content. /v1/chat/completions endpoint (SwiftBuddy embedded server) - Add full OpenAI-compatible POST /v1/chat/completions handler - Supports streaming (text/event-stream SSE) and non-streaming modes - Per-request overrides for temperature, top_p, max_tokens, frequency_penalty - Server config starts from persisted GenerationConfig.load() so user settings apply to API calls too - /v1/models now returns the real loaded model ID instead of hardcoded 'local' - Uses AsyncStream<ByteBuffer> + .init(asyncSequence:) — same pattern as the production SwiftLM server
…ettings/thinking/API SettingsView — copyable endpoint card (Engine tab) - Replace plain host:port text with a tappable URL card - Shows Online/Offline dot with glow, full http://host:port in monospace - One-tap copy: doc.on.doc icon → checkmark for 2s, works on macOS + iOS - When online: shows 'Compatible with OpenAI SDK, LM Studio, Continue, Cursor' - Green border glow when server is live GenerationConfigPersistenceTests (20 new tests) - Codable round-trip: all fields including nil kvBits - Default values guard: prevents silently changing defaults - Save/load persistence contract via JSONEncoder/Decoder - Thinking mode: enable_thinking additionalContext mapping for both true/false - Codable survival of enableThinking toggle - Chat endpoint message mapping: system/user/assistant/unknown/missing content - Per-request override application and non-interference with other fields - stream flag defaulting to false per OpenAI spec
…config fields streamExperts / turboKV removed from GenerationConfig - Both were architecturally dead: streamExperts is auto-activated at load time via ModelCatalog.isMoE; turboKV had no downstream wiring in GenerateParameters or the mlx-lm call chain - Engine tab now shows an 'Advanced Engine' info card explaining SSD streaming is automatic for MoE models and directing users to kvBits for cache quantisation seed wired end-to-end - MLX.seed(seed) called before container.prepare() in generate() - Seed UI in Output card: lock icon to fix a seed, xmark to go random - Fixed seed shows 'same input → identical output' hint Settings applied toast (Generation tab) - .onChange watchers on all 10 config fields flash a green 'Applied — takes effect on next message' capsule for 2s - Makes clear no restart is needed: params are hot-applied per request CLI Equivalent card (Engine tab) - Computes the equivalent `swift run SwiftLM` command from live settings - Only emits non-default flags (keeps command readable) - Tap to copy; checkmark confirmation for 2s; horizontally scrollable - Shows real loaded model ID when available iOS Performance card fixed - Was displaced outside #if os(iOS) guard by previous edit
Comment 1 (InferenceEngine.swift:525) — already fixed: stripThinkingTags only trims whitespace when a tag was actually removed (guarded by the 'stripped' flag), so untouched assistant messages keep original formatting. Comment 2 (ThinkingTagStripTests.swift:15) — already fixed: MLXInferenceCore is a declared SwiftLMTests dependency; tests use @testable import MLXInferenceCore against the real module. Comment 3 (ThinkingTagStripTests.swift:37) — already fixed: Tests call the production stripThinkingTags() function directly, not a local copy. Comment 4 (ThinkingTagStripTests.swift:150) — fixed here: Added testRoleMapping_AssistantProducesAssistantNotModel_InWireDict() which replicates the exact message-dict build path from generate() and asserts ['role': 'assistant'] not ['role': 'model'], so the Issue #97 runtime remap cannot silently return without test failure. Also added testRoleMapping_ToolRoleIsPreservedInWireDict(). Also fixes: - SettingsView: string literal escaping in cliCommand separator - SettingsView: srv.parallelSlots → srv.startupConfiguration.parallelSlots
…d fields guard Context clarification - The production SwiftLM Server.swift /v1/chat/completions is what OpenCode uses and is already exercised by ChatRequestParsingTests + ServerSSETests. - PR #99 added a SECOND /v1/chat/completions inside the SwiftBuddy embedded server (ServerManager.swift). These tests cover that new path. New: SwiftBuddyServerTests (13 tests) /v1/models response shape - testModelsResponse_MatchesOpenAISchema: object/data/id/object fields - testModelsResponse_FallsBackToLocalWhenNoModelLoaded SwiftBuddy SSE delta wire format - testSSEDeltaChunk_HasCorrectPrefix: 'data: ' prefix + CRLF CRLF suffix - testSSEDeltaChunk_JSONShape: object/id/model/choices/delta structure - testSSEDeltaChunk_EscapesSpecialCharacters: newlines in content - testSSEDoneTerminator_Format: 'data: [DONE]\r\n\r\n' - testSSEDeltaChunk_FinishReasonNull_DuringStreaming - testSSEDeltaChunk_FinishReasonStop_AtEnd CLI command builder (buildCLICommand extracted to MLXInferenceCore) - testCLIBuilder_DefaultsOmitNonDefaultFlags - testCLIBuilder_NonDefaultsFlagsEmitted - testCLIBuilder_NoModelId_UsesPlaceholder - testCLIBuilder_KvBitsDefault_DoesNotEmitGroupSize - testCLIBuilder_OutputStartsWithSwiftRunSwiftLM New: GenerationConfigPersistenceTests +1 - testGenerationConfig_RemovedFields_AbsentFromJSON: verifies turboKV and streamExperts are not in the Codable schema, preventing silent re-addition of dead fields Refactor: SettingsView.cliCommand → buildCLICommand() - Extracted 50-line inline compute to MLXInferenceCore/CLICommandBuilder.swift - SettingsView now delegates to buildCLICommand() — pure, testable function - No behaviour change
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 15 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // stripThinkingTags is private at file scope in InferenceEngine.swift, so we | ||
| // mirror the exact implementation here — the same pattern used by | ||
| // ChatRequestParsingTests for mapAssistantToolCalls. | ||
|
|
||
| import XCTest | ||
| import Foundation | ||
| @testable import SwiftLM | ||
| import MLXInferenceCore | ||
|
|
||
| final class ThinkingTagStripTests: XCTestCase { | ||
|
|
||
| // ── Mirror of the production helper (InferenceEngine.swift) ─────────────── | ||
| // Keep in sync if the production implementation changes. | ||
|
|
||
| private func stripThinkingTags(from text: String) -> String { | ||
| var result = text | ||
| while let openRange = result.range(of: "<think>") { | ||
| if let closeRange = result.range(of: "</think>", range: openRange.lowerBound..<result.endIndex) { | ||
| var endIdx = closeRange.upperBound | ||
| if endIdx < result.endIndex && result[endIdx] == "\n" { | ||
| endIdx = result.index(after: endIdx) | ||
| } | ||
| result.removeSubrange(openRange.lowerBound..<endIdx) | ||
| } else { | ||
| result.removeSubrange(openRange.lowerBound...) | ||
| break | ||
| } | ||
| } | ||
| return result.trimmingCharacters(in: .whitespacesAndNewlines) | ||
| } | ||
|
|
There was a problem hiding this comment.
The comment says stripThinkingTags is private in production, but in InferenceEngine.swift it’s currently a module-internal top-level function. Instead of mirroring the implementation in tests (which can drift), consider @testable import MLXInferenceCore and calling the production stripThinkingTags(from:) directly.
| // stripThinkingTags is private at file scope in InferenceEngine.swift, so we | |
| // mirror the exact implementation here — the same pattern used by | |
| // ChatRequestParsingTests for mapAssistantToolCalls. | |
| import XCTest | |
| import Foundation | |
| @testable import SwiftLM | |
| import MLXInferenceCore | |
| final class ThinkingTagStripTests: XCTestCase { | |
| // ── Mirror of the production helper (InferenceEngine.swift) ─────────────── | |
| // Keep in sync if the production implementation changes. | |
| private func stripThinkingTags(from text: String) -> String { | |
| var result = text | |
| while let openRange = result.range(of: "<think>") { | |
| if let closeRange = result.range(of: "</think>", range: openRange.lowerBound..<result.endIndex) { | |
| var endIdx = closeRange.upperBound | |
| if endIdx < result.endIndex && result[endIdx] == "\n" { | |
| endIdx = result.index(after: endIdx) | |
| } | |
| result.removeSubrange(openRange.lowerBound..<endIdx) | |
| } else { | |
| result.removeSubrange(openRange.lowerBound...) | |
| break | |
| } | |
| } | |
| return result.trimmingCharacters(in: .whitespacesAndNewlines) | |
| } | |
| // These tests call the production stripThinkingTags(from:) helper directly via | |
| // @testable import so they stay aligned with the implementation in | |
| // MLXInferenceCore. | |
| import XCTest | |
| import Foundation | |
| @testable import SwiftLM | |
| @testable import MLXInferenceCore | |
| final class ThinkingTagStripTests: XCTestCase { |
| @Published var config: GenerationConfig = .load() { | ||
| didSet { config.save() } | ||
| } |
There was a problem hiding this comment.
Persisting GenerationConfig on every didSet can cause a lot of synchronous UserDefaults writes while the user drags sliders (temperature/topP/etc.), which may lead to UI jank. Consider debouncing saves (e.g., save after a short idle delay) or saving only on explicit apply / view dismissal.
| @Published var systemPrompt: String = { | ||
| UserDefaults.standard.string(forKey: "swiftlm.systemPrompt") ?? "" | ||
| }() { | ||
| didSet { UserDefaults.standard.set(systemPrompt, forKey: "swiftlm.systemPrompt") } | ||
| } |
There was a problem hiding this comment.
systemPrompt is persisted to UserDefaults on every didSet (i.e., every keystroke in the TextEditor), which can be unnecessarily expensive and may impact typing responsiveness. Consider debouncing this write (e.g., save after a short delay) or persisting on commit / view exit.
| .foregroundStyle(SwiftBuddyTheme.textTertiary) | ||
| .font(.callout) | ||
| Button { | ||
| viewModel.config.seed = UInt64.random(in: 0...UInt64.max) |
There was a problem hiding this comment.
This generates seeds across the full UInt64 range, but the UI Stepper currently converts the seed to Int. If a random seed exceeds Int.max, the view will crash when rendering the Stepper binding. Consider generating within 0...UInt64(Int.max) (or updating the UI to avoid Int conversion).
| viewModel.config.seed = UInt64.random(in: 0...UInt64.max) | |
| viewModel.config.seed = UInt64.random(in: 0...UInt64(Int.max)) |
| if streamRequested { | ||
| // ── SSE streaming ─────────────────────────────────── | ||
| var sseHeaders = HTTPFields() | ||
| sseHeaders.append(HTTPField(name: .contentType, value: "text/event-stream; charset=utf-8")) | ||
| sseHeaders.append(HTTPField(name: HTTPField.Name("Cache-Control")!, value: "no-cache")) |
There was a problem hiding this comment.
parallelSlots is part of the server configuration, but this handler doesn’t apply any concurrency limiting around generation. Without gating, multiple requests can run generation concurrently and overwhelm GPU/memory. Consider using an AsyncSemaphore (capacity = configuration.parallelSlots) around the generation loop, similar to Sources/SwiftLM/Server.swift.
| s.replacingOccurrences(of: "\\", with: "\\\\") | ||
| .replacingOccurrences(of: "\"", with: "\\\"") | ||
| .replacingOccurrences(of: "\n", with: "\\n") | ||
| .replacingOccurrences(of: "\r", with: "\\r") | ||
| .replacingOccurrences(of: "\t", with: "\\t") |
There was a problem hiding this comment.
jsonEscape(_:) doesn’t escape all JSON control characters (e.g., U+0000…U+001F beyond \n/\r/\t), so certain model outputs can still produce invalid JSON. Consider switching to JSONEncoder for responses (or reuse swiftBuddyJSONString(_) for string fields) to guarantee correct escaping.
There was a problem hiding this comment.
Implemented. jsonEscape(_:) now uses JSONEncoder, so control characters beyond \n, \r, and \t are encoded correctly and token/full-response JSON stays valid.
| let chunk = "{\"id\":\"\(reqId)\",\"object\":\"chat.completion.chunk\",\"created\":\(created),\"model\":\"\(modelId)\",\"choices\":[{\"index\":0,\"delta\":{\"content\":\"\(jsonEscape(token.text))\"},\"finish_reason\":null}]}" | ||
| cont.yield(ByteBuffer(string: "data: \(chunk)\n\n")) |
There was a problem hiding this comment.
Even with jsonEscape, modelId is interpolated into the SSE chunk without escaping. If the model id contains quotes/control chars, the chunk JSON becomes invalid. Prefer encoding the whole chunk with JSONEncoder (or at least JSON-escape modelId).
There was a problem hiding this comment.
Implemented. The SSE path now uses an escaped model id instead of interpolating the raw value, so quoted or control-character-containing ids can no longer break the chunk JSON.
| private func flashApplied() { | ||
| withAnimation { showAppliedBadge = true } | ||
| DispatchQueue.main.asyncAfter(deadline: .now() + 2) { | ||
| withAnimation { showAppliedBadge = false } | ||
| } |
There was a problem hiding this comment.
flashApplied() schedules a hide 2s later every time it’s called. Rapid successive changes (e.g., dragging sliders) can stack multiple delayed hides and cause flicker or early disappearance. Consider debouncing/canceling the previously scheduled hide work item before scheduling a new one.
There was a problem hiding this comment.
Implemented. flashApplied() now cancels any pending hide work item before scheduling a new one, so repeated slider or toggle updates do not stack delayed hides and flicker.
| case .ready(let id): modelId = id | ||
| default: modelId = "none" | ||
| } | ||
| let body = "{\"object\":\"list\",\"data\":[{\"id\":\"\(modelId)\",\"object\":\"model\",\"owned_by\":\"swiftbuddy\"}]}" |
There was a problem hiding this comment.
The /v1/models response is built via string interpolation without JSON-escaping modelId. If it contains quotes/control characters, this becomes invalid JSON (and can enable response injection). Prefer using swiftBuddyJSONString(modelId) / JSONEncoder to safely encode the string field.
| let body = "{\"object\":\"list\",\"data\":[{\"id\":\"\(modelId)\",\"object\":\"model\",\"owned_by\":\"swiftbuddy\"}]}" | |
| let body = "{\"object\":\"list\",\"data\":[{\"id\":\(swiftBuddyJSONString(modelId)),\"object\":\"model\",\"owned_by\":\"swiftbuddy\"}]}" |
There was a problem hiding this comment.
Implemented. /v1/models now JSON-encodes the model id rather than interpolating it raw, which closes the invalid JSON / response-injection edge case.
| // Seed — optional reproducibility | ||
| HStack { | ||
| Label("Seed", systemImage: "number") | ||
| .foregroundStyle(SwiftBuddyTheme.textPrimary) | ||
| .font(.callout) |
There was a problem hiding this comment.
PR description focuses on fixing Qwen3 TemplateException + thinking-tag sanitization, but this PR also introduces additional user-facing features (seed control, applied badge, endpoint/CLI copy UI) and adds a /v1/chat/completions implementation + config persistence. If intentional, consider reflecting this scope in the PR description/title (or splitting into separate PRs) so reviewers can assess risk and release notes accurately.
There was a problem hiding this comment.
Agreed on scope. I kept the code changes on the branch, and I am tightening the PR description so the summary reflects the added server/settings work as well.
buildCLICommand() lives in MLXInferenceCore which is linked to the SwiftBuddy app target, but the SwiftBuddy Xcode target does not pick up new source files added to a local package without a package resolve. Fix: inline the equivalent logic directly in SettingsView.cliCommand. The public buildCLICommand() in MLXInferenceCore is retained for unit tests (SwiftBuddyServerTests) — the two implementations stay in sync by the test suite asserting the same flag-emission rules.
Problem 1: SSD Streaming and TurboKV not controllable
turboKV was removed prematurely — KVCacheSimple.turboQuantEnabled IS
a real, fully-wired path (same as Server.swift line 1541-1546).
streamExperts was removed, but the standalone server exposes --stream-experts
as a deliberate CLI flag for users to control on any model.
Fix:
- Restore turboKV to GenerationConfig (per-request, sets turboQuantEnabled
on every KVCacheSimple layer via container.perform before generate())
- Restore streamExperts to GenerationConfig (load-time preference; MoE
catalog models still default ON, but user can now override both ways)
- InferenceEngine.loadVerifiedModel(): shouldStream = isMoE || config.streamExperts
- UI: replace static info-only 'Advanced Engine' card with real toggles:
TurboKV toggle (instant, no reload)
SSD Expert Streaming toggle + inline 'Reload model' prompt when changed
Problem 2: Context window label confusion
Settings 'Max Tokens: 2048' = max OUTPUT tokens per response
Chat 'Context: X / 256K' = model's KV cache capacity from config.json
These are completely different things. The label was causing user confusion.
Fix:
- Rename slider label to 'Max Response Tokens'
- Hint now shows the model's actual context window size inline:
'Max output per reply. Model context window: 262K tokens'
Tests: testGenerationConfig_RestoredFields_PresentWithCorrectDefaults()
Updated to verify turboKV and streamExperts are present in schema
with correct defaults (false = user opt-in)
Critical fixes (crashes / invalid JSON / injection): C1/C2 — Seed UInt64 overflow crash (SettingsView.swift) Random generation clamped to 0...UInt64(Int.max) Stepper get: uses min(seed, UInt64(Int.max)) to prevent Int overflow trap C3 — jsonEscape misses U+0000–U+001F control chars (ServerManager.swift) Replaced 5-line manual replace chain with JSONEncoder-based escape. JSONEncoder guarantees ALL control chars are safely encoded per RFC 8259. C4 — Raw modelId interpolated in SSE chunks (ServerManager.swift) escapedModelId = swiftBuddyJSONString(modelId) computed once, used in both streaming (SSE chunk) and non-streaming (response body) paths. C5 — Raw modelId interpolated in /v1/models (ServerManager.swift) Now uses swiftBuddyJSONString(modelId) — same JSONEncoder-backed helper already used for the /health route host field. Medium fixes (correctness / UX): M1 — tool/developer roles dropped (ServerManager.swift) tool → .tool (required for OpenAI function-calling protocol) developer → .system (OpenAI Responses API convention) Unknown roles still fall through to .user (safe default, not rejected) M2 — Toast flicker on rapid slider drag (SettingsView.swift) flashApplied() now cancels the previous DispatchWorkItem before scheduling a new delayed hide, preventing stacked closures from firing in rapid succession. M3/M4 — UserDefaults saturated during slider drag (ChatViewModel.swift) config.save() and systemPrompt persist debounced at 0.5 s via DispatchWorkItem cancel+reschedule, eliminating write pressure during continuous slider movement and keystroke input. L1 — Doc comment said UInt32, impl uses UInt64 (GenerationConfig.swift) Corrected to match the actual MLX.seed(UInt64) call site. New tests (SwiftBuddyServerTests — 101 tests, was 91): testJsonEscape_BasicChars testJsonEscape_ControlCharsU0000toU001F testJsonEscape_ProducesValidJSONWhenInterpolated testModelsResponse_ModelIdWithQuotes_IsJsonSafe testModelsResponse_SlashInModelId_IsSafe testSeed_RandomIsWithinIntMax (1000 iterations) testSeed_StepperBinding_ClampsSafely testRoleMapping_ToolRoleMapsToChatMessageTool testRoleMapping_DeveloperRoleMapsToSystem testRoleMapping_UnknownRoleFallsToUser
…icker The crash 'Publishing changes from within view updates is not allowed' was occurring because the appearance.preference @published property was being mutated directly by a Picker inside a ScrollView during SwiftUI's layout pass. Fixes: 1. Extracted Color Scheme settings into a dedicated Appearance tab to isolate it from the Engine tab's layout cycle. 2. Implemented a custom Binding in the Picker that defers the @published write using Task { @mainactor in }. This explicitly breaks out of the current view update pass before mutating the AppearanceStore.
e335d1d to
4ac0c23
Compare
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
SwiftBuddy/SwiftBuddy/ViewModels/ServerManager.swift:266
ServerManager.start()stores an unstructuredTask { ... }and then mutates@MainActor-isolated state (isOnline,host,port, etc.) inside that task. SinceTask { ... }does not inherit the MainActor, this will either fail to compile under strict concurrency or produce actor-isolation violations at runtime. Update the task to hop back to the MainActor for these assignments (e.g.,await MainActor.run { ... }) or create the task asTask { @MainActor in ... }and move the server run off-main if needed.
let app = Application(
router: router,
configuration: .init(address: .hostname(configuration.host, port: configuration.port))
)
self.isOnline = true
self.host = configuration.host
self.port = configuration.port
self.runningConfiguration = configuration
self.restartRequired = false
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| func testStrip_NoThinkBlock_ReturnsTrimmedOriginal() { | ||
| let input = " Hello, how can I help? " | ||
| XCTAssertEqual(stripThinkingTags(from: input), "Hello, how can I help?") | ||
| } |
There was a problem hiding this comment.
This test asserts trimming behavior when there is no <think> tag present, but the production helper intentionally preserves messages byte-for-byte when no tag was removed (to avoid breaking indentation/code blocks). Once the test is wired to the production implementation, this expectation should be updated to reflect the intended whitespace-preservation semantics.
There was a problem hiding this comment.
Implemented. The test now uses the production stripThinkingTags helper via @testable import MLXInferenceCore, and the no-<think> case asserts the original string is preserved unchanged rather than trimmed.
| // SSD expert streaming: | ||
| // - MoE catalog models default ON (required to fit in RAM) | ||
| // - User can override via GenerationConfig.streamExperts for custom/non-catalog models | ||
| // - isMoE acts as the default; user toggle overrides both ways | ||
| let shouldStream = isMoE || GenerationConfig.load().streamExperts | ||
| if shouldStream { |
There was a problem hiding this comment.
The comment/docs indicate streamExperts can override the MoE default (including force-disable), but shouldStream = isMoE || GenerationConfig.load().streamExperts makes streaming impossible to turn off for MoE models. Either adjust the logic so the persisted user setting is authoritative after the initial defaulting, or update the docs/UI to reflect that MoE streaming is always-on.
There was a problem hiding this comment.
Implemented in 321fc21. I added persisted-config awareness so MoE models default SSD streaming on only before the user has saved a preference; after that, the saved setting is authoritative and can force-disable streaming.
| /// Build the equivalent `swift run SwiftLM` command from current settings. | ||
| private var cliCommand: String { | ||
| let cfg = viewModel.config | ||
| var parts: [String] = [] | ||
|
|
||
| if case .ready(let id) = engine.state { | ||
| parts.append("--model \(id)") | ||
| } else { | ||
| parts.append("--model <model-id>") | ||
| } | ||
|
|
||
| parts.append("--host \(server.host)") | ||
| parts.append("--port \(server.port)") | ||
| parts.append("--max-tokens \(cfg.maxTokens)") | ||
| parts.append("--temp \(String(format: "%.2f", cfg.temperature))") | ||
|
|
||
| if cfg.topP < 1.0 { parts.append("--top-p \(String(format: "%.2f", cfg.topP))") } | ||
| if cfg.topK != 50 { parts.append("--top-k \(cfg.topK)") } | ||
| if cfg.minP > 0 { parts.append("--min-p \(String(format: "%.2f", cfg.minP))") } | ||
| if cfg.repetitionPenalty != 1.05 { parts.append("--repeat-penalty \(String(format: "%.2f", cfg.repetitionPenalty))") } | ||
| if cfg.prefillSize != 512 { parts.append("--prefill-size \(cfg.prefillSize)") } | ||
| if let kv = cfg.kvBits { | ||
| parts.append("--kv-bits \(kv)") | ||
| if cfg.kvGroupSize != 64 { parts.append("--kv-group-size \(cfg.kvGroupSize)") } | ||
| } | ||
| if cfg.enableThinking { parts.append("--thinking") } | ||
| if let seed = cfg.seed { parts.append("--seed \(seed)") } | ||
| if server.startupConfiguration.parallelSlots > 1 { | ||
| parts.append("--parallel \(server.startupConfiguration.parallelSlots)") | ||
| } | ||
| if !server.startupConfiguration.apiKey.isEmpty { parts.append("--api-key <redacted>") } | ||
|
|
||
| return "swift run SwiftLM " + parts.joined(separator: " \\\n ") | ||
| } |
There was a problem hiding this comment.
cliCommand here duplicates the new buildCLICommand(...) helper added in MLXInferenceCore. To avoid the Settings UI and the shared builder drifting (flag defaults, formatting, redaction), consider delegating to buildCLICommand(config:host:port:parallel:apiKeySet:modelId:) instead of maintaining a second copy.
There was a problem hiding this comment.
Implemented in 321fc21. SettingsView.cliCommand now delegates to the shared buildCLICommand(...) helper instead of maintaining a second formatter.
| Picker("", selection: Binding( | ||
| get: { appearance.preference }, | ||
| set: { newValue in | ||
| localColorScheme = newValue | ||
| // Defer the @Published write to avoid the view update crash | ||
| Task { @MainActor in | ||
| appearance.preference = newValue | ||
| } | ||
| } |
There was a problem hiding this comment.
The Appearance picker stores changes into localColorScheme, but the picker’s selection getter still reads from appearance.preference. This can cause the UI to momentarily snap back (and localColorScheme is effectively unused). If the goal is to avoid publishing changes during view updates and keep the picker responsive, bind the picker selection to localColorScheme and then asynchronously propagate that value to appearance.preference.
There was a problem hiding this comment.
Implemented. The Appearance picker is now bound to local state and asynchronously propagates to appearance.preference, which avoids the snap-back/update-cycle issue.
Summary
Closes #97.
This PR fixes the Qwen3 multi-turn
Jinja.TemplateExceptionfailure and also lands the SwiftBuddy-side plumbing that was added alongside that fix: request/config persistence cleanup, safer embedded server JSON handling, and settings UX updates.Problem
On Qwen3.5-122B-A10B-4bit and other thinking-capable models, every second prompt could fail with:
At the same time, SwiftBuddy had a few adjacent issues in the same flow:
/v1/*responses were built with unsafe string interpolationRoot Cause
Two independent bugs in
InferenceEngine.generate()caused the Qwen3 failure:assistanthistory entries were being remapped tomodel, which Qwen3's Jinja template does not accept.<think>...</think>blocks were stored verbatim in assistant history and fed back into the template on later turns.What Changed
assistant -> modelremapping inInferenceEngine.generate().stripThinkingTags()to sanitize assistant history beforeapplyChatTemplate./v1/models, streaming SSE chunks, and chat responses.flashApplied()so repeated interactions do not stack delayed hides and flicker.buildCLICommand(...)helper from settings instead of duplicating CLI formatting.onfor MoE models only until the user has saved a preference; after that, the persisted setting becomes authoritative.Validation
swift build --target MLXInferenceCore --target SwiftLMThinkingTagStripTestsupdated to exercise the production helper semanticsFollow-up
The SwiftBuddy UI still renders thinking content directly from
ChatMessage.content. Splitting visible answer content from stored thinking content remains a separate UI-layer follow-up.