
fix(inference/ui): resolve Qwen3 template issue and harden SwiftBuddy server/settings (#97)#99

Merged
solderzzc merged 13 commits into main from fix/qwen3-jinja-template-issue-97
May 1, 2026

Conversation

Member

@solderzzc solderzzc commented Apr 30, 2026

Summary

Closes #97.

This PR fixes the Qwen3 multi-turn Jinja.TemplateException failure and also lands the SwiftBuddy-side plumbing that was added alongside that fix: request/config persistence cleanup, safer embedded server JSON handling, and settings UX updates.

Problem

On Qwen3.5-122B-A10B-4bit and other thinking-capable models, every second prompt could fail with:

[Error: The operation couldn't be completed. (Jinja.TemplateException error 1.)]

At the same time, SwiftBuddy had a few adjacent issues in the same flow:

  • thinking content could leak back into prompt history
  • embedded /v1/* responses were built with unsafe string interpolation
  • settings/UI state had drift and debounce issues
  • SSD expert streaming for MoE models could not be explicitly disabled; the MoE default always overrode the persisted setting

Root Cause

Two independent bugs in InferenceEngine.generate() caused the Qwen3 failure:

  1. assistant history entries were being remapped to model, which Qwen3's Jinja template does not accept.
  2. <think>...</think> blocks were stored verbatim in assistant history and fed back into the template on later turns.

What Changed

  • Removed the assistant -> model remapping in InferenceEngine.generate().
  • Added stripThinkingTags() to sanitize assistant history before applyChatTemplate (call site sketched after this list).
  • Added regression coverage for Issue #97 ("SwiftBuddy Chat: [Error: The operation couldn't be completed. (Jinja.TemplateException error 1.)]"), including whitespace-preservation behavior and runtime role-shape guards.
  • Hardened SwiftBuddy embedded server JSON generation for /v1/models, streaming SSE chunks, and chat responses.
  • Fixed flashApplied() so repeated interactions do not stack delayed hides and flicker.
  • Bound the Appearance picker through local view state before async propagation to avoid update-cycle snap-back.
  • Reused the shared buildCLICommand(...) helper from settings instead of duplicating CLI formatting.
  • Made SSD expert streaming default to on for MoE models only until the user has saved a preference; after that, the persisted setting becomes authoritative.

Validation

  • swift build --target MLXInferenceCore --target SwiftLM
  • ThinkingTagStripTests updated to exercise the production helper semantics

Follow-up

The SwiftBuddy UI still renders thinking content directly from ChatMessage.content. Splitting visible answer content from stored thinking content remains a separate UI-layer follow-up.

…ssue #97)

Two bugs caused every second prompt to fail with
'Jinja.TemplateException error 1' on Qwen3.5-122B-A10B-4bit:

1. Role mapping regression: 'assistant' was being remapped to 'model'
   (a Gemini-specific alias) before calling applyChatTemplate.
   Qwen3's Jinja template only accepts 'assistant' — any other value
   causes TemplateException error 1 on the first multi-turn request.

2. <think> tags leaking into history: when thinking mode is active,
   the model's reply includes raw <think>…</think> blocks. These were
   stored verbatim in the conversation history and re-submitted to the
   Jinja renderer on the next turn, triggering a second crash path.

Fix:
- Remove the 'assistant' → 'model' remapping entirely. 'assistant'
  is the correct OpenAI-compatible role name for all non-Gemini models.
- Add stripThinkingTags() helper that removes all <think>…</think>
  spans (including unclosed tags and trailing newlines) from assistant
  history messages before they enter the chat template.

Tests: 12 new cases in ThinkingTagStripTests covering single/multiple/
multiline/unclosed blocks, the exact Issue #97 message shape, and
role rawValue regression guards.

Fixes #97
Copilot AI review requested due to automatic review settings April 30, 2026 17:27
Contributor

Copilot AI left a comment

Pull request overview

Fixes multi-turn Qwen3 chat failures by keeping canonical message roles and sanitizing assistant history so Jinja chat templates don’t see invalid role names or raw <think> blocks.

Changes:

  • Adds a helper to strip <think>…</think> blocks from assistant history messages before applying the chat template.
  • Removes the runtime remapping of "assistant" → "model" when preparing template messages.
  • Adds regression tests covering <think> stripping behavior and role rawValue expectations.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Reviewed files:
  • Sources/MLXInferenceCore/InferenceEngine.swift: stops remapping assistant roles and strips <think> tags from assistant history before templating.
  • tests/SwiftLMTests/ThinkingTagStripTests.swift: adds regression tests for <think> tag stripping and role raw values.


Comment on lines +513 to +517
            break
        }
    }
    return result.trimmingCharacters(in: .whitespacesAndNewlines)
}

Copilot AI Apr 30, 2026

stripThinkingTags trims .whitespacesAndNewlines even when no <think> tags are present (and it’s called for every .assistant history message). This can unintentionally alter assistant content formatting (e.g., leading indentation/code blocks). Consider only removing the <think>…</think> span(s) and the single newline after </think>, and avoid global trimming (or restrict trimming to newlines introduced by stripping, and only when a tag was actually removed).

import XCTest
import Foundation
@testable import SwiftLM
import MLXInferenceCore

Copilot AI Apr 30, 2026

This test target currently imports MLXInferenceCore, but Package.swift declares SwiftLMTests depending only on SwiftLM. If SwiftPM enforces direct dependencies for imports, this will fail with “no such module MLXInferenceCore”. Either add MLXInferenceCore to the SwiftLMTests target dependencies or remove the direct import and reference the needed types via a module that’s already a declared dependency.

Suggested change
import MLXInferenceCore

Comment on lines +19 to +37
// ── Mirror of the production helper (InferenceEngine.swift) ───────────────
// Keep in sync if the production implementation changes.

private func stripThinkingTags(from text: String) -> String {
    var result = text
    while let openRange = result.range(of: "<think>") {
        if let closeRange = result.range(of: "</think>", range: openRange.lowerBound..<result.endIndex) {
            var endIdx = closeRange.upperBound
            if endIdx < result.endIndex && result[endIdx] == "\n" {
                endIdx = result.index(after: endIdx)
            }
            result.removeSubrange(openRange.lowerBound..<endIdx)
        } else {
            result.removeSubrange(openRange.lowerBound...)
            break
        }
    }
    return result.trimmingCharacters(in: .whitespacesAndNewlines)
}

Copilot AI Apr 30, 2026

These tests validate a locally-copied stripThinkingTags implementation rather than the production helper in InferenceEngine.swift, so they won’t catch regressions if production code changes (or if production behavior diverges). To make the regression coverage meaningful, consider moving the sanitization logic into an internal helper that can be tested via @testable import MLXInferenceCore, or factor message-sanitization into a testable unit that generate() uses.

Comment on lines +132 to +150
// MARK: - 4. Role mapping regression guard (Issue #97)
// ═══════════════════════════════════════════════════════════════════
// The ChatCompletionRequest pipeline in Server.swift passes roles through
// as-is. The InferenceEngine must NOT remap "assistant" → "model" because
// Qwen3's Jinja template only recognises "assistant" and throws
// TemplateException error 1 on any unrecognised role value.

func testRoleMapping_AssistantRawValue_IsAssistant() {
    // ChatMessage.Role.assistant.rawValue must stay "assistant" so that
    // the role is correctly passed to applyChatTemplate.
    // If someone changes the enum rawValue, this test fails loudly.
    XCTAssertEqual(
        ChatMessage.Role.assistant.rawValue,
        "assistant",
        "Role.assistant rawValue must be 'assistant', not 'model' — Qwen3 Jinja template fix (Issue #97)"
    )
}

func testRoleMapping_AllRolesHaveExpectedRawValues() {

Copilot AI Apr 30, 2026

The “role mapping regression guard” here only asserts ChatMessage.Role.assistant.rawValue == "assistant", but the reported regression was a runtime remap (if roleRaw == "assistant" { roleRaw = "model" }) inside InferenceEngine.generate(). This test would still pass if that remapping logic were reintroduced. Consider adjusting the test naming/comments to reflect what’s actually covered, or add a test that exercises the message-to-template mapping used by InferenceEngine.generate() so the remap can’t silently return.

Suggested change
// MARK: - 4. Role mapping regression guard (Issue #97)
// ═══════════════════════════════════════════════════════════════════
// The ChatCompletionRequest pipeline in Server.swift passes roles through
// as-is. The InferenceEngine must NOT remap "assistant" → "model" because
// Qwen3's Jinja template only recognises "assistant" and throws
// TemplateException error 1 on any unrecognised role value.
func testRoleMapping_AssistantRawValue_IsAssistant() {
    // ChatMessage.Role.assistant.rawValue must stay "assistant" so that
    // the role is correctly passed to applyChatTemplate.
    // If someone changes the enum rawValue, this test fails loudly.
    XCTAssertEqual(
        ChatMessage.Role.assistant.rawValue,
        "assistant",
        "Role.assistant rawValue must be 'assistant', not 'model' — Qwen3 Jinja template fix (Issue #97)"
    )
}
func testRoleMapping_AllRolesHaveExpectedRawValues() {

// MARK: - 4. ChatMessage.Role raw value guards (Issue #97 context)
// ═══════════════════════════════════════════════════════════════════
// These assertions verify the canonical wire-format role strings exposed by
// ChatMessage.Role. They intentionally do NOT exercise any runtime remapping
// inside InferenceEngine.generate(); if a dedicated template-mapping test is
// added elsewhere, it should cover that behavior directly.
func testChatMessageRoleRawValue_Assistant_IsAssistant() {
    // Keep the enum's assistant role aligned with the OpenAI-compatible
    // protocol string. This protects against changing the rawValue itself,
    // but does not verify any runtime role transformation logic.
    XCTAssertEqual(
        ChatMessage.Role.assistant.rawValue,
        "assistant",
        "Role.assistant rawValue must be 'assistant', not 'model' — Issue #97 enum raw-value guard"
    )
}
func testChatMessageRoleRawValues_AllRolesMatchProtocolStrings() {

- Package.swift: add MLXInferenceCore to SwiftLMTests deps so the direct
  import compiles on CI (was working locally via transitive resolution only)
- InferenceEngine.swift: make stripThinkingTags() internal (was private) so
  @testable import MLXInferenceCore gives tests direct access to production code
- InferenceEngine.swift: only trim whitespace when a <think> tag was actually
  removed; messages without thinking content are returned byte-for-byte so
  leading indentation / code-block formatting is not altered
- ThinkingTagStripTests: remove mirror copy of stripThinkingTags and call the
  real production function instead; update no-tag test to assert unchanged
  passthrough; tighten role-guard test comments to accurately describe scope
…unt, add /v1/chat/completions

GenerationConfig persistence
- Add Codable conformance + save()/load() backed by UserDefaults
- ChatViewModel loads persisted config on init; didSet auto-saves on every change
- systemPrompt now also persisted via UserDefaults (swiftlm.systemPrompt)
- Reset to Defaults triggers didSet, so the reset is persisted too

Thinking mode fix (was completely broken)
- enable_thinking was never passed to the Jinja chat template
- Qwen3's template checks for the 'enable_thinking' kwarg; without it
  thinking is always off regardless of the UI toggle
- Now passes additionalContext: ["enable_thinking": true/false] to UserInput
  so the template correctly generates <think> blocks when enabled

Context window alignment
- Replace inaccurate stringLength/3.5 character heuristic with
  lmInput.text.tokens.shape[0] — the real prefill token count from MLX
  after container.prepare(). This is accurate for all scripts including
  CJK and code content.

/v1/chat/completions endpoint (SwiftBuddy embedded server)
- Add full OpenAI-compatible POST /v1/chat/completions handler
- Supports streaming (text/event-stream SSE) and non-streaming modes
- Per-request overrides for temperature, top_p, max_tokens, frequency_penalty
- Server config starts from persisted GenerationConfig.load() so user
  settings apply to API calls too
- /v1/models now returns the real loaded model ID instead of hardcoded 'local'
- Uses AsyncStream<ByteBuffer> + .init(asyncSequence:) — same pattern as
  the production SwiftLM server
…ettings/thinking/API

SettingsView — copyable endpoint card (Engine tab)
- Replace plain host:port text with a tappable URL card
- Shows Online/Offline dot with glow, full http://host:port in monospace
- One-tap copy: doc.on.doc icon → checkmark for 2s, works on macOS + iOS
- When online: shows 'Compatible with OpenAI SDK, LM Studio, Continue, Cursor'
- Green border glow when server is live

GenerationConfigPersistenceTests (20 new tests)
- Codable round-trip: all fields including nil kvBits
- Default values guard: prevents silently changing defaults
- Save/load persistence contract via JSONEncoder/Decoder
- Thinking mode: enable_thinking additionalContext mapping for both true/false
- Codable survival of enableThinking toggle
- Chat endpoint message mapping: system/user/assistant/unknown/missing content
- Per-request override application and non-interference with other fields
- stream flag defaulting to false per OpenAI spec
…config fields

streamExperts / turboKV removed from GenerationConfig
- Both were architecturally dead: streamExperts is auto-activated at load
  time via ModelCatalog.isMoE; turboKV had no downstream wiring in
  GenerateParameters or the mlx-lm call chain
- Engine tab now shows an 'Advanced Engine' info card explaining
  SSD streaming is automatic for MoE models and directing users to
  kvBits for cache quantisation

seed wired end-to-end
- MLX.seed(seed) called before container.prepare() in generate()
- Seed UI in Output card: lock icon to fix a seed, xmark to go random
- Fixed seed shows 'same input → identical output' hint

Settings applied toast (Generation tab)
- .onChange watchers on all 10 config fields flash a green
  'Applied — takes effect on next message' capsule for 2s
- Makes clear no restart is needed: params are hot-applied per request

CLI Equivalent card (Engine tab)
- Computes the equivalent `swift run SwiftLM` command from live settings
- Only emits non-default flags (keeps command readable)
- Tap to copy; checkmark confirmation for 2s; horizontally scrollable
- Shows real loaded model ID when available

iOS Performance card fixed
- Was displaced outside #if os(iOS) guard by previous edit
Comment 1 (InferenceEngine.swift:525) — already fixed:
  stripThinkingTags only trims whitespace when a tag was actually
  removed (guarded by the 'stripped' flag), so untouched assistant
  messages keep original formatting.

Comment 2 (ThinkingTagStripTests.swift:15) — already fixed:
  MLXInferenceCore is a declared SwiftLMTests dependency; tests
  use @testable import MLXInferenceCore against the real module.

Comment 3 (ThinkingTagStripTests.swift:37) — already fixed:
  Tests call the production stripThinkingTags() function directly,
  not a local copy.

Comment 4 (ThinkingTagStripTests.swift:150) — fixed here:
  Added testRoleMapping_AssistantProducesAssistantNotModel_InWireDict()
  which replicates the exact message-dict build path from generate()
  and asserts ['role': 'assistant'] not ['role': 'model'], so the
  Issue #97 runtime remap cannot silently return without test failure.
  Also added testRoleMapping_ToolRoleIsPreservedInWireDict().

Also fixes:
  - SettingsView: string literal escaping in cliCommand separator
  - SettingsView: srv.parallelSlots → srv.startupConfiguration.parallelSlots
…d fields guard

Context clarification
- The production SwiftLM Server.swift /v1/chat/completions is what OpenCode
  uses and is already exercised by ChatRequestParsingTests + ServerSSETests.
- PR #99 added a SECOND /v1/chat/completions inside the SwiftBuddy embedded
  server (ServerManager.swift). These tests cover that new path.

New: SwiftBuddyServerTests (13 tests)
  /v1/models response shape
  - testModelsResponse_MatchesOpenAISchema: object/data/id/object fields
  - testModelsResponse_FallsBackToLocalWhenNoModelLoaded

  SwiftBuddy SSE delta wire format
  - testSSEDeltaChunk_HasCorrectPrefix: 'data: ' prefix + CRLF CRLF suffix
  - testSSEDeltaChunk_JSONShape: object/id/model/choices/delta structure
  - testSSEDeltaChunk_EscapesSpecialCharacters: newlines in content
  - testSSEDoneTerminator_Format: 'data: [DONE]\r\n\r\n'
  - testSSEDeltaChunk_FinishReasonNull_DuringStreaming
  - testSSEDeltaChunk_FinishReasonStop_AtEnd

  CLI command builder (buildCLICommand extracted to MLXInferenceCore)
  - testCLIBuilder_DefaultsOmitNonDefaultFlags
  - testCLIBuilder_NonDefaultsFlagsEmitted
  - testCLIBuilder_NoModelId_UsesPlaceholder
  - testCLIBuilder_KvBitsDefault_DoesNotEmitGroupSize
  - testCLIBuilder_OutputStartsWithSwiftRunSwiftLM

New: GenerationConfigPersistenceTests +1
  - testGenerationConfig_RemovedFields_AbsentFromJSON: verifies turboKV
    and streamExperts are not in the Codable schema, preventing silent
    re-addition of dead fields

Refactor: SettingsView.cliCommand → buildCLICommand()
  - Extracted 50-line inline compute to MLXInferenceCore/CLICommandBuilder.swift
  - SettingsView now delegates to buildCLICommand() — pure, testable function
  - No behaviour change
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 15 comments.



Comment on lines +8 to +38
// stripThinkingTags is private at file scope in InferenceEngine.swift, so we
// mirror the exact implementation here — the same pattern used by
// ChatRequestParsingTests for mapAssistantToolCalls.

import XCTest
import Foundation
@testable import SwiftLM
import MLXInferenceCore

final class ThinkingTagStripTests: XCTestCase {

    // ── Mirror of the production helper (InferenceEngine.swift) ───────────────
    // Keep in sync if the production implementation changes.

    private func stripThinkingTags(from text: String) -> String {
        var result = text
        while let openRange = result.range(of: "<think>") {
            if let closeRange = result.range(of: "</think>", range: openRange.lowerBound..<result.endIndex) {
                var endIdx = closeRange.upperBound
                if endIdx < result.endIndex && result[endIdx] == "\n" {
                    endIdx = result.index(after: endIdx)
                }
                result.removeSubrange(openRange.lowerBound..<endIdx)
            } else {
                result.removeSubrange(openRange.lowerBound...)
                break
            }
        }
        return result.trimmingCharacters(in: .whitespacesAndNewlines)
    }


Copilot AI Apr 30, 2026

The comment says stripThinkingTags is private in production, but in InferenceEngine.swift it’s currently a module-internal top-level function. Instead of mirroring the implementation in tests (which can drift), consider @testable import MLXInferenceCore and calling the production stripThinkingTags(from:) directly.

Suggested change
// stripThinkingTags is private at file scope in InferenceEngine.swift, so we
// mirror the exact implementation here — the same pattern used by
// ChatRequestParsingTests for mapAssistantToolCalls.
import XCTest
import Foundation
@testable import SwiftLM
import MLXInferenceCore
final class ThinkingTagStripTests: XCTestCase {
    // ── Mirror of the production helper (InferenceEngine.swift) ───────────────
    // Keep in sync if the production implementation changes.
    private func stripThinkingTags(from text: String) -> String {
        var result = text
        while let openRange = result.range(of: "<think>") {
            if let closeRange = result.range(of: "</think>", range: openRange.lowerBound..<result.endIndex) {
                var endIdx = closeRange.upperBound
                if endIdx < result.endIndex && result[endIdx] == "\n" {
                    endIdx = result.index(after: endIdx)
                }
                result.removeSubrange(openRange.lowerBound..<endIdx)
            } else {
                result.removeSubrange(openRange.lowerBound...)
                break
            }
        }
        return result.trimmingCharacters(in: .whitespacesAndNewlines)
    }

// These tests call the production stripThinkingTags(from:) helper directly via
// @testable import so they stay aligned with the implementation in
// MLXInferenceCore.
import XCTest
import Foundation
@testable import SwiftLM
@testable import MLXInferenceCore
final class ThinkingTagStripTests: XCTestCase {

Comment on lines +15 to +17
@Published var config: GenerationConfig = .load() {
    didSet { config.save() }
}

Copilot AI Apr 30, 2026

Persisting GenerationConfig on every didSet can cause a lot of synchronous UserDefaults writes while the user drags sliders (temperature/topP/etc.), which may lead to UI jank. Consider debouncing saves (e.g., save after a short idle delay) or saving only on explicit apply / view dismissal.

Comment on lines +18 to +22
@Published var systemPrompt: String = {
    UserDefaults.standard.string(forKey: "swiftlm.systemPrompt") ?? ""
}() {
    didSet { UserDefaults.standard.set(systemPrompt, forKey: "swiftlm.systemPrompt") }
}

Copilot AI Apr 30, 2026

systemPrompt is persisted to UserDefaults on every didSet (i.e., every keystroke in the TextEditor), which can be unnecessarily expensive and may impact typing responsiveness. Consider debouncing this write (e.g., save after a short delay) or persisting on commit / view exit.

.foregroundStyle(SwiftBuddyTheme.textTertiary)
.font(.callout)
Button {
viewModel.config.seed = UInt64.random(in: 0...UInt64.max)

Copilot AI Apr 30, 2026

This generates seeds across the full UInt64 range, but the UI Stepper currently converts the seed to Int. If a random seed exceeds Int.max, the view will crash when rendering the Stepper binding. Consider generating within 0...UInt64(Int.max) (or updating the UI to avoid Int conversion).

Suggested change
viewModel.config.seed = UInt64.random(in: 0...UInt64.max)
viewModel.config.seed = UInt64.random(in: 0...UInt64(Int.max))

Comment on lines +216 to +220
if streamRequested {
    // ── SSE streaming ───────────────────────────────────
    var sseHeaders = HTTPFields()
    sseHeaders.append(HTTPField(name: .contentType, value: "text/event-stream; charset=utf-8"))
    sseHeaders.append(HTTPField(name: HTTPField.Name("Cache-Control")!, value: "no-cache"))

Copilot AI Apr 30, 2026

parallelSlots is part of the server configuration, but this handler doesn’t apply any concurrency limiting around generation. Without gating, multiple requests can run generation concurrently and overwhelm GPU/memory. Consider using an AsyncSemaphore (capacity = configuration.parallelSlots) around the generation loop, similar to Sources/SwiftLM/Server.swift.

Comment on lines +209 to +213
s.replacingOccurrences(of: "\\", with: "\\\\")
.replacingOccurrences(of: "\"", with: "\\\"")
.replacingOccurrences(of: "\n", with: "\\n")
.replacingOccurrences(of: "\r", with: "\\r")
.replacingOccurrences(of: "\t", with: "\\t")

Copilot AI Apr 30, 2026

jsonEscape(_:) doesn’t escape all JSON control characters (e.g., U+0000…U+001F beyond \n/\r/\t), so certain model outputs can still produce invalid JSON. Consider switching to JSONEncoder for responses (or reuse swiftBuddyJSONString(_) for string fields) to guarantee correct escaping.

Member Author

Implemented. jsonEscape(_:) now uses JSONEncoder, so control characters beyond \n, \r, and \t are encoded correctly and token/full-response JSON stays valid.

Comment on lines +226 to +227
let chunk = "{\"id\":\"\(reqId)\",\"object\":\"chat.completion.chunk\",\"created\":\(created),\"model\":\"\(modelId)\",\"choices\":[{\"index\":0,\"delta\":{\"content\":\"\(jsonEscape(token.text))\"},\"finish_reason\":null}]}"
cont.yield(ByteBuffer(string: "data: \(chunk)\n\n"))

Copilot AI Apr 30, 2026

Even with jsonEscape, modelId is interpolated into the SSE chunk without escaping. If the model id contains quotes/control chars, the chunk JSON becomes invalid. Prefer encoding the whole chunk with JSONEncoder (or at least JSON-escape modelId).

Member Author

Implemented. The SSE path now uses an escaped model id instead of interpolating the raw value, so quoted or control-character-containing ids can no longer break the chunk JSON.

Comment on lines +788 to +792
private func flashApplied() {
    withAnimation { showAppliedBadge = true }
    DispatchQueue.main.asyncAfter(deadline: .now() + 2) {
        withAnimation { showAppliedBadge = false }
    }

Copilot AI Apr 30, 2026

flashApplied() schedules a hide 2s later every time it’s called. Rapid successive changes (e.g., dragging sliders) can stack multiple delayed hides and cause flicker or early disappearance. Consider debouncing/canceling the previously scheduled hide work item before scheduling a new one.

Member Author

Implemented. flashApplied() now cancels any pending hide work item before scheduling a new one, so repeated slider or toggle updates do not stack delayed hides and flicker.

case .ready(let id): modelId = id
default: modelId = "none"
}
let body = "{\"object\":\"list\",\"data\":[{\"id\":\"\(modelId)\",\"object\":\"model\",\"owned_by\":\"swiftbuddy\"}]}"

Copilot AI Apr 30, 2026

The /v1/models response is built via string interpolation without JSON-escaping modelId. If it contains quotes/control characters, this becomes invalid JSON (and can enable response injection). Prefer using swiftBuddyJSONString(modelId) / JSONEncoder to safely encode the string field.

Suggested change
let body = "{\"object\":\"list\",\"data\":[{\"id\":\"\(modelId)\",\"object\":\"model\",\"owned_by\":\"swiftbuddy\"}]}"
let body = "{\"object\":\"list\",\"data\":[{\"id\":\(swiftBuddyJSONString(modelId)),\"object\":\"model\",\"owned_by\":\"swiftbuddy\"}]}"

Member Author

Implemented. /v1/models now JSON-encodes the model id rather than interpolating it raw, which closes the invalid JSON / response-injection edge case.

Comment on lines +215 to +219
// Seed — optional reproducibility
HStack {
    Label("Seed", systemImage: "number")
        .foregroundStyle(SwiftBuddyTheme.textPrimary)
        .font(.callout)

Copilot AI Apr 30, 2026

PR description focuses on fixing Qwen3 TemplateException + thinking-tag sanitization, but this PR also introduces additional user-facing features (seed control, applied badge, endpoint/CLI copy UI) and adds a /v1/chat/completions implementation + config persistence. If intentional, consider reflecting this scope in the PR description/title (or splitting into separate PRs) so reviewers can assess risk and release notes accurately.

Member Author

Agreed on scope. I kept the code changes on the branch, and I am tightening the PR description so the summary reflects the added server/settings work as well.

buildCLICommand() lives in MLXInferenceCore which is linked to the
SwiftBuddy app target, but the SwiftBuddy Xcode target does not pick
up new source files added to a local package without a package resolve.

Fix: inline the equivalent logic directly in SettingsView.cliCommand.
The public buildCLICommand() in MLXInferenceCore is retained for unit
tests (SwiftBuddyServerTests) — the two implementations stay in sync
by the test suite asserting the same flag-emission rules.
Problem 1: SSD Streaming and TurboKV not controllable
  turboKV was removed prematurely — KVCacheSimple.turboQuantEnabled IS
  a real, fully-wired path (same as Server.swift line 1541-1546).
  streamExperts was removed, but the standalone server exposes --stream-experts
  as a deliberate CLI flag for users to control on any model.

Fix:
  - Restore turboKV to GenerationConfig (per-request, sets turboQuantEnabled
    on every KVCacheSimple layer via container.perform before generate())
  - Restore streamExperts to GenerationConfig (load-time preference; MoE
    catalog models still default ON, but user can now override both ways)
  - InferenceEngine.loadVerifiedModel(): shouldStream = isMoE || config.streamExperts
  - UI: replace static info-only 'Advanced Engine' card with real toggles:
      TurboKV toggle (instant, no reload)
      SSD Expert Streaming toggle + inline 'Reload model' prompt when changed

Problem 2: Context window label confusion
  Settings 'Max Tokens: 2048' = max OUTPUT tokens per response
  Chat 'Context: X / 256K' = model's KV cache capacity from config.json
  These are completely different things. The label was causing user confusion.

Fix:
  - Rename slider label to 'Max Response Tokens'
  - Hint now shows the model's actual context window size inline:
    'Max output per reply. Model context window: 262K tokens'

Tests: testGenerationConfig_RestoredFields_PresentWithCorrectDefaults()
  Updated to verify turboKV and streamExperts are present in schema
  with correct defaults (false = user opt-in)
Critical fixes (crashes / invalid JSON / injection):

C1/C2 — Seed UInt64 overflow crash (SettingsView.swift)
  Random generation clamped to 0...UInt64(Int.max)
  Stepper get: uses min(seed, UInt64(Int.max)) to prevent Int overflow trap

C3 — jsonEscape misses U+0000–U+001F control chars (ServerManager.swift)
  Replaced 5-line manual replace chain with JSONEncoder-based escape.
  JSONEncoder guarantees ALL control chars are safely encoded per RFC 8259.

C4 — Raw modelId interpolated in SSE chunks (ServerManager.swift)
  escapedModelId = swiftBuddyJSONString(modelId) computed once, used in
  both streaming (SSE chunk) and non-streaming (response body) paths.

C5 — Raw modelId interpolated in /v1/models (ServerManager.swift)
  Now uses swiftBuddyJSONString(modelId) — same JSONEncoder-backed helper
  already used for the /health route host field.

Medium fixes (correctness / UX):

M1 — tool/developer roles dropped (ServerManager.swift)
  tool     → .tool   (required for OpenAI function-calling protocol)
  developer → .system (OpenAI Responses API convention)
  Unknown roles still fall through to .user (safe default, not rejected)

M2 — Toast flicker on rapid slider drag (SettingsView.swift)
  flashApplied() now cancels the previous DispatchWorkItem before
  scheduling a new delayed hide, preventing stacked closures from
  firing in rapid succession.

M3/M4 — UserDefaults saturated during slider drag (ChatViewModel.swift)
  config.save() and systemPrompt persist debounced at 0.5 s via
  DispatchWorkItem cancel+reschedule, eliminating write pressure during
  continuous slider movement and keystroke input.

L1 — Doc comment said UInt32, impl uses UInt64 (GenerationConfig.swift)
  Corrected to match the actual MLX.seed(UInt64) call site.

New tests (SwiftBuddyServerTests — 101 tests, was 91):
  testJsonEscape_BasicChars
  testJsonEscape_ControlCharsU0000toU001F
  testJsonEscape_ProducesValidJSONWhenInterpolated
  testModelsResponse_ModelIdWithQuotes_IsJsonSafe
  testModelsResponse_SlashInModelId_IsSafe
  testSeed_RandomIsWithinIntMax (1000 iterations)
  testSeed_StepperBinding_ClampsSafely
  testRoleMapping_ToolRoleMapsToChatMessageTool
  testRoleMapping_DeveloperRoleMapsToSystem
  testRoleMapping_UnknownRoleFallsToUser
…icker

The crash 'Publishing changes from within view updates is not allowed' was occurring because the appearance.preference @Published property was being mutated directly by a Picker inside a ScrollView during SwiftUI's layout pass.

Fixes:
1. Extracted the Color Scheme settings into a dedicated Appearance tab to isolate them from the Engine tab's layout cycle.
2. Implemented a custom Binding in the Picker that defers the @Published write using Task { @MainActor in }. This explicitly breaks out of the current view update pass before mutating the AppearanceStore.
@solderzzc solderzzc force-pushed the fix/qwen3-jinja-template-issue-97 branch from e335d1d to 4ac0c23 Compare April 30, 2026 21:09
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

SwiftBuddy/SwiftBuddy/ViewModels/ServerManager.swift:266

  • ServerManager.start() stores an unstructured Task { ... } and then mutates @MainActor-isolated state (isOnline, host, port, etc.) inside that task. Since Task { ... } does not inherit the MainActor, this will either fail to compile under strict concurrency or produce actor-isolation violations at runtime. Update the task to hop back to the MainActor for these assignments (e.g., await MainActor.run { ... }) or create the task as Task { @MainActor in ... } and move the server run off-main if needed.
                let app = Application(
                    router: router,
                    configuration: .init(address: .hostname(configuration.host, port: configuration.port))
                )

                self.isOnline = true
                self.host = configuration.host
                self.port = configuration.port
                self.runningConfiguration = configuration
                self.restartRequired = false


Comment on lines +53 to +56
func testStrip_NoThinkBlock_ReturnsTrimmedOriginal() {
    let input = " Hello, how can I help? "
    XCTAssertEqual(stripThinkingTags(from: input), "Hello, how can I help?")
}

Copilot AI May 1, 2026

This test asserts trimming behavior when there is no <think> tag present, but the production helper intentionally preserves messages byte-for-byte when no tag was removed (to avoid breaking indentation/code blocks). Once the test is wired to the production implementation, this expectation should be updated to reflect the intended whitespace-preservation semantics.

Member Author

Implemented. The test now uses the production stripThinkingTags helper via @testable import MLXInferenceCore, and the no-<think> case asserts the original string is preserved unchanged rather than trimmed.

Comment on lines +334 to +339
// SSD expert streaming:
// - MoE catalog models default ON (required to fit in RAM)
// - User can override via GenerationConfig.streamExperts for custom/non-catalog models
// - isMoE acts as the default; user toggle overrides both ways
let shouldStream = isMoE || GenerationConfig.load().streamExperts
if shouldStream {

Copilot AI May 1, 2026

The comment/docs indicate streamExperts can override the MoE default (including force-disable), but shouldStream = isMoE || GenerationConfig.load().streamExperts makes streaming impossible to turn off for MoE models. Either adjust the logic so the persisted user setting is authoritative after the initial defaulting, or update the docs/UI to reflect that MoE streaming is always-on.

Member Author

Implemented in 321fc21. I added persisted-config awareness so MoE models default SSD streaming on only before the user has saved a preference; after that, the saved setting is authoritative and can force-disable streaming.

Comment on lines +886 to +919
/// Build the equivalent `swift run SwiftLM` command from current settings.
private var cliCommand: String {
    let cfg = viewModel.config
    var parts: [String] = []

    if case .ready(let id) = engine.state {
        parts.append("--model \(id)")
    } else {
        parts.append("--model <model-id>")
    }

    parts.append("--host \(server.host)")
    parts.append("--port \(server.port)")
    parts.append("--max-tokens \(cfg.maxTokens)")
    parts.append("--temp \(String(format: "%.2f", cfg.temperature))")

    if cfg.topP < 1.0 { parts.append("--top-p \(String(format: "%.2f", cfg.topP))") }
    if cfg.topK != 50 { parts.append("--top-k \(cfg.topK)") }
    if cfg.minP > 0 { parts.append("--min-p \(String(format: "%.2f", cfg.minP))") }
    if cfg.repetitionPenalty != 1.05 { parts.append("--repeat-penalty \(String(format: "%.2f", cfg.repetitionPenalty))") }
    if cfg.prefillSize != 512 { parts.append("--prefill-size \(cfg.prefillSize)") }
    if let kv = cfg.kvBits {
        parts.append("--kv-bits \(kv)")
        if cfg.kvGroupSize != 64 { parts.append("--kv-group-size \(cfg.kvGroupSize)") }
    }
    if cfg.enableThinking { parts.append("--thinking") }
    if let seed = cfg.seed { parts.append("--seed \(seed)") }
    if server.startupConfiguration.parallelSlots > 1 {
        parts.append("--parallel \(server.startupConfiguration.parallelSlots)")
    }
    if !server.startupConfiguration.apiKey.isEmpty { parts.append("--api-key <redacted>") }

    return "swift run SwiftLM " + parts.joined(separator: " \\\n ")
}

Copilot AI May 1, 2026

cliCommand here duplicates the new buildCLICommand(...) helper added in MLXInferenceCore. To avoid the Settings UI and the shared builder drifting (flag defaults, formatting, redaction), consider delegating to buildCLICommand(config:host:port:parallel:apiKeySet:modelId:) instead of maintaining a second copy.

Member Author

Implemented in 321fc21. SettingsView.cliCommand now delegates to the shared buildCLICommand(...) helper instead of maintaining a second formatter.

Comment on lines +669 to +677
Picker("", selection: Binding(
get: { appearance.preference },
set: { newValue in
localColorScheme = newValue
// Defer the @Published write to avoid the view update crash
Task { @MainActor in
appearance.preference = newValue
}
}

Copilot AI May 1, 2026

The Appearance picker stores changes into localColorScheme, but the picker’s selection getter still reads from appearance.preference. This can cause the UI to momentarily snap back (and localColorScheme is effectively unused). If the goal is to avoid publishing changes during view updates and keep the picker responsive, bind the picker selection to localColorScheme and then asynchronously propagate that value to appearance.preference.

Member Author

Implemented. The Appearance picker is now bound to local state and asynchronously propagates to appearance.preference, which avoids the snap-back/update-cycle issue.

@solderzzc solderzzc changed the title from "fix(inference): resolve Qwen3 TemplateException on multi-turn chat (#97)" to "fix(inference/ui): resolve Qwen3 template issue and harden SwiftBuddy server/settings (#97)" May 1, 2026
@solderzzc solderzzc merged commit aaf998a into main May 1, 2026
11 checks passed
@solderzzc solderzzc deleted the fix/qwen3-jinja-template-issue-97 branch May 1, 2026 17:25