test(profiling): Update harness tolerance for 35B SSD-streaming #103
Open
Conversation
- Add `enableMTP` (Bool) and `numMTPTokens` (Int) to GenerationConfig.
- `InferenceEngine.generate()` routes to `generateMTP()` when `config.enableMTP` is true and the loaded model conforms to `MTPLanguageModel`; it falls back gracefully to the standard path otherwise (see the sketch below).
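For reference, a minimal sketch of the routing this commit describes. Only `enableMTP`, `numMTPTokens`, `generateMTP()`, and `MTPLanguageModel` come from the commit message; the `model` property, `generateStandard`, and the signatures are assumptions, not the actual SwiftLM API.

```swift
// Sketch only — surrounding InferenceEngine API is assumed.
func generate(prompt: String, config: GenerationConfig) async throws -> String {
    if config.enableMTP, let mtpModel = model as? MTPLanguageModel {
        // Draft `numMTPTokens` candidate tokens per step, then verify
        // them in a single batched forward pass.
        return try await generateMTP(prompt: prompt,
                                     model: mtpModel,
                                     draftTokens: config.numMTPTokens)
    }
    // Graceful fallback: standard autoregressive decoding.
    return try await generateStandard(prompt: prompt, config: config)
}
```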
- Added `--mtp` and `--num-mtp-tokens` CLI flags to Server.swift.
- Automatically injects `SWIFTLM_MTP_ENABLE=1` into the environment during startup when `--mtp` is specified (a sketch follows).
- Exposed the MTP configuration through ServerConfig and the startup logs.
- Refactored MLXLMCommon.generate invocations to call `generateMTP()` when MTP is enabled and the model conforms to `MTPLanguageModel`.
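The load-time environment injection plausibly reduces to something like this; `config.mtpEnabled` and `config.numMTPTokens` are assumed field names, while `SWIFTLM_MTP_ENABLE` is taken from the commit text.

```swift
import Foundation

// Hedged sketch: the variable must be set before the model loads so the
// checkpoint's `mtp.*` weights are retained (per the doc comment below).
if config.mtpEnabled {
    setenv("SWIFTLM_MTP_ENABLE", "1", 1)  // overwrite if already present
    print("MTP speculative decoding enabled (\(config.numMTPTokens) draft tokens/round)")
}
```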
- Added an "MTP Speculative Decoding" toggle to the Advanced Engine settings pane.
- Added a slider to configure the number of MTP draft tokens per round (1–5); see the SwiftUI sketch below.
- Integrated the MTP toggle with the engine auto-reload mechanism, mirroring SSD Streaming.
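A minimal SwiftUI sketch of such a slider. `EngineSettings` stands in for SwiftBuddy's real settings model, which is not shown in this PR excerpt.

```swift
import SwiftUI

// Sketch only — `EngineSettings` is an assumed stand-in for the app's model.
final class EngineSettings: ObservableObject {
    @Published var numMTPTokens: Int = 2
}

struct MTPDraftTokenSlider: View {
    @ObservedObject var settings: EngineSettings

    var body: some View {
        // Slider operates on Double, so the Int setting is bridged through
        // a computed Binding (1–5 draft tokens per round).
        Slider(
            value: Binding(
                get: { Double(settings.numMTPTokens) },
                set: { settings.numMTPTokens = Int($0) }
            ),
            in: 1...5,
            step: 1
        )
    }
}
```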
- Increased the server initialization timeout to 300 s in profile_runner.py to accommodate very large FP8 models.
- Introduced the fp8_mtp_harness.py test suite for automated speculative-decoding validation.
Pull request overview
This PR expands SwiftLM/SwiftBuddy support for MTP (Multi-Token Prediction) speculative decoding and updates profiling tooling to better benchmark large FP8 MoE models.
Changes:
- Added MTP enablement + token/round configuration paths across the SwiftLM CLI server and SwiftBuddy settings UI (with auto-reload for load-time toggles).
- Added inference-side MTP generation selection and exposed lightweight last-turn performance metrics (TTFT/prefill/decode throughput).
- Updated profiling configuration/output handling and introduced a dedicated FP8 MTP benchmark harness script.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| SwiftBuddy/SwiftBuddy/Views/SettingsView.swift | Adds MTP toggle + draft-token slider, and auto-reloads model on load-time setting changes with a reloading indicator. |
| Sources/SwiftLM/Server.swift | Adds --mtp / --num-mtp-tokens flags, sets env for MTP at load time, and routes generation through MTP when supported. |
| Sources/MLXInferenceCore/InferenceEngine.swift | Adds MTP generation path selection and publishes last-turn inference metrics. |
| Sources/MLXInferenceCore/GenerationConfig.swift | Persists new MTP settings in the shared generation configuration model. |
| scripts/profiling/profile_runner.py | Updates benchmark config matrix and TTFT reporting behavior for large-model profiling runs. |
| scripts/profiling/fp8_mtp_harness.py | New end-to-end harness to wait for FP8 shard availability, run benchmarks, and validate speedup. |
Comment on lines +53 to +61

```swift
/// Enable MTP (Multi-Token Prediction) speculative decoding.
/// When true, the inference engine will use the model's internal MTP heads
/// to draft `numMTPTokens` candidate tokens per step, then verify them in
/// a single batched forward pass — targeting 2x+ throughput improvement.
/// Requires a checkpoint that retains `mtp.*` weights (set SWIFTLM_MTP_ENABLE=1
/// at model-load time). No-ops gracefully if the model does not conform to
/// `MTPLanguageModel`.
/// ⚠️ LOAD-TIME flag: changes take effect on the next model load.
public var enableMTP: Bool
```
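As a usage illustration (the engine wiring around it is assumed; the field names come from the diff above):

```swift
// Illustrative only.
var config = GenerationConfig()
config.enableMTP = true    // load-time flag; applies on the next model load
config.numMTPTokens = 3    // candidate tokens drafted per verification step
```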
Comment on lines +56 to +60

```swift
/// a single batched forward pass — targeting 2x+ throughput improvement.
/// Requires a checkpoint that retains `mtp.*` weights (set SWIFTLM_MTP_ENABLE=1
/// at model-load time). No-ops gracefully if the model does not conform to
/// `MTPLanguageModel`.
/// ⚠️ LOAD-TIME flag: changes take effect on the next model load.
```

```swift
public var prefillToksPerSec: Double
/// Decode throughput — tokens generated per second after the first token.
public var decodeToksPerSec: Double
```
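For context, a hedged sketch of how such last-turn metrics are typically derived. Only `prefillToksPerSec` and `decodeToksPerSec` appear in the diff; the container name, TTFT field, and arithmetic here are assumptions.

```swift
// Assumed shape of the published last-turn metrics.
struct LastTurnMetrics {
    var ttftSeconds: Double          // time to first token
    var prefillToksPerSec: Double    // prompt tokens / prefill seconds
    var decodeToksPerSec: Double     // tokens after the first / decode seconds
}

func makeMetrics(promptTokens: Int, generatedTokens: Int,
                 prefillSeconds: Double, decodeSeconds: Double) -> LastTurnMetrics {
    LastTurnMetrics(
        ttftSeconds: prefillSeconds,
        prefillToksPerSec: Double(promptTokens) / max(prefillSeconds, .ulpOfOne),
        decodeToksPerSec: Double(max(generatedTokens - 1, 0)) / max(decodeSeconds, .ulpOfOne)
    )
}
```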
Comment on lines +351 to 356

```swift
toggleRow(
    label: "MTP Speculative Decoding", icon: "bolt.horizontal.fill",
    isOn: mtpBinding,
    tint: SwiftBuddyTheme.accent,
    hint: "2x+ throughput using Multi-Token Prediction (auto-reloads model)"
)
```
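Since the toggle auto-reloads the model, `mtpBinding` is presumably a write-through binding along these lines. This is a sketch: `settings` and the `engine.reloadModel()` entry point are assumptions, not code from this PR.

```swift
// Hypothetical shape of `mtpBinding`: persist the load-time flag, then
// reload the model so it takes effect, mirroring the SSD Streaming toggle.
private var mtpBinding: Binding<Bool> {
    Binding(
        get: { settings.enableMTP },
        set: { newValue in
            settings.enableMTP = newValue
            Task { await engine.reloadModel() }  // assumed reload hook
        }
    )
}
```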
Comment on lines +14 to +19

```python
# Baseline: no extras — establishes raw TPS floor on FP8 dequanted BF16
{"name": "Baseline", "flags": ["--stream-experts"]},
# MTP Speculative — measures speculative gain
{"name": "MTP Speculative", "flags": ["--mtp", "--stream-experts"]},
# MTP + TurboKV — target production config
{"name": "MTP + TurboQuant", "flags": ["--mtp", "--turbo-kv", "--stream-experts"]},
```
Comment on lines 366 to +378

```diff
 if ok:
     results.append({
         "config": config["name"],
         "context": ctx_size,
-        "ttft": f"{ttft:.2f}",
+        "ttft": f"{ttft:.2f}" if ttft is not None else "N/A",
         "tps": f"{tps:.2f}",
         "static_mem": static_mem,
         "os_ram": os_ram,
         "gpu_alloc": f"{gpu_alloc:.1f}",
         "gpu_in_use_peak": f"{peak_in_use:.1f}",
     })
-    print(f" TTFT={ttft:.2f}s TPS={tps:.2f} OS_RAM={os_ram}GB GPU_Alloc={gpu_alloc:.1f}GB GPU_InUse(peak)={peak_in_use:.1f}GB")
+    ttft_str = f"{ttft:.2f}" if ttft is not None else "N/A"
+    print(f" TTFT={ttft_str}s TPS={tps:.2f} OS_RAM={os_ram}GB GPU_Alloc={gpu_alloc:.1f}GB GPU_InUse(peak)={peak_in_use:.1f}GB")
```
Comment on lines +43 to +48

```python
def find_snapshot_dir():
    """Return the first (and only) snapshot hash directory."""
    try:
        snaps = os.listdir(HF_CACHE_PATH)
        if snaps:
            return os.path.join(HF_CACHE_PATH, snaps[0])
```
Resolves #102