test(profiling): Update harness tolerance for 35B SSD-streaming #103
Open
Conversation
- Add `enableMTP` (Bool) and `numMTPTokens` (Int) to GenerationConfig.
- `InferenceEngine.generate()` routes to `generateMTP()` when `config.enableMTP` is true and the loaded model conforms to `MTPLanguageModel`; it falls back gracefully to the standard path otherwise (see the sketch below).
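For reference, a minimal sketch of the routing this commit describes. Only `enableMTP`, `numMTPTokens`, `generateMTP()`, and `MTPLanguageModel` come from the commit message; the `model` property, `generateStandard`, and the signatures are assumptions, not the actual SwiftLM API.

```swift
// Sketch only — surrounding InferenceEngine API is assumed.
func generate(prompt: String, config: GenerationConfig) async throws -> String {
    if config.enableMTP, let mtpModel = model as? MTPLanguageModel {
        // Draft `numMTPTokens` candidate tokens per step, then verify
        // them in a single batched forward pass.
        return try await generateMTP(prompt: prompt,
                                     model: mtpModel,
                                     draftTokens: config.numMTPTokens)
    }
    // Graceful fallback: standard autoregressive decoding.
    return try await generateStandard(prompt: prompt, config: config)
}
```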
- Added `--mtp` and `--num-mtp-tokens` CLI flags to Server.swift.
- Automatically injects `SWIFTLM_MTP_ENABLE=1` into the environment during startup when `--mtp` is specified (a sketch follows).
- Exposed the MTP configuration through ServerConfig and the startup logs.
- Refactored MLXLMCommon.generate invocations to call `generateMTP()` when MTP is enabled and the model conforms to `MTPLanguageModel`.
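The load-time environment injection plausibly reduces to something like this; `config.mtpEnabled` and `config.numMTPTokens` are assumed field names, while `SWIFTLM_MTP_ENABLE` is taken from the commit text.

```swift
import Foundation

// Hedged sketch: the variable must be set before the model loads so the
// checkpoint's `mtp.*` weights are retained (per the doc comment below).
if config.mtpEnabled {
    setenv("SWIFTLM_MTP_ENABLE", "1", 1)  // overwrite if already present
    print("MTP speculative decoding enabled (\(config.numMTPTokens) draft tokens/round)")
}
```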
- Added an "MTP Speculative Decoding" toggle to the Advanced Engine settings pane.
- Added a slider to configure the number of MTP draft tokens per round (1–5); see the SwiftUI sketch below.
- Integrated the MTP toggle with the engine auto-reload mechanism, mirroring SSD Streaming.
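A minimal SwiftUI sketch of such a slider. `EngineSettings` stands in for SwiftBuddy's real settings model, which is not shown in this PR excerpt.

```swift
import SwiftUI

// Sketch only — `EngineSettings` is an assumed stand-in for the app's model.
final class EngineSettings: ObservableObject {
    @Published var numMTPTokens: Int = 2
}

struct MTPDraftTokenSlider: View {
    @ObservedObject var settings: EngineSettings

    var body: some View {
        // Slider operates on Double, so the Int setting is bridged through
        // a computed Binding (1–5 draft tokens per round).
        Slider(
            value: Binding(
                get: { Double(settings.numMTPTokens) },
                set: { settings.numMTPTokens = Int($0) }
            ),
            in: 1...5,
            step: 1
        )
    }
}
```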
- Increased the server initialization timeout to 300 s in profile_runner.py to accommodate very large FP8 models.
- Introduced the fp8_mtp_harness.py test suite for automated speculative-decoding validation.
Pull request overview
This PR expands SwiftLM/SwiftBuddy support for MTP (Multi-Token Prediction) speculative decoding and updates profiling tooling to better benchmark large FP8 MoE models.
Changes:
- Added MTP enablement + token/round configuration paths across the SwiftLM CLI server and SwiftBuddy settings UI (with auto-reload for load-time toggles).
- Added inference-side MTP generation selection and exposed lightweight last-turn performance metrics (TTFT/prefill/decode throughput).
- Updated profiling configuration/output handling and introduced a dedicated FP8 MTP benchmark harness script.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| SwiftBuddy/SwiftBuddy/Views/SettingsView.swift | Adds MTP toggle + draft-token slider, and auto-reloads model on load-time setting changes with a reloading indicator. |
| Sources/SwiftLM/Server.swift | Adds --mtp / --num-mtp-tokens flags, sets env for MTP at load time, and routes generation through MTP when supported. |
| Sources/MLXInferenceCore/InferenceEngine.swift | Adds MTP generation path selection and publishes last-turn inference metrics. |
| Sources/MLXInferenceCore/GenerationConfig.swift | Persists new MTP settings in the shared generation configuration model. |
| scripts/profiling/profile_runner.py | Updates benchmark config matrix and TTFT reporting behavior for large-model profiling runs. |
| scripts/profiling/fp8_mtp_harness.py | New end-to-end harness to wait for FP8 shard availability, run benchmarks, and validate speedup. |
Comment on lines +53 to +61

```swift
/// Enable MTP (Multi-Token Prediction) speculative decoding.
/// When true, the inference engine will use the model's internal MTP heads
/// to draft `numMTPTokens` candidate tokens per step, then verify them in
/// a single batched forward pass — targeting 2x+ throughput improvement.
/// Requires a checkpoint that retains `mtp.*` weights (set SWIFTLM_MTP_ENABLE=1
/// at model-load time). No-ops gracefully if the model does not conform to
/// `MTPLanguageModel`.
/// ⚠️ LOAD-TIME flag: changes take effect on the next model load.
public var enableMTP: Bool
```
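As a usage illustration (the engine wiring around it is assumed; the field names come from the diff above):

```swift
// Illustrative only.
var config = GenerationConfig()
config.enableMTP = true    // load-time flag; applies on the next model load
config.numMTPTokens = 3    // candidate tokens drafted per verification step
```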
Comment on lines +56 to +60

```swift
/// a single batched forward pass — targeting 2x+ throughput improvement.
/// Requires a checkpoint that retains `mtp.*` weights (set SWIFTLM_MTP_ENABLE=1
/// at model-load time). No-ops gracefully if the model does not conform to
/// `MTPLanguageModel`.
/// ⚠️ LOAD-TIME flag: changes take effect on the next model load.
```

```swift
public var prefillToksPerSec: Double
/// Decode throughput — tokens generated per second after the first token.
public var decodeToksPerSec: Double
```
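For context, a hedged sketch of how such last-turn metrics are typically derived. Only `prefillToksPerSec` and `decodeToksPerSec` appear in the diff; the container name, TTFT field, and arithmetic here are assumptions.

```swift
// Assumed shape of the published last-turn metrics.
struct LastTurnMetrics {
    var ttftSeconds: Double          // time to first token
    var prefillToksPerSec: Double    // prompt tokens / prefill seconds
    var decodeToksPerSec: Double     // tokens after the first / decode seconds
}

func makeMetrics(promptTokens: Int, generatedTokens: Int,
                 prefillSeconds: Double, decodeSeconds: Double) -> LastTurnMetrics {
    LastTurnMetrics(
        ttftSeconds: prefillSeconds,
        prefillToksPerSec: Double(promptTokens) / max(prefillSeconds, .ulpOfOne),
        decodeToksPerSec: Double(max(generatedTokens - 1, 0)) / max(decodeSeconds, .ulpOfOne)
    )
}
```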
Comment on lines +351 to 356

```swift
toggleRow(
    label: "MTP Speculative Decoding", icon: "bolt.horizontal.fill",
    isOn: mtpBinding,
    tint: SwiftBuddyTheme.accent,
    hint: "2x+ throughput using Multi-Token Prediction (auto-reloads model)"
)
```
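Since the toggle auto-reloads the model, `mtpBinding` is presumably a write-through binding along these lines. This is a sketch: `settings` and the `engine.reloadModel()` entry point are assumptions, not code from this PR.

```swift
// Hypothetical shape of `mtpBinding`: persist the load-time flag, then
// reload the model so it takes effect, mirroring the SSD Streaming toggle.
private var mtpBinding: Binding<Bool> {
    Binding(
        get: { settings.enableMTP },
        set: { newValue in
            settings.enableMTP = newValue
            Task { await engine.reloadModel() }  // assumed reload hook
        }
    )
}
```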
Comment on lines +14 to +19

```python
# Baseline: no extras — establishes raw TPS floor on FP8 dequanted BF16
{"name": "Baseline", "flags": ["--stream-experts"]},
# MTP Speculative — measures speculative gain
{"name": "MTP Speculative", "flags": ["--mtp", "--stream-experts"]},
# MTP + TurboKV — target production config
{"name": "MTP + TurboQuant", "flags": ["--mtp", "--turbo-kv", "--stream-experts"]},
```
Comment on lines 366 to +378

```diff
 if ok:
     results.append({
         "config": config["name"],
         "context": ctx_size,
-        "ttft": f"{ttft:.2f}",
+        "ttft": f"{ttft:.2f}" if ttft is not None else "N/A",
         "tps": f"{tps:.2f}",
         "static_mem": static_mem,
         "os_ram": os_ram,
         "gpu_alloc": f"{gpu_alloc:.1f}",
         "gpu_in_use_peak": f"{peak_in_use:.1f}",
     })
-    print(f" TTFT={ttft:.2f}s TPS={tps:.2f} OS_RAM={os_ram}GB GPU_Alloc={gpu_alloc:.1f}GB GPU_InUse(peak)={peak_in_use:.1f}GB")
+    ttft_str = f"{ttft:.2f}" if ttft is not None else "N/A"
+    print(f" TTFT={ttft_str}s TPS={tps:.2f} OS_RAM={os_ram}GB GPU_Alloc={gpu_alloc:.1f}GB GPU_InUse(peak)={peak_in_use:.1f}GB")
```
Comment on lines +43 to +48

```python
def find_snapshot_dir():
    """Return the first (and only) snapshot hash directory."""
    try:
        snaps = os.listdir(HF_CACHE_PATH)
        if snaps:
            return os.path.join(HF_CACHE_PATH, snaps[0])
```
Resolves #102