
test(profiling): Update harness tolerance for 35B SSD-streaming #103

Open
solderzzc wants to merge 4 commits into main from feat/mtp-harness-updates

Conversation


solderzzc (Member) commented May 7, 2026

  • Increase server initialization timeout to 300s in profile_runner.py for massive FP8 models.
  • Introduce fp8_mtp_harness.py test suite for automated speculative decoding validation.

Resolves #102

github-actions Bot added 4 commits May 5, 2026 10:23
- Add enableMTP (Bool) and numMTPTokens (Int) to GenerationConfig
- InferenceEngine.generate() routes to generateMTP() when both
  config.enableMTP is true and the loaded model conforms to
  MTPLanguageModel; graceful fallback to standard path otherwise
- Added --mtp and --num-mtp-tokens CLI flags to Server.swift
- Automatically injects SWIFTLM_MTP_ENABLE=1 into environment during startup if --mtp is specified
- Exposed MTP configuration to ServerConfig and startup logs
- Refactored MLXLMCommon.generate invocations to call generateMTP() when MTP is enabled and the model conforms to MTPLanguageModel
- Added 'MTP Speculative Decoding' toggle to the Advanced Engine settings pane.
- Added a dynamic slider to configure the number of MTP draft tokens per round (1-5).
- Integrated MTP toggle with the engine auto-reloading mechanism, similar to SSD Streaming.
- Increase server initialization timeout to 300s in profile_runner.py for massive FP8 models.
- Introduce fp8_mtp_harness.py test suite for automated speculative decoding validation.
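The 300 s initialization wait from the last commit can be sketched as a polling loop like the one below; the endpoint URL, probe signature, and function name are illustrative, not the actual profile_runner.py code:

```python
import time
from urllib.request import urlopen
from urllib.error import URLError

def wait_for_server(url, timeout=300.0, interval=2.0, probe=None):
    """Poll `url` until it answers or `timeout` seconds elapse.

    Large FP8 checkpoints can take minutes to load, hence the 300 s default.
    `probe` may be injected for testing; by default we hit the URL directly.
    """
    probe = probe or (lambda: urlopen(url, timeout=5).status == 200)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if probe():
                return True  # server is up
        except (URLError, OSError):
            pass  # not ready yet; keep polling
        time.sleep(interval)
    return False  # gave up after `timeout` seconds
```

Raising the timeout rather than the poll interval keeps failure detection fast for small models while still accommodating multi-minute FP8 loads.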
Copilot AI review requested due to automatic review settings May 7, 2026 06:04
solderzzc added the enhancement label May 7, 2026

Copilot AI left a comment


Pull request overview

This PR expands SwiftLM/SwiftBuddy support for MTP (Multi-Token Prediction) speculative decoding and updates profiling tooling to better benchmark large FP8 MoE models.

Changes:

  • Added MTP enablement + token/round configuration paths across the SwiftLM CLI server and SwiftBuddy settings UI (with auto-reload for load-time toggles).
  • Added inference-side MTP generation selection and exposed lightweight last-turn performance metrics (TTFT/prefill/decode throughput).
  • Updated profiling configuration/output handling and introduced a dedicated FP8 MTP benchmark harness script.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 7 comments.

Summary per file:

| File | Description |
| --- | --- |
| SwiftBuddy/SwiftBuddy/Views/SettingsView.swift | Adds MTP toggle + draft-token slider, and auto-reloads the model on load-time setting changes with a reloading indicator. |
| Sources/SwiftLM/Server.swift | Adds `--mtp` / `--num-mtp-tokens` flags, sets the MTP environment variable at load time, and routes generation through MTP when supported. |
| Sources/MLXInferenceCore/InferenceEngine.swift | Adds MTP generation path selection and publishes last-turn inference metrics. |
| Sources/MLXInferenceCore/GenerationConfig.swift | Persists new MTP settings in the shared generation configuration model. |
| scripts/profiling/profile_runner.py | Updates the benchmark config matrix and TTFT reporting behavior for large-model profiling runs. |
| scripts/profiling/fp8_mtp_harness.py | New end-to-end harness to wait for FP8 shard availability, run benchmarks, and validate speedup. |
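The harness's final "validate speedup" step can be sketched as a simple ratio check; the 2x target echoes the linked feature request, but the function name and threshold parameter below are illustrative, not the harness's actual API:

```python
def validate_speedup(baseline_tps: float, mtp_tps: float, min_ratio: float = 2.0):
    """Return (ratio, passed): the MTP-vs-baseline throughput ratio and
    whether it meets the expected speedup threshold."""
    if baseline_tps <= 0:
        raise ValueError("baseline TPS must be positive")
    ratio = mtp_tps / baseline_tps
    return ratio, ratio >= min_ratio
```

Keeping the threshold a parameter lets the harness tolerance be tuned per model size, which is what this PR's title implies for the 35B SSD-streaming case.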


Comment on lines +53 to +61
/// Enable MTP (Multi-Token Prediction) speculative decoding.
/// When true, the inference engine will use the model's internal MTP heads
/// to draft `numMTPTokens` candidate tokens per step, then verify them in
/// a single batched forward pass — targeting 2x+ throughput improvement.
/// Requires a checkpoint that retains `mtp.*` weights (set SWIFTLM_MTP_ENABLE=1
/// at model-load time). No-ops gracefully if the model does not conform to
/// `MTPLanguageModel`.
/// ⚠️ LOAD-TIME flag: changes take effect on the next model load.
public var enableMTP: Bool
Comment on lines +56 to +60
/// a single batched forward pass — targeting 2x+ throughput improvement.
/// Requires a checkpoint that retains `mtp.*` weights (set SWIFTLM_MTP_ENABLE=1
/// at model-load time). No-ops gracefully if the model does not conform to
/// `MTPLanguageModel`.
/// ⚠️ LOAD-TIME flag: changes take effect on the next model load.
public var prefillToksPerSec: Double
/// Decode throughput — tokens generated per second after the first token.
public var decodeToksPerSec: Double

Comment on lines +351 to 356
toggleRow(
label: "MTP Speculative Decoding", icon: "bolt.horizontal.fill",
isOn: mtpBinding,
tint: SwiftBuddyTheme.accent,
hint: "2x+ throughput using Multi-Token Prediction (auto-reloads model)"
)
Comment on lines +14 to +19
# Baseline: no extras — establishes raw TPS floor on FP8 dequanted BF16
{"name": "Baseline", "flags": ["--stream-experts"]},
# MTP Speculative — measures speculative gain
{"name": "MTP Speculative", "flags": ["--mtp", "--stream-experts"]},
# MTP + TurboKV — target production config
{"name": "MTP + TurboQuant", "flags": ["--mtp", "--turbo-kv", "--stream-experts"]},
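A runner consuming this flag matrix could expand each entry into a launch command as sketched below; the binary name and `--model` flag are hypothetical stand-ins for whatever profile_runner.py actually invokes:

```python
import shlex

SERVER_BIN = "swiftlm-server"  # hypothetical binary name

def build_commands(configs, model_path):
    """Expand a benchmark config matrix into (name, shell command) pairs."""
    cmds = []
    for cfg in configs:
        argv = [SERVER_BIN, "--model", model_path, *cfg["flags"]]
        cmds.append((cfg["name"], shlex.join(argv)))
    return cmds
```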
Comment on lines 366 to +378
if ok:
    results.append({
        "config": config["name"],
        "context": ctx_size,
-       "ttft": f"{ttft:.2f}",
+       "ttft": f"{ttft:.2f}" if ttft is not None else "N/A",
        "tps": f"{tps:.2f}",
        "static_mem": static_mem,
        "os_ram": os_ram,
        "gpu_alloc": f"{gpu_alloc:.1f}",
        "gpu_in_use_peak": f"{peak_in_use:.1f}",
    })
-   print(f"  TTFT={ttft:.2f}s TPS={tps:.2f} OS_RAM={os_ram}GB GPU_Alloc={gpu_alloc:.1f}GB GPU_InUse(peak)={peak_in_use:.1f}GB")
+   ttft_str = f"{ttft:.2f}" if ttft is not None else "N/A"
+   print(f"  TTFT={ttft_str}s TPS={tps:.2f} OS_RAM={os_ram}GB GPU_Alloc={gpu_alloc:.1f}GB GPU_InUse(peak)={peak_in_use:.1f}GB")
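The `ttft is not None` guard appears twice in this hunk; it could be factored into a tiny helper (illustrative, not part of the PR):

```python
def fmt_metric(value, fmt="{:.2f}", missing="N/A"):
    """Format a possibly-absent metric, e.g. a TTFT that was not measured."""
    return fmt.format(value) if value is not None else missing
```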
Comment on lines +43 to +48
def find_snapshot_dir():
"""Return the first (and only) snapshot hash directory."""
try:
snaps = os.listdir(HF_CACHE_PATH)
if snaps:
return os.path.join(HF_CACHE_PATH, snaps[0])
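The snippet above is cut off by the review view before its error handling; a defensive variant that tolerates a missing or empty cache might look like this (the cache-path default is a placeholder, not the harness's real constant):

```python
import os

HF_CACHE_PATH = "/tmp/hf-cache/snapshots"  # placeholder path

def find_snapshot_dir(cache_path=HF_CACHE_PATH):
    """Return the first snapshot hash directory under `cache_path`, or None."""
    try:
        entries = sorted(os.listdir(cache_path))
    except FileNotFoundError:
        return None  # cache not downloaded yet
    for name in entries:
        path = os.path.join(cache_path, name)
        if os.path.isdir(path):
            return path
    return None  # cache exists but holds no snapshot directories
```

Sorting the listing also makes the "first" snapshot deterministic if more than one hash directory is ever present.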

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Feature Request: Integrate MTP Speculative Decoding (MTPLX-style) for 2x+ Speedup

2 participants