[Quantization] Support NVFP4 for inline-swiglu fused MoE experts (MiniMax-M3 / MiniMaxM3VLExperts) #1719
Draft
yifjiang wants to merge 1 commit into
…iMax-M3)
MiniMaxM3VLExperts is a standard transformers 5.x fused-experts container
(3-D gate_up_proj/down_proj + num_experts) but applies SwiGLU inline and has
no act_fn submodule, so _is_fused_experts_module returned False -> the experts
were never wrapped -> nvfp4_experts_only enabled zero expert quantizers and
export raised NotImplementedError("...experts type 'MiniMaxM3VLExperts'...").
Drop the act_fn requirement from the detector. _QuantFusedExperts only
intercepts F.linear and never reads act_fn, and _export_fused_experts is
weight-only, so no export change is needed once detection wraps the experts.
Models needing custom forwards (Llama4, GptOss, DBRX, Qwen3-VL-MoE) remain
excluded earlier via their explicit registrations.
Flip the now-incorrect test_module_missing_act_fn test and add an inline-SwiGLU
synthetic experts detection + calibration test. Add a CHANGELOG entry and a
MiniMax M3 row to the llm_ptq support matrix.
Validated end-to-end on GB200: 14,592 expert weight quantizers enabled,
260 GB NVFP4 checkpoint, wikitext-2 perplexity 5.083 -> 5.420 (+6.6%).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>
What does this PR do?
Type of change: Bug fix (enables NVFP4 quantization of a model that previously produced an empty/failing result)
Overview: Relax `_is_fused_experts_module` so it no longer requires an `act_fn` attribute. This adds NVFP4 support for MiniMax-M3 (`minimax_m3_vl`, a ~428B MoE VLM) and any other fused-MoE model that applies its gating activation inline.
Root cause
MiniMax-M3's routed-experts container, `transformers.models.minimax_m3_vl.modeling_minimax_m3_vl.MiniMaxM3VLExperts`, is a standard transformers-5.x fused-experts module: 3-D `gate_up_proj`/`down_proj` `nn.Parameter`s plus `num_experts`, with a forward that loops over the experts, calling `F.linear(x, gate_up_proj[idx])` and then `F.linear(..., down_proj[idx])`. That is exactly the pattern `_QuantFusedExperts` handles. But `MiniMaxM3VLExperts` applies SwiGLU inline (via `swiglu_alpha`/`swiglu_limit`) and has no `act_fn` submodule, so `_is_fused_experts_module` returned `False`. The experts were never wrapped → `nvfp4_experts_only` enabled zero expert weight quantizers → export then raised `NotImplementedError("...experts type 'MiniMaxM3VLExperts'...")`.
Fix
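In sketch form, the change to the structural check amounts to the following. This is a torch-free illustration, not the actual `_is_fused_experts_module` code: `looks_like_fused_experts`, its `require_act_fn` flag, `FakeParam`, and `InlineSwigluExperts` are hypothetical names invented here, and the tensor shapes are only illustrative.

```python
from dataclasses import dataclass


@dataclass
class FakeParam:
    """Torch-free stand-in for an nn.Parameter; only the rank matters here."""
    shape: tuple

    @property
    def ndim(self) -> int:
        return len(self.shape)


def looks_like_fused_experts(module, require_act_fn: bool = False) -> bool:
    """Hypothetical check: 3-D gate_up_proj/down_proj plus num_experts."""
    gate_up = getattr(module, "gate_up_proj", None)
    down = getattr(module, "down_proj", None)
    if gate_up is None or down is None:
        return False
    if getattr(gate_up, "ndim", None) != 3 or getattr(down, "ndim", None) != 3:
        return False
    if not hasattr(module, "num_experts"):
        return False
    # The old rule additionally demanded an act_fn attribute.
    if require_act_fn and not hasattr(module, "act_fn"):
        return False
    return True


class InlineSwigluExperts:
    """Shaped like the experts described in this PR: no act_fn attribute."""

    def __init__(self, num_experts=4, hidden=8, inter=16):
        self.num_experts = num_experts
        self.gate_up_proj = FakeParam((num_experts, 2 * inter, hidden))
        self.down_proj = FakeParam((num_experts, hidden, inter))


experts = InlineSwigluExperts()
print(looks_like_fused_experts(experts, require_act_fn=True))   # False (old rule)
print(looks_like_fused_experts(experts, require_act_fn=False))  # True (relaxed rule)

# A module that does carry act_fn matches under both rules.
legacy = InlineSwigluExperts()
legacy.act_fn = lambda v: v
print(looks_like_fused_experts(legacy, require_act_fn=True))    # True
print(looks_like_fused_experts(legacy, require_act_fn=False))   # True
```

The last two lines show why the relaxation is backward compatible in this sketch: removing a condition can only widen the set of matching modules.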
Drop the `act_fn` requirement from the structural detector. `act_fn` is never consumed on this path: `_QuantFusedExperts` only intercepts `torch.nn.functional.linear` (the activation runs in the model's own forward), and `_export_fused_experts` operates purely on weight tensors and quantizer scales. So no export-side change is needed: once the experts are wrapped, calibration and the existing unified-export fused path work as-is. The remaining signature (3-D `gate_up_proj` + `down_proj`, plus `num_experts`) is distinctive, and models needing custom forwards (Llama4, GptOss, DBRX, Qwen3-VL-MoE) are excluded before this generic detector by their explicit registrations (the registry guard, not `act_fn`).
Usage (MiniMax-M3)
Load via transformers ≥5.12 native support (no `--trust_remote_code`).
Usage / testing
`tests/unit/torch/quantization/plugins/test_fused_experts.py`:
- Flipped `test_module_missing_act_fn_not_detected` → `test_module_missing_act_fn_still_detected` (it previously asserted the now-incorrect behavior and would otherwise fail CI).
- Added `_SyntheticFusedExpertsInlineSwiglu` (no `act_fn`, inline SwiGLU, mirroring `MiniMaxM3VLExperts`), plus a detection test and an end-to-end convert/calibrate test asserting that every per-expert weight/input quantizer gets an `amax`.
Before your PR is "Ready for review"
- Existing `act_fn`-bearing fused experts still match.
- Updated `CHANGELOG.rst` (0.46) and the `examples/llm_ptq/README.md` support matrix (added a MiniMax M3 row + footnote).
Additional Information
- No `trust_remote_code` needed.
- `pyproject.toml` pins `transformers >=4.56,<5.10`, but M3 needs 5.12, so M3 itself is not exercisable in CI. The fix is independent of the pin and this PR does not bump it; the synthetic inline-SwiGLU tests are the automated guard, and the end-to-end behavior was validated on GB200 (numbers above).