[Quantization] Support NVFP4 for inline-swiglu fused MoE experts (MiniMax-M3 / MiniMaxM3VLExperts) by yifjiang · Pull Request #1719 · NVIDIA/Model-Optimizer

yifjiang · 2026-06-14T18:37:50Z

What does this PR do?

Type of change: Bug fix (enables NVFP4 quantization of a model whose experts were previously left unquantized, which made export fail)

Overview: Relax _is_fused_experts_module so it no longer requires an act_fn attribute, which adds NVFP4 support for MiniMax-M3 (minimax_m3_vl, a ~428B MoE VLM) and any other fused-MoE model that applies its gating activation inline.

Root cause

MiniMax-M3's routed-experts container, transformers.models.minimax_m3_vl.modeling_minimax_m3_vl.MiniMaxM3VLExperts, is a standard transformers-5.x fused-experts module: 3-D gate_up_proj and down_proj nn.Parameters plus num_experts, with a forward that loops over experts calling F.linear(x, gate_up_proj[idx]) and then F.linear(..., down_proj[idx]). That is exactly the pattern _QuantFusedExperts handles. However, MiniMaxM3VLExperts applies SwiGLU inline (via swiglu_alpha/swiglu_limit) and has no act_fn submodule, so _is_fused_experts_module returned False. As a result, the experts were never wrapped, nvfp4_experts_only enabled zero expert weight quantizers, and export then raised:
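The detection failure can be illustrated with a toy structural check. This is a minimal sketch under stated assumptions, not ModelOpt's actual _is_fused_experts_module: the function name looks_like_fused_experts, the require_act_fn switch, and the stub modules are all hypothetical, and the stubs only expose an ndim attribute in place of real nn.Parameters.

```python
from types import SimpleNamespace


def looks_like_fused_experts(module, require_act_fn):
    """Toy structural check; NOT ModelOpt's actual _is_fused_experts_module."""
    for name in ("gate_up_proj", "down_proj"):
        weight = getattr(module, name, None)
        # Fused experts stack all per-expert weights into one 3-D parameter.
        if getattr(weight, "ndim", None) != 3:
            return False
    if not hasattr(module, "num_experts"):
        return False
    # The stricter variant additionally demands an act_fn attribute.
    if require_act_fn and not hasattr(module, "act_fn"):
        return False
    return True


def fake_param(ndim):
    # Hypothetical stand-in for an nn.Parameter; only the rank matters here.
    return SimpleNamespace(ndim=ndim)


# Shaped like the description of MiniMaxM3VLExperts: no act_fn attribute.
# The swiglu_* values are placeholders, not taken from any model config.
inline_swiglu = SimpleNamespace(
    gate_up_proj=fake_param(3), down_proj=fake_param(3), num_experts=128,
    swiglu_alpha=1.0, swiglu_limit=1.0,
)
# A fused-experts module that does carry an act_fn attribute.
with_act_fn = SimpleNamespace(
    gate_up_proj=fake_param(3), down_proj=fake_param(3), num_experts=8,
    act_fn=lambda x: x,
)
# A dense MLP with 2-D weights should never match.
dense_mlp = SimpleNamespace(gate_up_proj=fake_param(2), down_proj=fake_param(2))

for label, module in [("inline_swiglu", inline_swiglu),
                      ("with_act_fn", with_act_fn),
                      ("dense_mlp", dense_mlp)]:
    strict = looks_like_fused_experts(module, require_act_fn=True)
    relaxed = looks_like_fused_experts(module, require_act_fn=False)
    print(f"{label}: strict={strict} relaxed={relaxed}")
```

Only the inline-SwiGLU stub changes its answer between the strict and the relaxed check, which is the backward-compatible widening this PR describes.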

NotImplementedError: MoE model with experts type 'MiniMaxM3VLExperts' is not supported in export.

Fix

Drop the act_fn requirement from the structural detector. act_fn is never consumed on this path: _QuantFusedExperts only intercepts torch.nn.functional.linear (the activation runs in the model's own forward), and _export_fused_experts operates purely on weight tensors and quantizer scales. No export-side change is therefore needed: once the experts are wrapped, calibration and the existing unified-export fused path work as-is. The remaining signature (3-D gate_up_proj and down_proj plus num_experts) is still distinctive, and models that need custom forwards (Llama4, GptOss, DBRX, Qwen3-VL-MoE) are excluded before this generic detector runs by their explicit registrations; that exclusion comes from the registry guard, not from the act_fn check.

Usage (MiniMax-M3)

python examples/llm_ptq/hf_ptq.py --pyt_ckpt_path <MiniMax-M3 BF16> \
  --qformat nvfp4_experts_only --kv_cache_qformat fp8_cast \
  --attn_implementation eager --calib_size 512 --dataset cnn_dailymail \
  --export_path <out>

Load via transformers ≥5.12 native support (no --trust_remote_code).

Usage / testing

New/updated CPU unit tests in tests/unit/torch/quantization/plugins/test_fused_experts.py:
- Flipped test_module_missing_act_fn_not_detected → test_module_missing_act_fn_still_detected (it previously asserted the now-incorrect behavior and would otherwise fail CI).
- Added _SyntheticFusedExpertsInlineSwiglu (no act_fn, inline SwiGLU, mirroring MiniMaxM3VLExperts) + a detection test and an end-to-end convert/calibrate test asserting every per-expert weight/input quantizer gets an amax.

Validated end-to-end on a GB200 node: MiniMax-M3 (from the BF16 source) → 14,592 expert weight quantizers enabled (57 layers × 128 experts × 2), 260 GB NVFP4 unified-HF checkpoint, validation gate passed, wikitext-2 perplexity BF16 5.083 → NVFP4 5.420 (+6.6%).
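
The reported figures can be cross-checked with a few lines of arithmetic. Reading the trailing "× 2" as the two fused projections per expert (gate_up_proj and down_proj) is an assumption; the PR text does not spell it out.

```python
# Numbers quoted in the validation summary above.
layers = 57
experts_per_layer = 128
projections_per_expert = 2  # assumed: gate_up_proj and down_proj

weight_quantizers = layers * experts_per_layer * projections_per_expert

ppl_bf16 = 5.083
ppl_nvfp4 = 5.420
relative_increase = (ppl_nvfp4 - ppl_bf16) / ppl_bf16

print(weight_quantizers)            # 14592
print(f"{relative_increase:+.1%}")  # +6.6%
```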

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed.
Is this change backward compatible? Yes — strictly widens detection; act_fn-bearing fused experts still match.
Did you write any new necessary tests? Yes (see above).
Did you add or update any necessary documentation? Yes — CHANGELOG.rst (0.46) and the examples/llm_ptq/README.md support matrix (added a MiniMax M3 row + footnote).
Did you update Changelog? Yes.

Additional Information

Security: MiniMax-M3 loads via transformers ≥5.12 native support — no trust_remote_code needed.
Caveat (coverage, not correctness): pyproject.toml pins transformers >=4.56,<5.10, but M3 needs 5.12, so M3 itself is not exercisable in CI. The fix is independent of the pin and this PR does not bump it; the synthetic inline-SwiGLU tests are the automated guard, and the end-to-end behavior was validated on GB200 (numbers above).
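
The version mismatch behind this caveat can be checked with a small comparison. This is a simplified sketch: parse and satisfies_pin are hypothetical helpers that only handle purely numeric dotted versions, and a real check would more likely rely on the packaging library's specifier handling.

```python
def parse(version):
    # Compare numerically, so "5.12" sorts after "5.9" (unlike string order).
    return tuple(int(part) for part in version.split("."))


def satisfies_pin(version, lower="4.56", upper="5.10"):
    """Return True if lower <= version < upper (the range pinned in pyproject.toml)."""
    return parse(lower) <= parse(version) < parse(upper)


print(satisfies_pin("5.12"))  # False: the release MiniMax-M3 needs is outside the pin
print(satisfies_pin("5.9"))   # True
print(satisfies_pin("4.56"))  # True
```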

…iMax-M3)

MiniMaxM3VLExperts is a standard transformers 5.x fused-experts container (3-D gate_up_proj/down_proj + num_experts) but applies SwiGLU inline and has no act_fn submodule, so _is_fused_experts_module returned False -> the experts were never wrapped -> nvfp4_experts_only enabled zero expert quantizers and export raised NotImplementedError("...experts type 'MiniMaxM3VLExperts'...").

Drop the act_fn requirement from the detector. _QuantFusedExperts only intercepts F.linear and never reads act_fn, and _export_fused_experts is weight-only, so no export change is needed once detection wraps the experts. Models needing custom forwards (Llama4, GptOss, DBRX, Qwen3-VL-MoE) remain excluded earlier via their explicit registrations.

Flip the now-incorrect test_module_missing_act_fn test and add an inline-SwiGLU synthetic experts detection + calibration test. Add a CHANGELOG entry and a MiniMax M3 row to the llm_ptq support matrix.

Validated end-to-end on GB200: 14,592 expert weight quantizers enabled, 260 GB NVFP4 checkpoint, wikitext-2 perplexity 5.083 -> 5.420 (+6.6%).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Yifan Jiang <19356972+yifjiang@users.noreply.github.com>

copy-pr-bot · 2026-06-14T18:37:53Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-06-14T18:37:57Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

yifjiang wants to merge 1 commit into NVIDIA:main from yifjiang:fix/minimax-m3-fused-experts-nvfp4
