
MiniMax-M3 mixed MXFP8-base + NVFP4-experts PTQ export#1806

Open
chadvoegele wants to merge 4 commits into
mainfrom
cvoegele/minimax-m3-mixed-mxfp8-nvfp4

Conversation

@chadvoegele

@chadvoegele chadvoegele commented Jun 23, 2026

Contributor

What does this PR do?

Type of change: New example + bug fix

Adds an hf_ptq.py-driven pipeline to export MiniMax-M3 as a mixed-precision checkpoint (vendor MXFP8 base + NVFP4 routed experts), plus the detection fix needed to quantize MiniMax-M3's fused experts at all.

Commits

  • hf_ptq: factor post-parse normalization into prepare_args() — extract the dataset/calib_size splitting and --cast_mxfp4_to_nvfp4 / --specdec_offline_dataset validation out of __main__ into a reusable prepare_args(), and let parse_args(argv) take an explicit argv. CLI behavior unchanged (main(prepare_args(parse_args()))); lets the wrapper drive hf_ptq in-process.
  • minimax_m3: hf_ptq-driven mixed MXFP8-base + NVFP4-experts exporter. New examples/minimax_m3/hf_ptq_mixed_mxfp8_nvfp4.py drives examples/llm_ptq/hf_ptq.py to quantize the routed experts to NVFP4 from the BF16 source, then merges them onto the vendor MXFP8 base (non-expert tensors pass through unchanged) to produce a ModelOpt MIXED_PRECISION HF checkpoint. The vendor config.json is preserved with only its quantization_config replaced, and routed-expert input_scale is forced to 1.0 by default. Unrecognized args are forwarded to hf_ptq.py; the five hf_ptq.py args that the wrapper sets itself are rejected.
  • fix(quantization): detect fused MoE experts without act_fn (MiniMax-M3). _fused_experts_wrapper_class no longer requires an act_fn attribute. Modules like MiniMaxM3VLExperts apply a custom gated activation between the two F.linear calls instead of exposing act_fn, so they were silently skipped: routed experts were left unquantized and HF export raised NotImplementedError. _QuantFusedExperts is activation-agnostic, so the requirement was unnecessary. This is orthogonal to the non-gated NemotronH support added in #1756 ([OMNIML-5003] Support non-gated fused MoE experts (NemotronH) in HF PTQ). Original fix by @zhiyuc.
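The parse_args(argv) / prepare_args() split from the first commit can be sketched with plain argparse. This is a simplified, hypothetical stand-in rather than hf_ptq.py's actual code: it keeps only the comma-separated `--dataset` / `--calib_size` splitting, and the defaults are made up for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse options; argv=None makes argparse fall back to sys.argv[1:]."""
    parser = argparse.ArgumentParser()
    # Hypothetical defaults, for illustration only.
    parser.add_argument("--dataset", default="cnn_dailymail")
    parser.add_argument("--calib_size", default="512")
    return parser.parse_args(argv)


def prepare_args(args):
    """Post-parse normalization that previously lived under __main__ (simplified)."""
    args.dataset = args.dataset.split(",")
    args.calib_size = [int(n) for n in args.calib_size.split(",")]
    return args


# A wrapper can build its own argv and drive the same code path in-process.
args = prepare_args(parse_args(["--dataset", "a,b", "--calib_size", "256,128"]))
print(args.dataset, args.calib_size)
```

Because `parse_args(None)` reads `sys.argv[1:]`, the CLI entry point `main(prepare_args(parse_args()))` behaves exactly as before, while a wrapper simply passes its own list.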

Testing

  • tests/unit/torch/quantization/plugins/test_fused_experts.py — 43 passed (incl. the flipped test_module_missing_act_fn_still_detected).

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Added example script for exporting MiniMax-M3 models with mixed-precision quantization (MXFP8 base + NVFP4 routed experts).
  • Bug Fixes

    • Fixed fused MoE expert auto-detection to work without requiring an act_fn attribute, enabling proper quantization and export of modules with internally-applied gated activations.

chadvoegele and others added 4 commits June 23, 2026 14:06
Extract the dataset/calib_size splitting and --cast_mxfp4_to_nvfp4 /
--specdec_offline_dataset validation out of __main__ into a reusable
prepare_args(), and let parse_args(argv) take an explicit argv. CLI
behavior is unchanged (main(prepare_args(parse_args()))); this lets the
MiniMax-M3 mixed-export wrapper drive hf_ptq in-process.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Wrapper that drives examples/llm_ptq/hf_ptq.py to produce the NVFP4
routed-expert export, then merges it onto the vendor MXFP8 base into a
ModelOpt MIXED_PRECISION checkpoint: non-expert tensors pass through
unchanged from the vendor MXFP8 ckpt, routed experts come from the NVFP4
export (input_scale forced to 1.0 by default), and the vendor config.json
is preserved with only its quantization_config replaced. Unrecognized
args are forwarded to hf_ptq; the 5 it controls are rejected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
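Forwarding unrecognized options while refusing the ones the wrapper owns fits `argparse.ArgumentParser.parse_known_args`. The following is a minimal sketch that assumes hypothetical wrapper options `--bf16_ckpt` / `--output`. The three rejected flags are hf_ptq.py options chosen for illustration, because the PR text does not list the five the wrapper actually controls.

```python
import argparse

# Illustrative guess: the real wrapper rejects five hf_ptq.py options it sets itself.
CONTROLLED_OPTIONS = ("--pyt_ckpt_path", "--export_path", "--qformat")


def has_option(argv, flag):
    """True if argv contains `flag` as either `--flag value` or `--flag=value`."""
    return any(arg == flag or arg.startswith(flag + "=") for arg in argv)


def parse_wrapper_args(argv):
    # allow_abbrev=False stops argparse from reading a prefix such as `--out`
    # as `--output`, which would swallow an option meant for hf_ptq.py.
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("--bf16_ckpt")  # hypothetical wrapper options
    parser.add_argument("--output")
    args, passthrough = parser.parse_known_args(argv)
    rejected = [flag for flag in CONTROLLED_OPTIONS if has_option(passthrough, flag)]
    if rejected:
        raise ValueError(f"options set by the wrapper cannot be passed through: {rejected}")
    return args, passthrough


args, passthrough = parse_wrapper_args(["--bf16_ckpt", "/ckpt", "--calib_size", "512"])
print(args.bf16_ckpt, passthrough)
```

The returned `passthrough` list would then be appended to the argv the wrapper builds for hf_ptq.py.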
Re-applies Zhiyu Cheng's act_fn fix (d9bdf292c) onto main's refactored
_fused_experts_wrapper_class. main's OMNIML-5003 (#1756) added non-gated
NemotronH support but kept the act_fn requirement, so act_fn-less fused
experts (e.g. MiniMaxM3VLExperts, which apply a custom gated activation
between the two F.linear calls) were still skipped -- routed experts
stayed unquantized and HF export raised NotImplementedError.

_QuantFusedExperts is activation-agnostic (it only intercepts the two
F.linear calls), so drop the act_fn requirement from the detection guard.
Enables NVFP4/FP8 PTQ + export for MiniMax-M2 / MiniMax-M3.

Co-Authored-By: Zhiyu Cheng <zhiyuc@nvidia.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
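A torch-free toy illustrates why dropping act_fn is safe. Assumptions: `F` is a stand-in namespace for torch.nn.functional, `ToyFusedExperts` and `custom_gated_act` are hypothetical, and the wrapper only records calls instead of quantizing. The takeaway is that a patch on `linear` sees both projections regardless of which activation runs between them. This is not ModelOpt's `_QuantFusedExperts`.

```python
import types
from contextlib import contextmanager

# Torch-free stand-in for torch.nn.functional: linear(x, w) = w @ x on lists.
F = types.SimpleNamespace(
    linear=lambda x, w: [sum(a * b for a, b in zip(x, row)) for row in w]
)


def custom_gated_act(h):
    """Hypothetical gated activation: the first half gates the second half."""
    half = len(h) // 2
    return [max(g, 0.0) * u for g, u in zip(h[:half], h[half:])]


class ToyFusedExperts:
    """Two linear calls with a custom activation in between and no act_fn attribute."""

    def __init__(self, gate_up_proj, down_proj):
        self.gate_up_proj = gate_up_proj
        self.down_proj = down_proj

    def forward(self, x):
        h = F.linear(x, self.gate_up_proj)
        return F.linear(custom_gated_act(h), self.down_proj)


@contextmanager
def intercept_linear(log):
    """Temporarily wrap F.linear; a quantizer would fake-quantize x and w here."""
    original = F.linear

    def wrapped(x, w):
        log.append((len(x), len(w)))  # record (in_features, out_features)
        return original(x, w)

    F.linear = wrapped
    try:
        yield
    finally:
        F.linear = original


calls = []
experts = ToyFusedExperts([[1, 0], [0, 1], [1, 1], [1, -1]], [[1, 1]])
with intercept_linear(calls):
    y = experts.forward([2.0, 3.0])
print(y, calls)
```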
Adds a 0.46 New Features entry for examples/minimax_m3/hf_ptq_mixed_mxfp8_nvfp4.py
(added in a047288).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@chadvoegele chadvoegele requested review from a team as code owners June 23, 2026 19:43
@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Contributor


📝 Walkthrough

Walkthrough

Removes the act_fn attribute requirement from fused MoE expert auto-detection in huggingface.py. Refactors hf_ptq.py to expose parse_args(argv) and a new prepare_args() for programmatic invocation. Adds examples/minimax_m3/hf_ptq_mixed_mxfp8_nvfp4.py, a new wrapper that produces a mixed MXFP8/NVFP4 MiniMax-M3 checkpoint.

Changes

Fused-expert detection fix

Layer / File(s) Summary
Remove act_fn guard from fused-expert detection
modelopt/torch/quantization/plugins/huggingface.py, tests/unit/torch/quantization/plugins/test_fused_experts.py
Drops act_fn from the _fused_experts_wrapper_class validation guard and updates its docstring; flips the corresponding unit test assertion from False to True for modules without act_fn.
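The guard change can be illustrated with a before/after sketch that uses duck typing on hypothetical attribute names (`gate_up_proj` or `up_proj`, plus `down_proj`). The real checks in `_fused_experts_wrapper_class` may differ. Setting `require_act_fn=True` reproduces the old behavior, which skipped modules like MiniMaxM3VLExperts.

```python
from types import SimpleNamespace


def is_fused_experts(module, require_act_fn=False):
    """Duck-typed fused-experts check (hypothetical attribute names)."""
    # Gated layouts fuse gate+up; non-gated ones (e.g. NemotronH) expose up_proj only.
    has_up = hasattr(module, "gate_up_proj") or hasattr(module, "up_proj")
    has_down = hasattr(module, "down_proj")
    if require_act_fn and not hasattr(module, "act_fn"):
        return False  # old guard: act_fn-less modules were silently skipped
    return has_up and has_down


with_act_fn = SimpleNamespace(gate_up_proj=object(), down_proj=object(), act_fn=object())
custom_act = SimpleNamespace(gate_up_proj=object(), down_proj=object())  # no act_fn

print(is_fused_experts(custom_act, require_act_fn=True))  # old behavior
print(is_fused_experts(custom_act))  # new behavior
```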

MiniMax-M3 mixed MXFP8/NVFP4 checkpoint export

Layer / File(s) Summary
hf_ptq.py programmatic API
examples/llm_ptq/hf_ptq.py
parse_args gains an optional argv parameter; post-parse normalization and validation (CSV splitting, export_fmt deprecation, cast_mxfp4_to_nvfp4 constraints) are extracted into a new prepare_args function; __main__ calls main(prepare_args(parse_args())).
Wrapper CLI, passthrough args, and argv construction
examples/minimax_m3/hf_ptq_mixed_mxfp8_nvfp4.py
Module docstring, sys.path adjustment, hf_ptq import, regex expert-key constants, and _has_option/_validate_passthrough_args/_hf_ptq_argv helpers that build a controlled NVFP4 command line while forwarding unrecognized args; parse_args() returns (args, passthrough).
Index loading, expert classification, and quant config builders
examples/minimax_m3/hf_ptq_mixed_mxfp8_nvfp4.py
_load_index, _is_routed_expert_tensor, _selected_layers, _build_quantized_layers, and _build_quant_config identify routed-expert tensors and generate the per-module MXFP8/NVFP4 quantization configuration map.
Shard copying, config writing, and orchestration
examples/minimax_m3/hf_ptq_mixed_mxfp8_nvfp4.py
_copy_experts_from_nvfp4_export writes NVFP4 expert shards and optionally forces input_scale to 1.0; _copy_mxfp8_bf16_from_base copies non-expert tensors from vendor MXFP8 shards; _write_mixed_config emits config.json and hf_quant_config.json; _export_mixed_mxfp8_nvfp4 orchestrates all steps and copies ancillary files.
main() entrypoint and CHANGELOG
examples/minimax_m3/hf_ptq_mixed_mxfp8_nvfp4.py, CHANGELOG.rst
main() chains the full pipeline: computes intermediate NVFP4 path, guards against existing output, runs hf_ptq, then runs the mixed export. CHANGELOG documents both the new example and the act_fn bug fix.
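The per-module map that _build_quant_config produces might look roughly like the following. The JSON layout (`quant_algo: MIXED_PRECISION` plus a `quantized_layers` dict) is an assumption about hf_quant_config.json, and the module names are hypothetical. Only the NVFP4 block size of 16 comes from the format itself.

```python
import json


def build_mixed_quant_config(expert_modules, base_modules, nvfp4_group_size=16):
    """Per-module quantization map for a mixed checkpoint (assumed schema)."""
    quantized_layers = {name: {"quant_algo": "MXFP8"} for name in base_modules}
    for name in expert_modules:
        if name in quantized_layers:
            raise ValueError(f"{name} is listed as both a base and an expert module")
        quantized_layers[name] = {"quant_algo": "NVFP4", "group_size": nvfp4_group_size}
    return {
        "quantization": {
            "quant_algo": "MIXED_PRECISION",
            # Sorted for a deterministic, diff-friendly JSON file.
            "quantized_layers": dict(sorted(quantized_layers.items())),
        }
    }


cfg = build_mixed_quant_config(
    expert_modules=["model.layers.0.mlp.experts.0.w1"],
    base_modules=["model.layers.0.self_attn.q_proj"],
)
print(json.dumps(cfg, indent=2))
```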

Sequence Diagram

```mermaid
sequenceDiagram
  participant CLI as User / CLI
  participant wrapper as hf_ptq_mixed_mxfp8_nvfp4.main()
  participant hf_ptq as hf_ptq.main()
  participant nvfp4_copy as _copy_experts_from_nvfp4_export
  participant mxfp8_copy as _copy_mxfp8_bf16_from_base
  participant cfg_writer as _write_mixed_config

  CLI->>wrapper: parse_args() → (args, passthrough)
  wrapper->>hf_ptq: parse_args(argv) → prepare_args() → main()
  Note over hf_ptq: Writes intermediate NVFP4 safetensors export
  wrapper->>nvfp4_copy: read NVFP4 index, write per-layer expert shards, patch input_scale
  nvfp4_copy-->>wrapper: updated weight_map + expert module names
  wrapper->>mxfp8_copy: read vendor MXFP8 shards, write base shards
  mxfp8_copy-->>wrapper: updated weight_map
  wrapper->>cfg_writer: compute layer counts, write config.json + hf_quant_config.json
  wrapper->>wrapper: write model.safetensors.index.json, copy ancillary files
```
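The weight_map bookkeeping in the diagram can be sketched as a pure-dict merge of two safetensors index maps. Assumptions: the expert-key regex is hypothetical, and the `base-` / `experts-` shard prefixes are an illustrative way to keep identically named shards from the two checkpoints apart (the real wrapper writes its own per-layer expert shards).

```python
import json
import re

# Hypothetical routed-expert key pattern; the real _EXPERT_TENSOR_RE may differ.
EXPERT_RE = re.compile(r"^model\.layers\.\d+\.mlp\.experts\.")


def merge_weight_maps(mxfp8_map, nvfp4_map, expert_re=EXPERT_RE):
    """Non-expert tensors come from the vendor MXFP8 map, routed experts from NVFP4."""
    weight_map = {}
    for name, shard in mxfp8_map.items():
        if not expert_re.match(name):  # drop the vendor's MXFP8 expert tensors
            weight_map[name] = f"base-{shard}"
    for name, shard in nvfp4_map.items():
        if expert_re.match(name):  # drop the export's BF16 non-expert tensors
            weight_map[name] = f"experts-{shard}"
    return {"metadata": {}, "weight_map": dict(sorted(weight_map.items()))}


mxfp8 = {
    "model.layers.0.self_attn.q_proj.weight": "model-00001.safetensors",
    "model.layers.0.mlp.experts.0.w1.weight": "model-00001.safetensors",
}
nvfp4 = {
    "model.layers.0.self_attn.q_proj.weight": "model-00001.safetensors",
    "model.layers.0.mlp.experts.0.w1.weight": "model-00002.safetensors",
    "model.layers.0.mlp.experts.0.w1.input_scale": "model-00002.safetensors",
}
index = merge_weight_maps(mxfp8, nvfp4)
print(json.dumps(index, indent=2))
```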

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • NVIDIA/Model-Optimizer#1756: Directly related — both PRs modify _fused_experts_wrapper_class in huggingface.py (and its tests) to expand fused-expert layout recognition; this PR removes the act_fn check while that PR added up_proj-only fused-expert support.

Suggested reviewers

  • ChenhanYu
  • sychen52
  • cjluo-nv
  • kevalmorabia97
  • Edwardf0t1
🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 8.70% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: introducing MiniMax-M3 mixed-precision PTQ export with MXFP8 base and NVFP4 experts, which aligns with the primary purpose of this PR.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed No security anti-patterns from SECURITY.md found: no unsafe torch.load/numpy.load, no hardcoded trust_remote_code, no eval/exec on external input, no # nosec comments, no new non-permissive depende...


@github-actions

Contributor
PR Preview Action v1.8.1


🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1806/

Built to branch gh-pages at 2026-06-23 19:46 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@coderabbitai coderabbitai Bot left a comment

Contributor

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 1506-1508: The prepare_args() function warns that export_fmt is
being forced to "hf", but it does not actually modify args.export_fmt to enforce
this behavior. Add a statement after the warning that sets args.export_fmt =
"hf" to match the declared behavior in the warning message and prevent non-hf
values from propagating to main().

In `@examples/minimax_m3/hf_ptq_mixed_mxfp8_nvfp4.py`:
- Line 177: After the _selected_layers function call assigns the result to the
layers variable, add a validation check to ensure layers is not empty. If layers
is empty, raise an exception or exit the script with a clear error message
indicating that no routed-expert layers were found or matched. This prevents the
script from continuing with an incomplete checkpoint that would skip expert
weights during the _copy_mxfp8_bf16_from_base execution.
- Around line 167-170: The imports for torch, safe_open, and save_file are
currently placed inside a function (lines 167-170) instead of at the module
scope, which delays import failures until runtime and violates the repo's import
guidelines. Move these three imports to the top of the file with other
module-level imports. Additionally, address the same issue at lines 217-218
mentioned in the comment by moving any function-local imports there to the
module scope as well.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 19326279-46e4-44b2-9c40-34c1ca2e5631

📥 Commits

Reviewing files that changed from the base of the PR and between 37dbbda and ea1ea09.

📒 Files selected for processing (5)
  • CHANGELOG.rst
  • examples/llm_ptq/hf_ptq.py
  • examples/minimax_m3/hf_ptq_mixed_mxfp8_nvfp4.py
  • modelopt/torch/quantization/plugins/huggingface.py
  • tests/unit/torch/quantization/plugins/test_fused_experts.py

Comment on lines +1506 to +1508
```python
    if args.export_fmt != "hf":
        warnings.warn("Deprecated. --export_fmt forced to hf.")
```

Contributor

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

prepare_args() says --export_fmt is forced, but it never actually forces it

The warning text states forced behavior, but args.export_fmt is left unchanged. That can propagate a non-hf value into main().

Suggested fix
```diff
 def prepare_args(args: argparse.Namespace) -> argparse.Namespace:
     """Apply the same post-parse normalization used by the CLI entrypoint."""
     if args.export_fmt != "hf":
         warnings.warn("Deprecated. --export_fmt forced to hf.")
+        args.export_fmt = "hf"
```
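A small regression test for the suggested fix could look like this. Here `prepare_args` is a hypothetical reduction to the export_fmt branch only, and `"tensorrt_llm"` is just an example of a non-hf value.

```python
import argparse
import warnings


def prepare_args(args):
    """Only the export_fmt branch, with the suggested fix applied."""
    if args.export_fmt != "hf":
        warnings.warn("Deprecated. --export_fmt forced to hf.")
        args.export_fmt = "hf"
    return args


def test_export_fmt_is_forced_to_hf():
    args = argparse.Namespace(export_fmt="tensorrt_llm")
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # make sure the warning is recorded
        out = prepare_args(args)
    assert out.export_fmt == "hf"  # the value is actually forced, not just warned about
    assert any("forced to hf" in str(w.message) for w in caught)


test_export_fmt_is_forced_to_hf()
```

Without the added assignment, the first assertion fails, which is exactly the gap the review points out.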

Comment on lines +167 to +170
```python
    import torch
    from safetensors import safe_open
    from safetensors.torch import save_file
```

Contributor

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Move these imports to module scope

These function-local imports delay import failures until runtime and violate the repo import rule for Python files.

Suggested fix
```diff
 import argparse
 import json
 import re
 import shutil
 import sys
 from collections import defaultdict
 from pathlib import Path
 from typing import Any
 
+import torch
+from safetensors import safe_open
+from safetensors.torch import save_file
+
 _THIS_DIR = Path(__file__).resolve().parent
 _LLM_PTQ_DIR = _THIS_DIR.parent / "llm_ptq"
@@
 def _copy_experts_from_nvfp4_export(
     nvfp4: Path,
     dst: Path,
     layers_arg: str | None,
     force_input_scale_one: bool,
 ) -> tuple[dict[str, str], list[str]]:
-    import torch
-    from safetensors import safe_open
-    from safetensors.torch import save_file
@@
 def _copy_mxfp8_bf16_from_base(
     mxfp8: Path, dst: Path, mxfp8_map: dict[str, str], new_index: dict[str, str]
 ) -> None:
-    from safetensors import safe_open
-    from safetensors.torch import save_file
```

As per coding guidelines, “Keep imports at the top of the file; place imports inside functions only when necessary … with explicit justification.”

Also applies to: 217-218


Source: Coding guidelines

```python
        match = _EXPERT_TENSOR_RE.match(key)
        if match:
            layer_keys[int(match.group("L"))].append(key)
    layers = _selected_layers(layer_keys, layers_arg)
```

Contributor

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Fail fast when no routed-expert layers are selected/matched

If layers is empty, the script still continues, and _copy_mxfp8_bf16_from_base() skips all routed-expert tensors. That can produce an incomplete checkpoint missing expert weights.

Suggested fix
```diff
     layers = _selected_layers(layer_keys, layers_arg)
+    if not layers:
+        raise ValueError(
+            "No routed-expert layers were selected/matched; refusing to export a mixed checkpoint "
+            "without routed-expert tensors."
+        )
     print(f"[mixed] {len(layers)} MoE layers; experts NVFP4<-{nvfp4}, base MXFP8<-vendor")
```

Also applies to: 225-226
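Here is a self-contained sketch of the grouping loop combined with the suggested fail-fast check. The key pattern and the comma-separated `layers_arg` format are assumptions, not the script's actual definitions.

```python
import re
from collections import defaultdict

# Hypothetical key layout; the real _EXPERT_TENSOR_RE is not shown in the PR text.
EXPERT_TENSOR_RE = re.compile(r"^model\.layers\.(?P<L>\d+)\.mlp\.experts\.\d+\.")


def select_expert_layers(keys, layers_arg=None):
    """Group routed-expert tensor names by layer index; fail fast if none survive."""
    layer_keys = defaultdict(list)
    for key in keys:
        match = EXPERT_TENSOR_RE.match(key)
        if match:
            layer_keys[int(match.group("L"))].append(key)
    layers = sorted(layer_keys)
    if layers_arg:  # hypothetical comma-separated filter, e.g. "0,3"
        wanted = {int(x) for x in layers_arg.split(",")}
        layers = [layer for layer in layers if layer in wanted]
    if not layers:
        raise ValueError(
            "No routed-expert layers were selected/matched; refusing to export a "
            "mixed checkpoint without routed-expert tensors."
        )
    return layers, dict(layer_keys)


keys = [
    "model.layers.0.mlp.experts.0.w1.weight",
    "model.layers.1.mlp.experts.3.w2.weight",
    "model.layers.1.self_attn.q_proj.weight",
]
layers, grouped = select_expert_layers(keys)
print(layers)
```

Raising before any shard is written means a mistyped layer filter or a key-layout mismatch surfaces immediately, instead of producing a checkpoint that silently lacks its expert weights.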


@codecov

codecov Bot commented Jun 23, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.73%. Comparing base (c81210f) to head (ea1ea09).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
```diff
@@             Coverage Diff             @@
##             main    #1806       +/-   ##
===========================================
+ Coverage   62.89%   76.73%   +13.84%
===========================================
  Files         511      511
  Lines       56683    56683
===========================================
+ Hits        35651    43498     +7847
+ Misses      21032    13185     -7847
```
Flag Coverage Δ
examples 42.11% <100.00%> (+4.09%) ⬆️
gpu 57.87% <100.00%> (+37.30%) ⬆️
regression 14.72% <0.00%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown.


Contributor

Is it possible to make this more general, not for m3 only?

