MiniMax-M3 mixed MXFP8-base + NVFP4-experts PTQ export#1806
MiniMax-M3 mixed MXFP8-base + NVFP4-experts PTQ export#1806chadvoegele wants to merge 4 commits into
Conversation
Extract the dataset/calib_size splitting and --cast_mxfp4_to_nvfp4 / --specdec_offline_dataset validation out of __main__ into a reusable prepare_args(), and let parse_args(argv) take an explicit argv. CLI behavior is unchanged (main(prepare_args(parse_args()))); this lets the MiniMax-M3 mixed-export wrapper drive hf_ptq in-process. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wrapper that drives examples/llm_ptq/hf_ptq.py to produce the NVFP4 routed-expert export, then merges it onto the vendor MXFP8 base into a ModelOpt MIXED_PRECISION checkpoint: non-expert tensors pass through unchanged from the vendor MXFP8 ckpt, routed experts come from the NVFP4 export (input_scale forced to 1.0 by default), and the vendor config.json is preserved with only its quantization_config replaced. Unrecognized args are forwarded to hf_ptq; the 5 it controls are rejected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Re-applies Zhiyu Cheng's act_fn fix (d9bdf292c) onto main's refactored _fused_experts_wrapper_class. main's OMNIML-5003 (#1756) added non-gated NemotronH support but kept the act_fn requirement, so act_fn-less fused experts (e.g. MiniMaxM3VLExperts, which apply a custom gated activation between the two F.linear calls) were still skipped -- routed experts stayed unquantized and HF export raised NotImplementedError. _QuantFusedExperts is activation-agnostic (it only intercepts the two F.linear calls), so drop the act_fn requirement from the detection guard. Enables NVFP4/FP8 PTQ + export for MiniMax-M2 / MiniMax-M3. Co-Authored-By: Zhiyu Cheng <zhiyuc@nvidia.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds a 0.46 New Features entry for examples/minimax_m3/hf_ptq_mixed_mxfp8_nvfp4.py (added in a047288). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughRemoves the ChangesFused-expert detection fix
MiniMax-M3 mixed MXFP8/NVFP4 checkpoint export
Sequence DiagramsequenceDiagram
participant CLI as User / CLI
participant wrapper as hf_ptq_mixed_mxfp8_nvfp4.main()
participant hf_ptq as hf_ptq.main()
participant nvfp4_copy as _copy_experts_from_nvfp4_export
participant mxfp8_copy as _copy_mxfp8_bf16_from_base
participant cfg_writer as _write_mixed_config
CLI->>wrapper: parse_args() → (args, passthrough)
wrapper->>hf_ptq: parse_args(argv) → prepare_args() → main()
Note over hf_ptq: Writes intermediate NVFP4 safetensors export
wrapper->>nvfp4_copy: read NVFP4 index, write per-layer expert shards, patch input_scale
nvfp4_copy-->>wrapper: updated weight_map + expert module names
wrapper->>mxfp8_copy: read vendor MXFP8 shards, write base shards
mxfp8_copy-->>wrapper: updated weight_map
wrapper->>cfg_writer: compute layer counts, write config.json + hf_quant_config.json
wrapper->>wrapper: write model.safetensors.index.json, copy ancillary files
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
|
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 1506-1508: The prepare_args() function warns that export_fmt is
being forced to "hf", but it does not actually modify args.export_fmt to enforce
this behavior. Add a statement after the warning that sets args.export_fmt =
"hf" to match the declared behavior in the warning message and prevent non-hf
values from propagating to main().
In `@examples/minimax_m3/hf_ptq_mixed_mxfp8_nvfp4.py`:
- Line 177: After the _selected_layers function call assigns the result to the
layers variable, add a validation check to ensure layers is not empty. If layers
is empty, raise an exception or exit the script with a clear error message
indicating that no routed-expert layers were found or matched. This prevents the
script from continuing with an incomplete checkpoint that would skip expert
weights during the _copy_mxfp8_bf16_from_base execution.
- Around line 167-170: The imports for torch, safe_open, and save_file are
currently placed inside a function (lines 167-170) instead of at the module
scope, which delays import failures until runtime and violates the repo's import
guidelines. Move these three imports to the top of the file with other
module-level imports. Additionally, address the same issue at lines 217-218
mentioned in the comment by moving any function-local imports there to the
module scope as well.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 19326279-46e4-44b2-9c40-34c1ca2e5631
📒 Files selected for processing (5)
CHANGELOG.rstexamples/llm_ptq/hf_ptq.pyexamples/minimax_m3/hf_ptq_mixed_mxfp8_nvfp4.pymodelopt/torch/quantization/plugins/huggingface.pytests/unit/torch/quantization/plugins/test_fused_experts.py
| if args.export_fmt != "hf": | ||
| warnings.warn("Deprecated. --export_fmt forced to hf.") | ||
|
|
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
prepare_args() says --export_fmt is forced, but it never actually forces it
The warning text states forced behavior, but args.export_fmt is left unchanged. That can propagate a non-hf value into main().
Suggested fix
def prepare_args(args: argparse.Namespace) -> argparse.Namespace:
"""Apply the same post-parse normalization used by the CLI entrypoint."""
if args.export_fmt != "hf":
warnings.warn("Deprecated. --export_fmt forced to hf.")
+ args.export_fmt = "hf"🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/llm_ptq/hf_ptq.py` around lines 1506 - 1508, The prepare_args()
function warns that export_fmt is being forced to "hf", but it does not actually
modify args.export_fmt to enforce this behavior. Add a statement after the
warning that sets args.export_fmt = "hf" to match the declared behavior in the
warning message and prevent non-hf values from propagating to main().
| import torch | ||
| from safetensors import safe_open | ||
| from safetensors.torch import save_file | ||
|
|
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win
Move these imports to module scope
These function-local imports delay import failures until runtime and violate the repo import rule for Python files.
Suggested fix
import argparse
import json
import re
import shutil
import sys
from collections import defaultdict
from pathlib import Path
from typing import Any
+import torch
+from safetensors import safe_open
+from safetensors.torch import save_file
+
_THIS_DIR = Path(__file__).resolve().parent
_LLM_PTQ_DIR = _THIS_DIR.parent / "llm_ptq"
@@
def _copy_experts_from_nvfp4_export(
nvfp4: Path,
dst: Path,
layers_arg: str | None,
force_input_scale_one: bool,
) -> tuple[dict[str, str], list[str]]:
- import torch
- from safetensors import safe_open
- from safetensors.torch import save_file
@@
def _copy_mxfp8_bf16_from_base(
mxfp8: Path, dst: Path, mxfp8_map: dict[str, str], new_index: dict[str, str]
) -> None:
- from safetensors import safe_open
- from safetensors.torch import save_fileAs per coding guidelines, “Keep imports at the top of the file; place imports inside functions only when necessary … with explicit justification.”
Also applies to: 217-218
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/minimax_m3/hf_ptq_mixed_mxfp8_nvfp4.py` around lines 167 - 170, The
imports for torch, safe_open, and save_file are currently placed inside a
function (lines 167-170) instead of at the module scope, which delays import
failures until runtime and violates the repo's import guidelines. Move these
three imports to the top of the file with other module-level imports.
Additionally, address the same issue at lines 217-218 mentioned in the comment
by moving any function-local imports there to the module scope as well.
Source: Coding guidelines
| match = _EXPERT_TENSOR_RE.match(key) | ||
| if match: | ||
| layer_keys[int(match.group("L"))].append(key) | ||
| layers = _selected_layers(layer_keys, layers_arg) |
There was a problem hiding this comment.
🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win
Fail fast when no routed-expert layers are selected/matched
If layers is empty, the script still continues, and _copy_mxfp8_bf16_from_base() skips all routed-expert tensors. That can produce an incomplete checkpoint missing expert weights.
Suggested fix
layers = _selected_layers(layer_keys, layers_arg)
+ if not layers:
+ raise ValueError(
+ "No routed-expert layers were selected/matched; refusing to export a mixed checkpoint "
+ "without routed-expert tensors."
+ )
print(f"[mixed] {len(layers)} MoE layers; experts NVFP4<-{nvfp4}, base MXFP8<-vendor")Also applies to: 225-226
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@examples/minimax_m3/hf_ptq_mixed_mxfp8_nvfp4.py` at line 177, After the
_selected_layers function call assigns the result to the layers variable, add a
validation check to ensure layers is not empty. If layers is empty, raise an
exception or exit the script with a clear error message indicating that no
routed-expert layers were found or matched. This prevents the script from
continuing with an incomplete checkpoint that would skip expert weights during
the _copy_mxfp8_bf16_from_base execution.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1806 +/- ##
===========================================
+ Coverage 62.89% 76.73% +13.84%
===========================================
Files 511 511
Lines 56683 56683
===========================================
+ Hits 35651 43498 +7847
+ Misses 21032 13185 -7847
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Is it possible to make this more general, not for m3 only?
What does this PR do?
Type of change: New example + bug fix
Adds an
hf_ptq.py-driven pipeline to export MiniMax-M3 as a mixed-precision checkpoint (vendor MXFP8 base + NVFP4 routed experts), plus the detection fix needed to quantize MiniMax-M3's fused experts at all.Commits
prepare_args()— extract the dataset/calib_size splitting and--cast_mxfp4_to_nvfp4/--specdec_offline_datasetvalidation out of__main__into a reusableprepare_args(), and letparse_args(argv)take an explicit argv. CLI behavior unchanged (main(prepare_args(parse_args()))); lets the wrapper drivehf_ptqin-process.examples/minimax_m3/hf_ptq_mixed_mxfp8_nvfp4.py. Drivesexamples/llm_ptq/hf_ptq.pyto quantize routed experts to NVFP4 from the BF16 source, then merges them onto the vendor MXFP8 base (non-expert tensors pass through unchanged) into a ModelOptMIXED_PRECISIONHF checkpoint. The vendorconfig.jsonis preserved with only itsquantization_configreplaced; routed-expertinput_scaleis forced to 1.0 by default. Unrecognized args are forwarded tohf_ptq.py; the 5 it controls are rejected._fused_experts_wrapper_classno longer requires anact_fnattribute. Modules likeMiniMaxM3VLExpertsapply a custom gated activation between the twoF.linearcalls instead of exposingact_fn, so they were silently skipped (routed experts left unquantized; HF export raisedNotImplementedError)._QuantFusedExpertsis activation-agnostic, so the requirement was unnecessary. Orthogonal to the non-gated NemotronH support added in [OMNIML-5003] Support non-gated fused MoE experts (NemotronH) in HF PTQ #1756. Original fix by @zhiyuc.Testing
tests/unit/torch/quantization/plugins/test_fused_experts.py— 43 passed (incl. the flippedtest_module_missing_act_fn_still_detected).🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Bug Fixes
act_fnattribute, enabling proper quantization and export of modules with internally-applied gated activations.