feat(export): quant-aware reverse weight conversion for unified HF export by Edwardf0t1 · Pull Request #1833 · NVIDIA/Model-Optimizer

Edwardf0t1 · 2026-06-26T00:20:13Z

What

ModelOpt's unified HF export builds the state dict from the in-memory (transformers post-conversion) module names and disables transformers' save-side revert_weight_conversion (it raises IndexError on ModelOpt's 0-d scalar scale tensors). So when transformers applies a load-time conversion_mapping (fused gate_up_proj, renamed MoE leaves, reordered model/language_model prefix), the exported tensor names no longer match the original HF hub checkpoint, breaking the unified-checkpoint contract.

Observed concretely on MiniMax-M3: nvidia/MiniMax-M3-NVFP4-v1 emitted the converted/fused names (model.language_model.*, mlp.experts.*, fused dense gate_up_proj) instead of the hub names (language_model.model.*, block_sparse_moe.experts.*.w{1,2,3}).

How

New modelopt/torch/export/quant_aware_conversion.py performs a quantization-aware reverse conversion that carries each weight's companion scale tensors (weight_scale, weight_scale_2, input_scale, weight_scale_inv, bias) through two primitives:

Rename — key-level substitution; scale siblings follow the module path automatically.
Output-dim un-fuse split — split weight/weight_scale/bias on the fused (output) dim; duplicate the 0-d scalar weight_scale_2/input_scale to each part.
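
The two primitives can be sketched in a few lines. This is an illustration only, not the code in quant_aware_conversion.py: plain Python lists stand in for tensors (a list of rows is a 2-D tensor, a float is a 0-d scalar), the helper names rename_keys and unfuse_output_dim are hypothetical, and the gate_proj/up_proj target names are assumed for the example. weight_scale_inv is not handled here.

```python
# Stand-ins: a "tensor" is a list of rows; a 0-d scalar is a plain float.
SCALAR_SUFFIXES = ("weight_scale_2", "input_scale")  # duplicated to every part
SPLIT_SUFFIXES = ("weight", "weight_scale", "bias")  # split on the output dim


def rename_keys(state_dict, old, new):
    """Key-level substitution. Scale siblings share the module path, so they follow."""
    return {key.replace(old, new): value for key, value in state_dict.items()}


def unfuse_output_dim(state_dict, fused_module, part_modules):
    """Split one fused module into equal parts along the output (first) dim."""
    out = {}
    prefix = fused_module + "."
    n_parts = len(part_modules)
    for key, value in state_dict.items():
        if not key.startswith(prefix):
            out[key] = value  # unrelated entry, passed through untouched
            continue
        suffix = key[len(prefix):]
        if suffix in SCALAR_SUFFIXES:
            for part in part_modules:
                out[f"{part}.{suffix}"] = value
        elif suffix in SPLIT_SUFFIXES:
            if len(value) % n_parts != 0:
                raise ValueError(f"{key}: {len(value)} rows not divisible by {n_parts}")
            step = len(value) // n_parts
            for i, part in enumerate(part_modules):
                out[f"{part}.{suffix}"] = value[i * step:(i + 1) * step]
        else:
            raise ValueError(f"{key}: suffix not handled by this sketch")
    return out


fused = "model.language_model.layers.0.mlp.gate_up_proj"
state = {
    f"{fused}.weight": [[1, 2], [3, 4], [5, 6], [7, 8]],
    f"{fused}.weight_scale": [[0.5], [0.5], [0.25], [0.25]],
    f"{fused}.weight_scale_2": 0.125,
}

# Step 1: reorder the prefix back to the hub layout.
renamed = rename_keys(state, "model.language_model.", "language_model.model.")

# Step 2: un-fuse gate_up_proj into two parts (target names are assumed).
base = "language_model.model.layers.0.mlp"
hub = unfuse_output_dim(
    renamed, f"{base}.gate_up_proj", [f"{base}.gate_proj", f"{base}.up_proj"]
)
```

Concatenating the two weight parts in order reproduces the fused weight, which is the round-trip property the unit tests check with real tensors.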

Reverse rules are derived best-effort from the model's conversion mapping. Any op not yet reversible quant-aware (notably the 3-D stacked-expert MergeModulelist case) raises QuantConversionUnsupportedError, and export_hf_checkpoint falls back to the prior in-memory-name behavior with a warning — so this change is non-breaking.
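
The fallback described above follows a simple try/except shape. The sketch below is a stand-alone illustration under stated assumptions: QuantConversionUnsupportedError is redefined locally, and reverse_convert, export_state_dict, and the dict-based rule format are hypothetical names, not the signatures used in export_hf_checkpoint.

```python
import warnings


class QuantConversionUnsupportedError(Exception):
    """Local stand-in for the exception named in this PR."""


def reverse_convert(state_dict, rules):
    """Hypothetical reverse pass: only renames are supported in this sketch."""
    out = dict(state_dict)  # work on a copy so a failure leaves the input intact
    for rule in rules:
        if rule["op"] == "rename":
            out = {k.replace(rule["old"], rule["new"]): v for k, v in out.items()}
        else:
            raise QuantConversionUnsupportedError(f"cannot reverse op {rule['op']!r}")
    return out


def export_state_dict(state_dict, rules):
    """Return hub-named keys when possible, else the in-memory names with a warning."""
    try:
        return reverse_convert(state_dict, rules)
    except QuantConversionUnsupportedError as err:
        warnings.warn(f"{err}; exporting in-memory names instead", stacklevel=2)
        return state_dict


sd = {"model.language_model.layers.0.mlp.down_proj.weight": [[1.0]]}
rename_rule = {"op": "rename", "old": "model.language_model.", "new": "language_model.model."}
exported = export_state_dict(sd, [rename_rule])
```

Because the reverse pass builds a new dict and the fallback returns the untouched input, an unsupported op cannot leave the exported state dict half-renamed.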

Tests

CPU unit tests (tests/unit/torch/export/test_quant_aware_conversion.py), tensor shapes/dtypes mirroring a real NVFP4 MiniMax-M3 linear:

rename carries scale siblings
dense gate_up_proj un-fuse with scale split + scalar duplication (round-trips via concat)
3-D stacked-expert guard → QuantConversionUnsupportedError
non-divisible split guard
end-to-end MiniMax-M3-like reversal (post-conversion names → hub names)

Validation status — DRAFT

✅ Engine unit-tested on CPU; ruff clean.
⚠️ Not yet validated end-to-end. Needs a real M3 quantize → export_hf_checkpoint on a box with the transformers version that carries the minimax_m3_vl mapping + GPU, to exercise:
1. _build_reverse_rules against transformers' actual converter objects (API differs across versions; 4.51.3 has no M3 mapping).
2. save_pretrained(state_dict=<hub-named>) integration (tied-weights / shard metadata) when keys no longer match model.state_dict().
3. Re-load of the exported checkpoint in vLLM/TRT-LLM.

Follow-ups

3-D stacked-expert un-stack + un-fuse with scale layout (currently falls back).
Upstream fix to transformers Chunk.convert() for 0-d tensors, then drop the _patch_revert_weight_conversion no-op.

🤖 Generated with Claude Code

register_fused_experts_on_the_fly skipped fused-expert modules lacking an act_fn attribute. MiniMaxM3VLExperts (transformers 5.12.0) uses a custom GPT-OSS-style gated activation between its two F.linear calls instead of an act_fn attribute, so it was never wrapped as _QuantFusedExperts: routed experts stayed unquantized (an experts-only recipe matched nothing) and HF export failed with NotImplementedError.

_QuantFusedExperts is activation-agnostic (it only intercepts the two F.linear calls, gate_up then down), so act_fn is irrelevant to quantization, calibration, and export. Drop the requirement from _is_fused_experts_module. Enables NVFP4/FP8 PTQ + export for MiniMax-M2 / MiniMax-M3.

Verified end-to-end: experts-only NVFP4 + FP8 KV PTQ of MiniMaxAI/MiniMax-M3 detects MiniMaxM3VLExperts, quantizes all 57 MoE layers, and exports a valid HF checkpoint.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

Condense the act_fn explanation in _is_fused_experts_module's docstring and remove the work-log comment block in the corresponding unit test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

…port

ModelOpt's unified HF export builds the state dict from the in-memory (transformers post-conversion) module names and disables transformers' save-side revert_weight_conversion (it raises IndexError on 0-d scalar scale tensors). As a result, when transformers applies a load-time conversion_mapping (fused gate_up_proj, renamed MoE leaves, reordered model/language_model prefix), the exported tensor names no longer match the original HF hub checkpoint, breaking the unified-checkpoint contract (observed: MiniMax-M3 NVFP4-v1 emitted fused/renamed names).

Add a quantization-aware reverse that carries each weight's companion scale tensors (weight_scale, weight_scale_2, input_scale, weight_scale_inv, bias) through:
- renames (key-level substitution; scale siblings follow the module path)
- output-dim un-fuse splits (split weight/weight_scale on the fused dim, duplicate 0-d scalar weight_scale_2/input_scale to each part)

Reverse rules are derived best-effort from the model's conversion mapping; any op not yet reversible quant-aware (e.g. stacked-expert MergeModulelist) raises QuantConversionUnsupportedError and the export falls back to the prior in-memory-name behavior with a warning (non-breaking).

Adds CPU unit tests (rename carries scales, dense gate_up un-fuse with scale split + scalar duplication, 3-D expert + non-divisible guards, end-to-end MiniMax-M3-like reversal). End-to-end export validation on a real M3 quantize+export (transformers with the minimax_m3_vl mapping, GPU) is still pending.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>

copy-pr-bot · 2026-06-26T00:20:16Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


coderabbitai · 2026-06-26T00:20:21Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


github-actions · 2026-06-26T00:24:27Z

PR Preview Action v1.8.1
🚀 View preview at https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1833/
Built to branch `gh-pages` at 2026-06-26 00:24 UTC. Preview will be ready when the GitHub Pages deployment is complete.

codecov · 2026-06-26T00:31:10Z

Codecov Report

❌ Patch coverage is 56.43564% with 44 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.72%. Comparing base (1c6bdb3) to head (e110c15).
⚠️ Report is 2 commits behind head on main.

Files with missing lines	Patch %	Lines
modelopt/torch/export/quant_aware_conversion.py	58.51%	39 Missing ⚠️
modelopt/torch/export/unified_export_hf.py	16.66%	5 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1833      +/-   ##
==========================================
- Coverage   77.36%   76.72%   -0.65%     
==========================================
  Files         513      514       +1     
  Lines       56894    56994     +100     
==========================================
- Hits        44016    43728     -288     
- Misses      12878    13266     +388

Flag	Coverage Δ
unit	`54.64% <56.43%> (+0.01%)`	⬆️


Edwardf0t1 and others added 3 commits June 25, 2026 19:36

Edwardf0t1 wants to merge 3 commits into main from feat/unified-export-quant-aware-revert
