feat(export): quant-aware reverse weight conversion for unified HF export#1833

Draft
Edwardf0t1 wants to merge 3 commits into main from feat/unified-export-quant-aware-revert

@Edwardf0t1

Contributor

What

ModelOpt's unified HF export builds the state dict from the in-memory (transformers post-conversion) module names and disables transformers' save-side revert_weight_conversion, because it raises IndexError on ModelOpt's 0-d scalar scale tensors. As a result, when transformers applies a load-time conversion_mapping (fused gate_up_proj, renamed MoE leaves, reordered model/language_model prefix), the exported tensor names no longer match the original HF hub checkpoint, which breaks the unified-checkpoint contract.

Observed concretely on MiniMax-M3: nvidia/MiniMax-M3-NVFP4-v1 emitted the converted/fused names (model.language_model.*, mlp.experts.*, fused dense gate_up_proj) instead of the hub names (language_model.model.*, block_sparse_moe.experts.*.w{1,2,3}).
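The mismatch can be made concrete with a small contract check. This is an illustrative sketch only: `diff_key_sets` is a hypothetical helper (not part of ModelOpt), and the full tensor names are made up from the prefixes quoted above rather than read from a real checkpoint.

```python
def diff_key_sets(exported_keys, hub_keys):
    """Return (missing, unexpected) tensor names relative to the hub checkpoint."""
    exported, hub = set(exported_keys), set(hub_keys)
    return sorted(hub - exported), sorted(exported - hub)


# Hypothetical names assembled from the prefixes quoted in this description.
hub_keys = [
    "language_model.model.layers.0.block_sparse_moe.experts.0.w1.weight",
    "language_model.model.layers.0.block_sparse_moe.experts.0.w2.weight",
    "language_model.model.layers.0.block_sparse_moe.experts.0.w3.weight",
]
exported_keys = [
    "model.language_model.layers.0.mlp.experts.gate_up_proj",
    "model.language_model.layers.0.mlp.experts.down_proj",
]

missing, unexpected = diff_key_sets(exported_keys, hub_keys)
print("missing from export:", missing)
print("not in hub checkpoint:", unexpected)
```

An export that honors the unified-checkpoint contract would leave both lists empty.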

How

A new module, modelopt/torch/export/quant_aware_conversion.py, performs a quantization-aware reverse conversion that carries each weight's companion tensors (the scales weight_scale, weight_scale_2, input_scale, and weight_scale_inv, plus bias) through two primitives:

  • Rename — key-level substitution; scale siblings follow the module path automatically.
  • Output-dim un-fuse split — split weight/weight_scale/bias on the fused (output) dim; duplicate the 0-d scalar weight_scale_2/input_scale to each part.
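The two primitives can be sketched in plain Python. This is a minimal illustration, not the code in quant_aware_conversion.py: tensors are modeled as nested lists (rows are the output dim) and 0-d scalars as floats, the function names are hypothetical, and the gate-then-up ordering and the gate_proj/up_proj target names are assumptions.

```python
from typing import Any


def rename_keys(state: dict[str, Any], old: str, new: str) -> dict[str, Any]:
    """Key-level substitution; siblings under the same module path follow along."""
    return {key.replace(old, new): value for key, value in state.items()}


def unfuse_output_dim(state: dict[str, Any], fused: str, parts: list[str]) -> dict[str, Any]:
    """Split every tensor under module `fused` into len(parts) modules along dim 0."""
    prefix = fused + "."
    n = len(parts)
    out: dict[str, Any] = {}
    for key, value in state.items():
        if not key.startswith(prefix):
            out[key] = value  # unrelated tensor, pass through
            continue
        suffix = key[len(prefix):]
        if isinstance(value, list):
            # Has an output dim (weight, weight_scale, bias): split rows evenly.
            if len(value) % n != 0:
                raise ValueError(f"{key}: {len(value)} rows not divisible by {n}")
            step = len(value) // n
            for i, part in enumerate(parts):
                out[f"{part}.{suffix}"] = value[i * step:(i + 1) * step]
        else:
            # 0-d scalar (weight_scale_2, input_scale): duplicate to each part.
            for part in parts:
                out[f"{part}.{suffix}"] = value
    return out


fused_state = {
    "model.language_model.mlp.gate_up_proj.weight": [[1, 2], [3, 4], [5, 6], [7, 8]],
    "model.language_model.mlp.gate_up_proj.weight_scale": [[0.1], [0.2], [0.3], [0.4]],
    "model.language_model.mlp.gate_up_proj.weight_scale_2": 0.5,
}
renamed = rename_keys(fused_state, "model.language_model.", "language_model.model.")
reverted = unfuse_output_dim(
    renamed,
    "language_model.model.mlp.gate_up_proj",
    ["language_model.model.mlp.gate_proj", "language_model.model.mlp.up_proj"],
)
```

Concatenating the gate and up parts along the output dim recovers the fused tensor, which is the round-trip property the unit tests check.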

Reverse rules are derived best-effort from the model's conversion mapping. Any op that cannot yet be reversed in a quantization-aware way (notably the 3-D stacked-expert MergeModulelist case) raises QuantConversionUnsupportedError, and export_hf_checkpoint then falls back to the prior in-memory-name behavior with a warning, so this change is non-breaking.
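The fallback behavior follows a common try/except pattern, sketched below. The exception class here is a local stand-in with the same name as the one in this PR, and `reverse_or_fall_back` and `reject_stacked_experts` are hypothetical helpers; the real export_hf_checkpoint internals are not shown in this description.

```python
import warnings


class QuantConversionUnsupportedError(Exception):
    """Local stand-in for the exception named in this PR."""


def reverse_or_fall_back(state, reverse_fn):
    """Return hub-named tensors if reversible, else the in-memory names unchanged."""
    try:
        return reverse_fn(state)
    except QuantConversionUnsupportedError as err:
        warnings.warn(
            f"Quant-aware reverse conversion unsupported ({err}); "
            "exporting with in-memory names.",
            stacklevel=2,
        )
        return state


def reject_stacked_experts(state):
    """Hypothetical reverse step that refuses the 3-D stacked-expert case."""
    raise QuantConversionUnsupportedError("3-D stacked-expert MergeModulelist")
```

Because the unsupported path returns the input state dict untouched, callers see the same output as before this change, plus a warning.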

Tests

CPU unit tests (tests/unit/torch/export/test_quant_aware_conversion.py), tensor shapes/dtypes mirroring a real NVFP4 MiniMax-M3 linear:

  • rename carries scale siblings
  • dense gate_up_proj un-fuse with scale split + scalar duplication (round-trips via concat)
  • 3-D stacked-expert guard → QuantConversionUnsupportedError
  • non-divisible split guard
  • end-to-end MiniMax-M3-like reversal (post-conversion names → hub names)

Validation status — DRAFT

  • ✅ Engine unit-tested on CPU; ruff clean.
  • ⚠️ Not yet validated end-to-end. Needs a real M3 quantize → export_hf_checkpoint run on a GPU machine whose transformers version carries the minimax_m3_vl mapping, to exercise:
    1. _build_reverse_rules against transformers' actual converter objects (API differs across versions; 4.51.3 has no M3 mapping).
    2. save_pretrained(state_dict=<hub-named>) integration (tied-weights / shard metadata) when keys no longer match model.state_dict().
    3. Re-load of the exported checkpoint in vLLM/TRT-LLM.

Follow-ups

  • 3-D stacked-expert un-stack + un-fuse with scale layout (currently falls back).
  • Upstream fix to transformers Chunk.convert() for 0-d tensors, then drop the _patch_revert_weight_conversion no-op.

🤖 Generated with Claude Code

Edwardf0t1 and others added 3 commits June 25, 2026 19:36
register_fused_experts_on_the_fly skipped fused-expert modules lacking an
act_fn attribute. MiniMaxM3VLExperts (transformers 5.12.0) uses a custom
GPT-OSS-style gated activation between its two F.linear calls instead of an
act_fn attribute, so it was never wrapped as _QuantFusedExperts: routed
experts stayed unquantized (an experts-only recipe matched nothing) and HF
export failed with NotImplementedError.

_QuantFusedExperts is activation-agnostic (it only intercepts the two
F.linear calls, gate_up then down), so act_fn is irrelevant to quantization,
calibration, and export. Drop the requirement from _is_fused_experts_module.
Enables NVFP4/FP8 PTQ + export for MiniMax-M2 / MiniMax-M3.

Verified end-to-end: experts-only NVFP4 + FP8 KV PTQ of MiniMaxAI/MiniMax-M3
detects MiniMaxM3VLExperts, quantizes all 57 MoE layers, and exports a valid
HF checkpoint.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Condense the act_fn explanation in _is_fused_experts_module's docstring
and remove the work-log comment block in the corresponding unit test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
…port

ModelOpt's unified HF export builds the state dict from the in-memory
(transformers post-conversion) module names and disables transformers'
save-side revert_weight_conversion (it raises IndexError on 0-d scalar
scale tensors). As a result, when transformers applies a load-time
conversion_mapping (fused gate_up_proj, renamed MoE leaves, reordered
model/language_model prefix), the exported tensor names no longer match
the original HF hub checkpoint, breaking the unified-checkpoint contract
(observed: MiniMax-M3 NVFP4-v1 emitted fused/renamed names).

Add a quantization-aware reverse that carries each weight's companion
scale tensors (weight_scale, weight_scale_2, input_scale,
weight_scale_inv, bias) through:
  - renames (key-level substitution; scale siblings follow the module path)
  - output-dim un-fuse splits (split weight/weight_scale on the fused dim,
    duplicate 0-d scalar weight_scale_2/input_scale to each part)
Reverse rules are derived best-effort from the model's conversion mapping;
any op not yet reversible quant-aware (e.g. stacked-expert MergeModulelist)
raises QuantConversionUnsupportedError and the export falls back to the
prior in-memory-name behavior with a warning (non-breaking).

Adds CPU unit tests (rename carries scales, dense gate_up un-fuse with
scale split + scalar duplication, 3-D expert + non-divisible guards,
end-to-end MiniMax-M3-like reversal). End-to-end export validation on a
real M3 quantize+export (transformers with the minimax_m3_vl mapping, GPU)
is still pending.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 26, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

@github-actions

Contributor
PR Preview Action v1.8.1

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1833/

Built to branch gh-pages at 2026-06-26 00:24 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@codecov

codecov Bot commented Jun 26, 2026

Codecov Report

❌ Patch coverage is 56.43564% with 44 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.72%. Comparing base (1c6bdb3) to head (e110c15).
⚠️ Report is 2 commits behind head on main.

Files with missing lines                          Patch %   Lines
modelopt/torch/export/quant_aware_conversion.py   58.51%    39 Missing ⚠️
modelopt/torch/export/unified_export_hf.py        16.66%    5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1833      +/-   ##
==========================================
- Coverage   77.36%   76.72%   -0.65%     
==========================================
  Files         513      514       +1     
  Lines       56894    56994     +100     
==========================================
- Hits        44016    43728     -288     
- Misses      12878    13266     +388     
Flag   Coverage Δ
unit   54.64% <56.43%> (+0.01%) ⬆️
