feat(export): quant-aware reverse weight conversion for unified HF export #1833
Edwardf0t1 wants to merge 3 commits into
register_fused_experts_on_the_fly skipped fused-expert modules lacking an act_fn attribute. MiniMaxM3VLExperts (transformers 5.12.0) uses a custom GPT-OSS-style gated activation between its two F.linear calls instead of an act_fn attribute, so it was never wrapped as _QuantFusedExperts: routed experts stayed unquantized (an experts-only recipe matched nothing) and HF export failed with NotImplementedError.
_QuantFusedExperts is activation-agnostic (it only intercepts the two F.linear calls, gate_up then down), so act_fn is irrelevant to quantization, calibration, and export. Drop the requirement from _is_fused_experts_module. Enables NVFP4/FP8 PTQ + export for MiniMax-M2 / MiniMax-M3.
Verified end-to-end: experts-only NVFP4 + FP8 KV PTQ of MiniMaxAI/MiniMax-M3 detects MiniMaxM3VLExperts, quantizes all 57 MoE layers, and exports a valid HF checkpoint.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Condense the act_fn explanation in _is_fused_experts_module's docstring and remove the work-log comment block in the corresponding unit test.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
…port
ModelOpt's unified HF export builds the state dict from the in-memory
(transformers post-conversion) module names and disables transformers'
save-side revert_weight_conversion (it raises IndexError on 0-d scalar
scale tensors). As a result, when transformers applies a load-time
conversion_mapping (fused gate_up_proj, renamed MoE leaves, reordered
model/language_model prefix), the exported tensor names no longer match
the original HF hub checkpoint, breaking the unified-checkpoint contract
(observed: MiniMax-M3 NVFP4-v1 emitted fused/renamed names).
Add a quantization-aware reverse that carries each weight's companion
scale tensors (weight_scale, weight_scale_2, input_scale,
weight_scale_inv, bias) through:
- renames (key-level substitution; scale siblings follow the module path)
- output-dim un-fuse splits (split weight/weight_scale on the fused dim,
duplicate 0-d scalar weight_scale_2/input_scale to each part)
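The two primitives above can be sketched in plain Python. This is a minimal illustration, not the code in quant_aware_conversion.py: nested lists stand in for tensors, floats stand in for 0-d scalars, and the function names, the regex rule, the gate_proj/up_proj target names, and the order of the two steps are all assumptions made for the example.

```python
import re


class QuantConversionUnsupportedError(Exception):
    """Hypothetical stand-in for the error type named in this PR."""


# Suffixes split along the fused (output) dim vs. copied to every part.
SPLIT_SUFFIXES = ("weight", "weight_scale", "bias")
SCALAR_SUFFIXES = ("weight_scale_2", "input_scale")


def rename_keys(state, pattern, replacement):
    """Key-level substitution. Scale siblings share the module path,
    so they are renamed together with their weight."""
    return {re.sub(pattern, replacement, key): value for key, value in state.items()}


def unfuse_output_dim(state, fused_module, part_modules):
    """Split one fused module into equal parts along dim 0."""
    prefix = fused_module + "."
    n_parts = len(part_modules)
    out = {}
    for key, value in state.items():
        if not key.startswith(prefix):
            out[key] = value  # unrelated tensor, passed through unchanged
            continue
        suffix = key[len(prefix):]
        if suffix in SCALAR_SUFFIXES:
            # 0-d scalar scales are duplicated to each part.
            for part in part_modules:
                out[f"{part}.{suffix}"] = value
        elif suffix in SPLIT_SUFFIXES:
            rows = len(value)
            if rows % n_parts:
                raise QuantConversionUnsupportedError(
                    f"{key}: {rows} rows not divisible by {n_parts}"
                )
            step = rows // n_parts
            for i, part in enumerate(part_modules):
                out[f"{part}.{suffix}"] = value[i * step:(i + 1) * step]
        else:
            # This sketch rejects any suffix it has no rule for.
            raise QuantConversionUnsupportedError(f"no rule for {key}")
    return out


# Toy fused layer: 4 output rows, one per-row scale, two scalar scales.
fused = "model.language_model.layers.0.mlp.gate_up_proj"
state = {
    f"{fused}.weight": [[1, 2], [3, 4], [5, 6], [7, 8]],
    f"{fused}.weight_scale": [[0.5], [0.5], [0.25], [0.25]],
    f"{fused}.weight_scale_2": 0.01,
    f"{fused}.input_scale": 0.02,
}

renamed = rename_keys(state, r"^model\.language_model\.", "language_model.model.")
base = "language_model.model.layers.0.mlp"
hub = unfuse_output_dim(
    renamed, f"{base}.gate_up_proj", [f"{base}.gate_proj", f"{base}.up_proj"]
)
```

Concatenating the gate and up parts of hub reproduces the fused weight, which is the round-trip property the unit tests below check on real tensor shapes.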
Reverse rules are derived best-effort from the model's conversion mapping;
any op not yet reversible quant-aware (e.g. stacked-expert MergeModulelist)
raises QuantConversionUnsupportedError and the export falls back to the
prior in-memory-name behavior with a warning (non-breaking).
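The fallback described above follows a common try/except-and-warn shape. The sketch below only illustrates that control flow: reverse_to_hub_names, export_state_dict, the rule tuples, and the key names are hypothetical, and the real export_hf_checkpoint integration may differ.

```python
import warnings


class QuantConversionUnsupportedError(Exception):
    """Hypothetical stand-in for the error type named in this PR."""


def reverse_to_hub_names(state, rules):
    """Apply (kind, old, new) rules; only 'rename' is reversible here."""
    out = dict(state)
    for kind, old, new in rules:
        if kind != "rename":
            raise QuantConversionUnsupportedError(
                f"op {kind!r} is not reversible quant-aware"
            )
        out = {key.replace(old, new): value for key, value in out.items()}
    return out


def export_state_dict(state, rules):
    """Try the reverse conversion; on an unsupported op, warn and keep
    the in-memory names so the export still succeeds."""
    try:
        return reverse_to_hub_names(state, rules)
    except QuantConversionUnsupportedError as err:
        warnings.warn(f"{err}; exporting with in-memory names")
        return state


in_memory = {"mlp.experts.0.weight": 1, "mlp.experts.0.weight_scale": 2}

# Supported rule: keys come back under the (illustrative) hub prefix.
converted = export_state_dict(
    in_memory, [("rename", "mlp.experts.", "block_sparse_moe.experts.")]
)
```

Because the unsupported case returns the input unchanged, a model whose mapping contains a not-yet-reversible op exports exactly as it did before this change.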
Adds CPU unit tests (rename carries scales, dense gate_up un-fuse with
scale split + scalar duplication, 3-D expert + non-divisible guards,
end-to-end MiniMax-M3-like reversal). End-to-end export validation on a
real M3 quantize+export (transformers with the minimax_m3_vl mapping, GPU)
is still pending.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #1833 +/- ##
==========================================
- Coverage 77.36% 76.72% -0.65%
==========================================
Files 513 514 +1
Lines 56894 56994 +100
==========================================
- Hits 44016 43728 -288
- Misses 12878 13266 +388
What
ModelOpt's unified HF export builds the state dict from the in-memory (transformers post-conversion) module names and disables transformers' save-side `revert_weight_conversion` (it raises `IndexError` on ModelOpt's 0-d scalar scale tensors). So when transformers applies a load-time `conversion_mapping` (fused `gate_up_proj`, renamed MoE leaves, reordered `model`/`language_model` prefix), the exported tensor names no longer match the original HF hub checkpoint, breaking the unified-checkpoint contract.
Observed concretely on MiniMax-M3: `nvidia/MiniMax-M3-NVFP4-v1` emitted the converted/fused names (`model.language_model.*`, `mlp.experts.*`, fused dense `gate_up_proj`) instead of the hub names (`language_model.model.*`, `block_sparse_moe.experts.*.w{1,2,3}`).
How
New `modelopt/torch/export/quant_aware_conversion.py` performs a quantization-aware reverse conversion that carries each weight's companion scale tensors (`weight_scale`, `weight_scale_2`, `input_scale`, `weight_scale_inv`, `bias`) through two primitives:

- Renames: key-level substitution; scale siblings follow the module path.
- Output-dim un-fuse splits: split `weight`/`weight_scale`/`bias` on the fused (output) dim; duplicate the 0-d scalar `weight_scale_2`/`input_scale` to each part.

Reverse rules are derived best-effort from the model's conversion mapping. Any op not yet reversible quant-aware (notably the 3-D stacked-expert `MergeModulelist` case) raises `QuantConversionUnsupportedError`, and `export_hf_checkpoint` falls back to the prior in-memory-name behavior with a warning, so this change is non-breaking.
Tests
CPU unit tests (`tests/unit/torch/export/test_quant_aware_conversion.py`), tensor shapes/dtypes mirroring a real NVFP4 MiniMax-M3 linear:

- rename carries scales
- dense `gate_up_proj` un-fuse with scale split + scalar duplication (round-trips via concat)
- 3-D expert + non-divisible guards raise `QuantConversionUnsupportedError`
- end-to-end MiniMax-M3-like reversal

Validation status: DRAFT
End-to-end validation is still pending: a real M3 quantize + `export_hf_checkpoint` run on a box with the transformers version that carries the `minimax_m3_vl` mapping + GPU, to exercise:

- `_build_reverse_rules` against transformers' actual converter objects (API differs across versions; 4.51.3 has no M3 mapping).
- `save_pretrained(state_dict=<hub-named>)` integration (tied-weights / shard metadata) when keys no longer match `model.state_dict()`.

Follow-ups
- Fix `Chunk.convert()` for 0-d tensors, then drop the `_patch_revert_weight_conversion` no-op.

🤖 Generated with Claude Code