[Cherry-pick] PRs #1801 #1808 #1629 #1627 #1824 #1826 #1830 #1760 #1831 #1858 #1839 #1857 #1869 by kevalmorabia97 · Pull Request #1880 · NVIDIA/Model-Optimizer

kevalmorabia97 · 2026-07-01T21:45:50Z

Cherry-picked PRs

#1839, #1857 and #1869 were back-ported (not a clean cherry-pick): the file was
renamed llm_ptq -> hf_ptq (#1759) and surrounding get_model code diverged on
main, but the actual fix targets the init_empty_weights / from_config block that
already exists on the release branch. Accompanying unit tests were ported (15 passed).

Summary by CodeRabbit

New Features
- Added a new PTQ recipe for NVFP4 MLP/MoE quantization with FP8 KV-cache calibration.
Bug Fixes
- Improved ONNX mixed-precision/FP16 conversion reliability with stricter type handling and better stale output-shape reconciliation.
- Fixed quantization/export edge cases: MoE router/gate handling, FP8 calibration/reduction failures, and additional FP8/INT8 robustness during export.
- Standardized Puzzletron validation split naming to validation.
Documentation
- Refreshed LM-Eval and TensorRT-Edge-LLM CLI instructions, including updated command names and examples.

### What does this PR do?

Type of change: Bug fix

This is hit when NeMo-RL starts the ModelOpt Megatron policy worker for a quantized run. The worker imports `modelopt.torch.quantization` during Ray actor initialization, before training starts. ModelOpt then imports `quant_linear`, which imports `backends`; backend registration imports `nvfp4_gemm`, and the old `nvfp4_gemm` module imports `RealQuantLinear` back from the still-initializing `quant_linear` module. The result is a Python partial-initialization import cycle during worker startup.

Keep backend module import/registration independent of `quant_linear` completion. Import `RealQuantLinear` only inside the backend availability check, after `quant_linear` has finished initializing, while preserving the explicit `isinstance(module, RealQuantLinear)` check.

### Usage

```python
# Add a code snippet demonstrating how to use this
```

### Testing

### Before your PR is "*Ready for review*"

Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed (`git commit -s -S`).

Make sure you read and follow the [Security Best Practices](https://github.com/NVIDIA/Model-Optimizer/blob/main/SECURITY.md#security-coding-practices-for-contributors) (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).

- Is this change backward compatible?: ✅ / ❌ / N/A
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: ✅ / ❌ / N/A
- Did you write any new necessary tests?: ✅ / ❌ / N/A
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅ / ❌ / N/A
- Did you get Claude approval on this PR?: ✅ / ❌ / N/A

### Additional Information

## Summary by CodeRabbit

* **Refactor**
  * Optimized initialization efficiency in FP8 and NVFP4 quantization backends by refining import timing mechanisms.
  * Enhanced availability check logic to defer certain imports, reducing initialization overhead without affecting quantization functionality.
  * No changes to public APIs, feature behavior, or mathematical operations performed during quantization.

---------

Signed-off-by: Meng Xin <mxin@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
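The cycle and its fix can be reproduced outside ModelOpt with a toy package. The following is a minimal sketch, not ModelOpt's code: the package names `pkg_eager` / `pkg_lazy` and the function `is_available` are hypothetical stand-ins, and only the module roles (`quant_linear`, a backend module, `RealQuantLinear`) mirror the description above.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path

# quant_linear imports the backend while it is still initializing itself.
QUANT_LINEAR_SRC = """
    from . import backend

    class RealQuantLinear:
        pass
"""

# Module-scope import: executes before RealQuantLinear has been defined.
EAGER_BACKEND_SRC = """
    from .quant_linear import RealQuantLinear

    def is_available(module):
        return isinstance(module, RealQuantLinear)
"""

# Deferred import: executes only when the check is called,
# by which time quant_linear has finished initializing.
LAZY_BACKEND_SRC = """
    def is_available(module):
        from .quant_linear import RealQuantLinear
        return isinstance(module, RealQuantLinear)
"""


def write_package(root: Path, name: str, backend_src: str) -> None:
    """Create a throwaway package with quant_linear.py and backend.py."""
    pkg = root / name
    pkg.mkdir()
    (pkg / "__init__.py").write_text("")
    (pkg / "quant_linear.py").write_text(textwrap.dedent(QUANT_LINEAR_SRC))
    (pkg / "backend.py").write_text(textwrap.dedent(backend_src))


def try_import(name: str) -> str:
    """Import <name>.quant_linear and run the availability check once."""
    try:
        quant_linear = importlib.import_module(f"{name}.quant_linear")
    except ImportError:
        return "ImportError"
    instance = quant_linear.RealQuantLinear()
    return "ok" if quant_linear.backend.is_available(instance) else "check failed"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    write_package(root, "pkg_eager", EAGER_BACKEND_SRC)
    write_package(root, "pkg_lazy", LAZY_BACKEND_SRC)
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        results = {name: try_import(name) for name in ("pkg_eager", "pkg_lazy")}
    finally:
        sys.path.remove(tmp)

print(results)
```

The eager variant fails because the backend asks for a name that the partially initialized `quant_linear` module has not bound yet; the lazy variant keeps the same `isinstance` check but resolves the class at call time.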

…xample (#1808)

### What does this PR do?

Type of change: documentation

TensorRT-Edge-LLM v0.8.0 consolidated its CLI entry points, leaving the example commands in `examples/torch_onnx/README.md` referencing tools that no longer exist (e.g. `tensorrt-edgellm-export-visual`). This updates the README to the current interface:

- `tensorrt-edgellm-quantize-llm` / `tensorrt-edgellm-quantize-draft` → `tensorrt-edgellm-quantize {llm,draft}` (subcommands)
- `tensorrt-edgellm-export-llm` / `-export-visual` / `-export-draft` → unified `tensorrt-edgellm-export` with positional `model` / `output_dir` args and automatic VLM/audio component detection
- `--is_eagle_base` → `--eagle-base`
- Updated the CLI Tools table and the LLM / VLM / EAGLE examples accordingly

### Usage

N/A (documentation change).

### Testing

Verified against the live `main` branch of TensorRT-Edge-LLM by running the actual entry-point code (`python -m tensorrt_edgellm.scripts.quantize/export`):

- `--help` runs cleanly for `quantize`, `quantize llm`, `quantize draft`, and `export`; all documented flags (`--model_dir`, `--output_dir`, `--quantization`, `--base_model_dir`, `--draft_model_dir`, positional `model`/`output_dir`, `--eagle-base`) are present.
- Drove the parser with the exact README commands: they parse and advance into the real quantize/export logic.
- Confirmed the old names are gone: `quantize-llm` subcommand rejected, `--is_eagle_base` rejected, `scripts.export_visual` module not found.

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: N/A (documentation only)
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: N/A (documentation only)
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A (minor docs change)

> 🤖 _Generated by Claude (AI agent)._

## Summary by CodeRabbit

* **Documentation**
  * Updated TensorRT-Edge-LLM CLI documentation to reflect consolidated command structure
  * Updated command examples for LLM, VLM, and EAGLE speculative decoding workflows
  * Documented new unified CLI interfaces with updated subcommands and flags

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

### What does this PR do?

Uses `TorchDistLoadShardedStrategy` instead of the deprecated `get_default_load_sharded_strategy`.

## Summary by CodeRabbit

## Release Notes

* **Bug Fixes**
  * Improved distributed checkpoint loading strategy for enhanced reliability and consistency when loading sharded checkpoint extra-state data.

Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

…eakly-typed models (#1627)

Type of change: Bug fix

ONNX INT8 + FP16 quantization (`--quantize_mode int8 --high_precision_dtype fp16`) crashed with a `ShapeInferenceError` on weakly-typed models (e.g. TensorFlow exports). Two root causes, both fixed in `modelopt/onnx/utils.py`:

- **Stale rank-0 output shapes.** Such models can declare a `graph.output` rank that conflicts with the graph topology, typically a leftover rank-0 (scalar) annotation on a tensor that is really rank-2+. A stale rank-0 passes `onnx.checker` (a scalar is valid) but poisons downstream shape inference: ORT fails while augmenting the model for INT8 calibration (`axis must be in [-rank, rank-1]. Input rank was 0`), and `onnx.shape_inference.infer_shapes(..., strict_mode=True)` raises `Inferred shape and existing shape differ in rank` during FP16 autocast. `clear_stale_value_info` now reconciles stale output shapes: it clears and re-derives them from the operator graph via ORT symbolic shape inference (falling back to the size-aware `infer_shapes` wrapper) and adopts the inferred shape. A graph output is never left without a shape field (which `onnx.checker` requires); an output whose shape cannot be re-derived keeps its original declaration.
- **Ops ONNX static shape inference can't resolve.** The same models can contain ops (e.g. `TopK`) that ONNX's static shape inference gives up on, leaving downstream tensors untyped and breaking AutoCast's type lookups. `infer_types` now falls back to the schema-based standalone type inferencer when ONNX shape inference raises or leaves tensors untyped, running the fallback on the shape-inferred model so any shapes ONNX did derive are preserved.

Healthy models are unaffected: re-inference reproduces their existing output shapes, and a fully typed graph skips the fallback.

```bash
python -m modelopt.onnx.quantization \
    --quantize_mode int8 --high_precision_dtype fp16 \
    --onnx_path model.onnx --output_path model_int8_fp16.onnx
```

- Added CPU-only unit tests in `tests/unit/onnx/test_onnx_utils.py`: stale rank-0 output reconciliation, preservation of a valid output shape, and the type-inference fallback on a `TopK`-overflow model. Run: `CUDA_VISIBLE_DEVICES="" pytest tests/unit/onnx/test_onnx_utils.py`.
- Verified end-to-end on a weakly-typed object-detection model that previously failed: the command now completes and produces a valid quantized FP16 model (`onnx.checker` passes, ORT loads it, QuantizeLinear/DequantizeLinear nodes inserted, FP16 initializers present, all graph-output ranks correct).
- Full `tests/unit/onnx` suite: no new failures versus the base branch.

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅

* **Bug Fixes**
  * Improved mixed-precision/FP16 conversions to prevent stale or missing shape/type metadata. Output tensor rank/shape declarations are reconciled after clearing stale info, and stricter ONNX shape/type inference now falls back to standalone type inference when certain ops can’t be resolved.
* **Tests**
  * Expanded ONNX utility and autocast test coverage for fixing rank-0 output shapes, preserving valid static and dynamic `dim_param` output declarations, and verifying strict-mode fallback for unresolved `TopK` during FP16 conversion.

---------

Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
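The reconciliation rule described above (adopt the re-derived shape, but never leave a graph output without a shape) can be illustrated without ONNX. This is a toy sketch over plain dictionaries, not the `clear_stale_value_info` implementation; the function `reconcile_output_shapes` and the tensor names are hypothetical, and `None` stands in for "shape inference could not derive this output".

```python
def reconcile_output_shapes(declared: dict, inferred: dict) -> dict:
    """Prefer the re-derived shape; fall back to the original declaration."""
    reconciled = {}
    for name, declared_shape in declared.items():
        new_shape = inferred.get(name)
        # An output whose shape cannot be re-derived keeps its declaration,
        # so every output still carries a shape afterwards.
        reconciled[name] = declared_shape if new_shape is None else new_shape
    return reconciled


# Shapes are tuples; () is rank-0 (scalar), strings are symbolic dimensions.
declared = {
    "boxes": (),                # stale rank-0 annotation
    "scores": ("batch", 100),   # healthy dynamic declaration
    "labels": (1, 100),         # cannot be re-derived in this toy run
}
inferred = {
    "boxes": (1, 100, 4),
    "scores": ("batch", 100),
    "labels": None,
}

result = reconcile_output_shapes(declared, inferred)
print(result)
```

In this toy run the stale scalar on `boxes` is replaced, the healthy `scores` declaration is reproduced unchanged, and `labels` keeps its original shape.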

…1824)

### What does this PR do?

Type of change: Bug fix

Fixes [NVBug 6360175](https://nvbugspro.nvidia.com/bug/6360175) / OMNIML-5265: quantizing a model whose weights are stored natively in FP8 (e.g. DeepSeek-V3 in `float8_e4m3fn`) crashes during `mtq.quantize` calibration with:

```
  File ".../modelopt/torch/quantization/utils/core_utils.py", line 162, in reduce_amax
    max_val = torch.max(input)
NotImplementedError: "max_all_cuda" not implemented for 'Float8_e4m3fn'
```

**Root cause:** FP8 dtypes (`float8_e4m3fn` / `float8_e5m2`) implement no full-tensor reduction kernel (`max_all_cuda`/`min_all_cuda`), nor `amax`/`amin`, `abs`, or elementwise `maximum`. `reduce_amax` called these directly on the FP8 weight tensor.

**Fix:** Upcast FP8 inputs to the default float dtype (`torch.get_default_dtype()`) at the top of `reduce_amax`, before any reduction. The upcast is **lossless** (any default float dtype represents every FP8 value exactly) and only affects the FP8 path; the common (fp16/bf16/fp32) path is untouched. Placing the upcast at the top covers all branches (`torch.max`/`min`, `torch.amax`/`amin`, `torch.abs`), not just the line in the traceback.

### Usage

No API change. Quantization of natively-FP8 checkpoints (e.g. DeepSeek-V3 NVFP4 PTQ) now runs through calibration instead of raising.

### Testing

- New CPU regression test `test_reduce_amax_fp8` in `tests/unit/torch/quantization/test_utils.py` covering both FP8 dtypes (`float8_e4m3fn`, `float8_e5m2`) across all axis modes (`None`, `0`, `1`, `(0, 1)`); asserts results equal the float reference and the output dtype is the default float dtype. CPU reproduces the original error (no FP8 reduction kernel there either), so the test is GPU-free.
- `pre-commit run --files ...` passes (ruff, mypy, bandit, license, rst checks).

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update Changelog?: ✅ (0.45 Bug Fixes)
- Did you get Claude approval on this PR?: ❌ (not yet)

### Additional Information

NVBug 6360175 is tagged `Committed_ModelOpt_0.45.0` (regression); the changelog entry is under 0.45 and this will be cherry-picked to `release/0.45` after merge.

Supersedes #1823, which got a stuck head ref (frozen at the original commit, no sync on force-push) after the repo move `TensorRT-Model-Optimizer` → `Model-Optimizer`; it could not be re-synced or reopened, so this PR replaces it from the same branch.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
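The losslessness claim can be checked by brute force, since each FP8 format has only 256 bit patterns. The sketch below is a standard-library illustration rather than the ModelOpt change itself: it decodes every finite `float8_e4m3fn` and `float8_e5m2` value by hand (the helper names are hypothetical) and confirms that each one survives a round trip through IEEE half precision. It checks fp16 only; bf16 and fp32 are not exercised here.

```python
import struct


def decode_fp8(byte: int, exp_bits: int, man_bits: int, finite_only: bool):
    """Decode one FP8 bit pattern to a float; return None for NaN/Inf encodings."""
    sign = -1.0 if byte >> 7 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    bias = (1 << (exp_bits - 1)) - 1  # 7 for e4m3, 15 for e5m2
    if exp == (1 << exp_bits) - 1:
        if not finite_only:
            return None  # e5m2: all-ones exponent encodes Inf or NaN
        if man == (1 << man_bits) - 1:
            return None  # e4m3fn: only the all-ones mantissa is NaN
    if exp == 0:  # subnormal: no implicit leading one
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def finite_values(exp_bits: int, man_bits: int, finite_only: bool) -> list:
    decoded = (decode_fp8(b, exp_bits, man_bits, finite_only) for b in range(256))
    return [v for v in decoded if v is not None]


def fp16_roundtrip(value: float) -> float:
    """Store the value as IEEE binary16 and read it back."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


e4m3fn = finite_values(4, 3, finite_only=True)
e5m2 = finite_values(5, 2, finite_only=False)

lossless_in_fp16 = all(fp16_roundtrip(v) == v for v in e4m3fn + e5m2)
amax_e4m3fn = max(abs(v) for v in e4m3fn)  # reduction done on upcast values
amax_e5m2 = max(abs(v) for v in e5m2)

print(lossless_in_fp16, amax_e4m3fn, amax_e5m2)
```

Because the upcast changes no value, an absolute-maximum reduction computed after it matches what a native FP8 reduction would have returned.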

…tion (#1826)

## What does this PR do?

**Type of change:** Bug fix

**Overview:** Adds `*vision_model*` and `*multi_modal_projector*` to the `default_disabled_quantizers` PTQ unit so the Llama-4 vision branch stays in BF16 by default.

## Details

The vision-exclusion patterns in `modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml` (`*embed_vision*` / `*vision_tower*` / `*visual*`) cover gemma / Qwen-VL / Kimi naming but miss Llama-4, whose encoder is named `vision_model` and whose projector is `multi_modal_projector`. As a result, `general/ptq/nvfp4_default-kv_fp8` (and any recipe importing this unit) quantized the Llama-4-Scout vision tower. Export then crashed on `vision_model.patch_embedding.linear`, whose `in_features=588` (3×14×14) is not divisible by the NVFP4 block size:

```
AssertionError: Weight shape is not divisible for block size for block quantization.
Failed to export module 'vision_model.patch_embedding.linear' (type=QuantLinear)
```

Adding the two patterns keeps the Llama-4 vision branch in BF16 by default, matching the existing behavior for other VL models (gemma-4, Qwen3.5-VL, Kimi).

Fixes NVBugs 6359097.

## Testing

- YAML validates and pre-commit (incl. `validate modelopt recipes`) passes.
- The `-novit` recipes import this unit and inherit the fix automatically.

## Before your PR is "*Ready for review*"

- [x] Make sure you read and follow [Contributor guidelines](https://github.com/NVIDIA/Model-Optimizer/blob/main/CONTRIBUTING.md) and your commits are signed.
- [x] Is this change backward compatible? **Yes**: only expands the default disable list to additional vision-branch module names.
- [x] Did you write any new necessary tests? **N/A**: config-only change.
- [x] Did you add or update any necessary documentation? **Yes**: updated the inline rationale comment with the Llama-4 naming and NVBug reference.
- [x] Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)? **N/A**: recipe-config bug fix.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **Bug Fixes**
  * Expanded the default list of vision-related components that stay unquantized, improving support for multimodal models.
  * Added broader exclusions for additional vision and projector patterns to better preserve model quality by default.

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
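The pattern gap and the export failure can both be checked in a few lines. This is a rough sketch under stated assumptions: it treats the patterns as shell-style wildcards matched with Python's `fnmatch`, the full quantizer names are hypothetical (only the `vision_model` / `multi_modal_projector` prefixes come from the description above), and the block size of 16 is an assumption taken from the NVFP4 group size implied elsewhere in this bundle.

```python
from fnmatch import fnmatchcase

OLD_PATTERNS = ["*embed_vision*", "*vision_tower*", "*visual*"]
NEW_PATTERNS = OLD_PATTERNS + ["*vision_model*", "*multi_modal_projector*"]

# Hypothetical quantizer names for illustration only.
QUANTIZER_NAMES = [
    "vision_model.patch_embedding.linear.weight_quantizer",
    "multi_modal_projector.linear_1.weight_quantizer",
    "language_model.model.layers.0.mlp.down_proj.weight_quantizer",
]


def disabled_by(patterns: list, name: str) -> bool:
    """True if any disable pattern matches the quantizer name."""
    return any(fnmatchcase(name, pattern) for pattern in patterns)


old = {name: disabled_by(OLD_PATTERNS, name) for name in QUANTIZER_NAMES}
new = {name: disabled_by(NEW_PATTERNS, name) for name in QUANTIZER_NAMES}

BLOCK_SIZE = 16  # assumed NVFP4 block size
in_features = 3 * 14 * 14  # patch embedding input width from the description
remainder = in_features % BLOCK_SIZE

print(old)
print(new)
print(in_features, remainder)
```

With the old list neither Llama-4 vision name is excluded, which is how the 588-wide patch embedding reached block quantization; with the two added patterns both are excluded and the language-model quantizer is still left alone.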

#1830) …eparation script

### What does this PR do?

Type of change: ?

## Summary by CodeRabbit

* **Bug Fixes**
  * Standardized the validation dataset name from `valid` to `validation` across multiple example configurations and dataset preparation outputs.
  * Improved compatibility for validation workflows by aligning dataset naming used during training and evaluation.

---------

Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Co-authored-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

…er) (#1760)

### What does this PR do?

Type of change: Bug fix

Adds a new built-in PTQ recipe `general/ptq/nvfp4_mlp_only-novit-kv_fp8` that is identical to `nvfp4_mlp_only-kv_fp8` but excludes the VL vision tower from quantization.

**Root cause (NVBugs 6287461):** The bare `*mlp*` enable globs in `nvfp4_mlp_only-kv_fp8` also match VL vision-tower block MLPs (e.g. Kimi-K2.5 `vision_tower.encoder.blocks.*.mlp.fc0/fc1`). Quantizing the ViT FFNs to NVFP4 is both quality-harmful (degenerate image embeddings) and can break export: Kimi-K2.5's MoonViT `vt_intermediate_size=4304` is not divisible by the NVFP4 packing constraint (2 × group_size = 32, since 4-bit values pack 2-per-byte). `4304 = 16 × 269` is divisible by 16 but not 32, so the compressed-tensors export raises `ValueError: tensor column shape must be divisible by the given group_size 32 but got 4304`. All language-model dims (2048 / 7168 / 18432) are divisible by 32 and quantize fine.

The new recipe appends `*visual*` / `*vision_tower*` disable rules (after the `*mlp*` enables, so the disable wins), mirroring the existing `nvfp4_mlp_only_mse-kv_fp8_cast-novit` recipe and NVIDIA's reference `nvidia/Kimi-K2.5-NVFP4` checkpoint (which excludes the vision tower, multimodal projector, attention, and lm_head).

### Usage

```bash
python hf_ptq.py --model /local/Kimi-K2.5 \
    --recipe general/ptq/nvfp4_mlp_only-novit-kv_fp8 \
    --batch_size 1 --calib_size 32 \
    --export_path /local/Kimi-K2.5-nvfp4_mlp_only-novit-kv_fp8 --trust_remote_code
```

### Testing

- Registered in the `tests/unit/recipe/test_loader.py` builtin smoke list (`test_load_recipe_all_builtins`).
- Added a focused regression test (`test_nvfp4_mlp_only_novit_recipe_disables_vision_quantizers`) asserting the `*visual*` / `*vision_tower*` quantizers are disabled.
- All pre-commit hooks pass, including the `validate modelopt recipes` hook.

### Before your PR is "*Ready for review*"

- Is this change backward compatible?: ✅ (additive: new recipe file only)
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅ (added to builtin recipe smoke test + vision-disable regression test)
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: N/A (new built-in recipe, no API change)
- Did you get Claude approval on this PR?: ❌ (pending)

### Additional Information

Fixes NVBugs 6287461 (Kimi-K2.5 `nvfp4_mlp_only-kv_fp8` quant failure). Related Jira: OMNIML-5005.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **New Features**
  * Added a new built-in PTQ recipe for NVFP4 MLP/MoE quantization with FP8 KV cache support.
  * The recipe excludes vision-related components from quantization.
* **Documentation**
  * Updated the shipped recipes list to include the new PTQ recipe.
* **Tests**
  * Expanded recipe-loading coverage to include the new built-in PTQ recipe.
  * Added a check to confirm vision quantizers remain disabled for this recipe.

---------

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
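Two parts of the root-cause analysis are easy to verify by hand: the rule ordering and the divisibility arithmetic. The sketch below assumes "last matching rule wins" semantics, as the phrase "so the disable wins" suggests, and models a recipe as a plain list of `(pattern, enabled)` pairs; the real recipes are YAML and carry more fields. The function `resolve` and the language-model quantizer name are hypothetical.

```python
from fnmatch import fnmatchcase


def resolve(rules: list, name: str, default: bool = False) -> bool:
    """Apply rules in order; a later matching rule overrides an earlier one."""
    enabled = default
    for pattern, enable in rules:
        if fnmatchcase(name, pattern):
            enabled = enable
    return enabled


MLP_ONLY = [("*mlp*", True)]
# The no-ViT variant appends the disables after the enables.
MLP_ONLY_NOVIT = MLP_ONLY + [("*visual*", False), ("*vision_tower*", False)]

vit_name = "vision_tower.encoder.blocks.0.mlp.fc0.weight_quantizer"
llm_name = "language_model.layers.0.mlp.up_proj.weight_quantizer"  # hypothetical

vit_before = resolve(MLP_ONLY, vit_name)
vit_after = resolve(MLP_ONLY_NOVIT, vit_name)
llm_after = resolve(MLP_ONLY_NOVIT, llm_name)

# Packing constraint from the description: 2 x group_size = 32.
GROUP_SIZE = 16
PACKED_MULTIPLE = 2 * GROUP_SIZE
vit_dim = 4304
llm_dims = [2048, 7168, 18432]

vit_remainder_16 = vit_dim % GROUP_SIZE
vit_remainder_32 = vit_dim % PACKED_MULTIPLE
llm_ok = all(dim % PACKED_MULTIPLE == 0 for dim in llm_dims)

print(vit_before, vit_after, llm_after)
print(vit_remainder_16, vit_remainder_32, llm_ok)
```

Under these assumptions the bare `*mlp*` rule enables the vision-tower MLP quantizer, the appended disables turn it back off, and the language-model MLP stays enabled. The arithmetic also matches the description: 4304 is a multiple of 16 but leaves a remainder of 16 modulo 32.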

)

### What does this PR do?

Fix lm_eval_hf freezing issue on multi-GPU Slurm interactive nodes.

**Note (Slurm interactive nodes):** On Slurm interactive nodes, `WORLD_SIZE` is set to the number of available GPUs in the shell environment. Running `python` directly causes `lm_eval` to hang waiting for peer ranks that were never spawned. Prepend `WORLD_SIZE=1` to any of the above commands to fix this.

### Usage

See examples/llm_eval/README.md.

### Testing

Tested manually on a Slurm interactive node with 8 GPUs.

- Is this change backward compatible?: ✅
- Did you write any new necessary tests?: N/A

## Summary by CodeRabbit

* **Documentation**
  * Updated the LLM evaluation baseline to explicitly support both standard Hugging Face models and heterogeneous pruned Puzzletron checkpoints.
  * Added a Slurm interactive-nodes note advising `WORLD_SIZE=1` for direct `python` runs, while noting distributed launch tooling handles `WORLD_SIZE`.
  * Recommended using `--limit 10` for quick smoke tests.
  * Simplified evaluation instructions by linking to the LM-Eval-Harness guide instead of repeating commands/snippets.

---------

Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

…718750) (#1858)

Type of change: Bug fix

Fixes NVBug 5718750 (JIRA OMNIML-3159): a Qwen3-30B-A3B NVFP4 checkpoint fails to load in vLLM/SGLang with `AssertionError: Tried to load weights of size [128, 2048] to a parameter of size [128, 1024]`.

**Root cause.** During unified HF export, `get_quant_config` records a module in `exclude_modules` **only if it carries a quantizer** (even a disabled one). MoE routers are intentionally kept in original precision via the `*mlp.gate.*` / `*router*` disable patterns, but those only *disable* a quantizer that already exists; they don't create one.

- On `transformers<5.0` the router (`mlp.gate`) is a plain `nn.Linear`, so `mtq.quantize` attaches a (disabled) quantizer → it is recorded → correctly listed in `exclude_modules`.
- On `transformers>=5.0` MoE experts are batched and the router is a `TopKRouter` (not an `nn.Linear`), so it **never receives a quantizer**. The exporter skips it (`has_quantizers == False`), so its BF16 weight is written to the checkpoint but omitted from `exclude_modules`. vLLM/SGLang then treat the router as a quantized (packed-FP4) weight and fail to load it.

QA confirmed the failing export ran with **transformers 5.12.0**, matching this analysis.

**Fix.** Detect MoE routers structurally (an MoE block that exposes an `experts` container plus a weight-bearing `gate` / `router` / `shared_expert_gate` submodule) and record any such router that is left unquantized as `QUANTIZATION_NONE`, so it always lands in `exclude_modules` regardless of whether a quantizer happens to be attached. Routers the user opted to quantize (non-NONE format) are left untouched.

No API change. The exported `hf_quant_config.json` / `config.json` `ignore` list now contains the MoE router gates (e.g. `model.layers.*.mlp.gate`) on the transformers-5 batched-experts path, as it already did on transformers-4.

- New CPU unit test `tests/unit/torch/export/test_get_quantization.py::test_moe_router_excluded_when_not_quantized`: quantizes a fake MoE model whose router is a non-`nn.Linear` `TopKRouter` (no quantizer attached) to NVFP4 and asserts the router appears in `exclude_modules` while the quantized experts do not.
- `tests/unit/torch/export/test_get_quantization.py`, `test_unified_export_hf.py`, `test_layer_utils.py` all pass.
- `pre-commit` (ruff, ruff-format, mypy, bandit, license) passes on the changed files.

- Is this change backward compatible?: ✅
- If you copied code from any other sources or added a new PIP dependency, did you follow guidance in `CONTRIBUTING.md`: N/A
- Did you write any new necessary tests?: ✅
- Did you update [Changelog](https://github.com/NVIDIA/Model-Optimizer/blob/main/CHANGELOG.rst)?: ✅
- Did you get Claude approval on this PR?: ❌ (pending)

NVBug 5718750 · JIRA OMNIML-3159. Same `transformers>=5.0` batched-experts MoE export path as the recent fused-experts work (OMNIML-5003).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

* **Bug Fixes**
  * Exported quantization configs now always include unquantized mixture-of-experts router/gate components, preventing downstream weight-shape mismatches during loading.
  * Improved detection for MoE routers/gates that aren’t standard linear layers, ensuring they’re no longer inadvertently omitted when present.
  * Added MoE-focused unit tests to confirm router exclusion behavior while keeping quantized experts intact, including correct handling when the MoE block is the root module.

---------

Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
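The structural detection described in the fix can be sketched without PyTorch. The code below is an illustration only, not the `quant_utils.py` implementation: `Node` is a hypothetical stand-in for a module tree, `find_unquantized_router_names` is a made-up helper, and `quantized_names` stands in for "routers the user opted to quantize".

```python
ROUTER_CHILD_NAMES = ("gate", "router", "shared_expert_gate")


class Node:
    """Minimal stand-in for a module: named children plus an optional weight."""

    def __init__(self, children=None, weight=None):
        self.children = dict(children or {})
        self.weight = weight


def named_nodes(root, prefix=""):
    """Yield (dotted_name, node) pairs; the root itself has the empty name."""
    yield prefix, root
    for child_name, child in root.children.items():
        full_name = f"{prefix}.{child_name}" if prefix else child_name
        yield from named_nodes(child, full_name)


def find_unquantized_router_names(root, quantized_names=frozenset()):
    """Return routers of MoE blocks that should be listed as excluded."""
    found = []
    for block_name, block in named_nodes(root):
        if "experts" not in block.children:
            continue  # not an MoE block
        for child_name in ROUTER_CHILD_NAMES:
            child = block.children.get(child_name)
            if child is None or child.weight is None:
                continue  # absent, or not weight-bearing
            full_name = f"{block_name}.{child_name}" if block_name else child_name
            if full_name not in quantized_names:
                found.append(full_name)
    return found


moe_block = Node({"experts": Node(), "gate": Node(weight=[0.0])})
dense_mlp = Node({"gate": Node(weight=[0.0])})  # has a gate but no experts
model = Node({"model": Node({"layers": Node({
    "0": Node({"mlp": moe_block}),
    "1": Node({"mlp": dense_mlp}),
})})})

excluded = find_unquantized_router_names(model)
excluded_when_quantized = find_unquantized_router_names(
    model, quantized_names={"model.layers.0.mlp.gate"}
)
root_case = find_unquantized_router_names(
    Node({"experts": Node(), "router": Node(weight=[0.0])})
)

print(excluded, excluded_when_quantized, root_case)
```

Because the check looks at structure rather than at whether a quantizer is attached, the router is found even when its type is not a plain linear layer, and the root-module case yields the bare child name with no leading dot.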

coderabbitai · 2026-07-01T21:47:18Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7bf46f0a-3ac2-45ba-86a3-fbd467afd79f

📥 Commits

Reviewing files that changed from the base of the PR and between 20a4a85 and 86b500f.

📒 Files selected for processing (2)

examples/llm_ptq/example_utils.py
tests/examples/llm_ptq/test_example_utils.py

📝 Walkthrough


This PR updates ONNX autocast inference and stale-shape handling, excludes MoE routers from quantized export, upcasts FP8 tensors before amax, changes several loader/config defaults, adds a PTQ recipe, and refreshes related documentation and tests.

Changes

ONNX Autocast and Shape Reconciliation

| Layer / File(s) | Summary |
| --- | --- |
| Strict inference in autocast conversion `modelopt/onnx/autocast/convert.py` | `convert_to_mixed_precision` and `convert_to_f16` now call `infer_types(..., strict_mode=True)`. |
| Shared graph clearing utility `modelopt/onnx/autocast/precisionconverter.py`, `modelopt/onnx/autocast/utils.py` | `PrecisionConverter.convert()` uses `utils.clear_types_and_shapes_recursive`, and the class-local recursive clearer is removed. |
| Fallback inference and output shape reconciliation `modelopt/onnx/utils.py` | `infer_types()` falls back to standalone type inference on failure, and `clear_stale_value_info()` now reconciles stale `graph.output` shapes. |
| ONNX autocast and utility tests `tests/unit/onnx/autocast/test_autocast.py`, `tests/unit/onnx/test_onnx_utils.py` | Tests cover strict conversion fallback, stale output-shape repair, dynamic dim preservation, and typed outputs after failed ONNX inference. |

MoE Export Exclusion and FP8 Reduction

| Layer / File(s) | Summary |
| --- | --- |
| MoE router quant config wiring `modelopt/torch/export/quant_utils.py` | `_get_unquantized_moe_router_names` detects MoE router/gate modules and `get_quant_config` marks them `QUANTIZATION_NONE`. |
| MoE export tests `tests/unit/torch/export/test_get_quantization.py` | Tests build fake MoE modules and verify router exclusion, expert export quantization, and root-module router naming. |
| FP8 upcast in reduce_amax `modelopt/torch/quantization/utils/core_utils.py`, `tests/unit/torch/quantization/test_utils.py` | `reduce_amax` upcasts FP8 inputs before reduction, and the regression test checks FP8 dtypes across axes. |

Lazy Imports, Checkpoint Loading, and PTQ Recipe

| Layer / File(s) | Summary |
| --- | --- |
| Lazy backend imports `modelopt/torch/quantization/backends/fp8_per_tensor_gemm.py`, `modelopt/torch/quantization/backends/nvfp4_gemm.py` | `RealQuantLinear` is imported inside the FP8 and NVFP4 availability checks instead of at module scope. |
| Sharded checkpoint load strategy `modelopt/torch/opt/plugins/mcore_dist_checkpointing.py` | The plugin loads extra state with `TorchDistLoadShardedStrategy()`. |
| Validation split rename and PTQ recipe `modelopt/torch/puzzletron/dataset/prepare_dataset.py`, `examples/puzzletron/configs/*/validate_model_defaults.yaml`, `modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml`, `modelopt_recipes/general/ptq/nvfp4_mlp_only-novit-kv_fp8.yaml`, `modelopt_recipes/ptq.md`, `tests/unit/recipe/test_loader.py` | `valid` is renamed to `validation` in Puzzletron dataset/config defaults, and a new NVFP4 MLP-only no-ViT PTQ recipe is added with matching config, docs, and loader tests. |

Documentation Updates

| Layer / File(s) | Summary |
| --- | --- |
| Release notes and docs dependency pin `CHANGELOG.rst`, `pyproject.toml` | `CHANGELOG.rst` adds bug-fix entries and `pyproject.toml` pins `sphinx-argparse`. |
| Evaluation docs `examples/llm_eval/README.md`, `examples/puzzletron/README.md` | `examples/llm_eval/README.md` expands baseline guidance and removes the standalone Puzzletron subsection; `examples/puzzletron/README.md` points to the shared LM-Eval-Harness instructions. |
| TensorRT-Edge-LLM CLI docs `examples/torch_onnx/README.md` | The CLI verification steps, tool listing, and LLM/VLM/EAGLE examples are updated to the consolidated command structure. |

LLM PTQ Config Resolution

| Layer / File(s) | Summary |
| --- | --- |
| Config and dtype resolution `examples/llm_ptq/example_utils.py` | `example_utils.py` re-derives init configs for remote-code models, resolves dtype from config fields, and applies dtype into model kwargs before init/load. |
| PTQ helper tests `tests/examples/llm_ptq/test_example_utils.py` | Tests cover remote config re-derivation, fallback behavior, and dtype keyword selection in `get_model`. |

Estimated code review effort: 4 (Complex) | ~60 minutes

Possibly related PRs

NVIDIA/Model-Optimizer#293: Related to Eagle speculative-decoding export flow updates in examples/torch_onnx/README.md.
NVIDIA/Model-Optimizer#1808: Related to the TensorRT-Edge-LLM CLI command updates in examples/torch_onnx/README.md.
NVIDIA/Model-Optimizer#1858: Related to MoE router detection and exclude_modules export wiring in modelopt/torch/export/quant_utils.py.

Suggested labels: cherry-pick-0.45.0

Suggested reviewers: cjluo-nv, meenchen, kinjalpatel27

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 inconclusive)

Check name	Status	Explanation	Resolution
Title check	❓ Inconclusive	The title identifies a cherry-pick bundle, but the PR numbers alone don't describe the actual changes.	Rename it to a concise summary of the bundled release fixes, e.g. 'Cherry-pick release 0.45 fixes for ONNX quantization, MoE export, PTQ recipes, and docs'.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	No banned load/eval/nosec patterns appeared in the changed Python code, and the added sphinx-argparse dependency is MIT-licensed.


Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

github-actions · 2026-07-01T21:55:54Z

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-07-02 04:56 UTC

Transplant the combined get_model fix from PRs #1839, #1857 and #1869 onto release/0.45.0's examples/llm_ptq/example_utils.py. These PRs could not be cherry-picked directly because the file was renamed llm_ptq -> hf_ptq (#1759) and surrounding get_model code diverged on main, but the actual fix targets the init_empty_weights / from_config block that already exists on the release branch:

- _resolve_init_config: re-derive a built-in config for remote-code checkpoints so device-map inference matches the model definition's version (fixes Nemotron-H moe_latent_size AttributeError on transformers 5.x, #1839).
- _get_config_dtype / _apply_dtype_to_config: derive dtype from the resolved config and forward the DeciLM-supported dtype kwarg, dropping unsupported dtype forwarding on the real from_pretrained load (#1857, #1869).

Ports the accompanying unit tests (path-adjusted to llm_ptq).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
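As a rough illustration of the dtype resolution mentioned in this commit message, the sketch below picks a dtype from a config object. It is not the code in `example_utils.py`: the helper name `resolve_config_dtype`, the field names and their priority (`dtype` before `torch_dtype`), and the `default` fallback are all assumptions made for the example, and `SimpleNamespace` stands in for a real config.

```python
from types import SimpleNamespace


def resolve_config_dtype(config, default="float32"):
    """Return the first non-None dtype-like field found on ``config``.

    Hypothetical helper: the field names and their order are assumptions
    for illustration, not the logic of ``_get_config_dtype``.
    """
    for field in ("dtype", "torch_dtype"):
        value = getattr(config, field, None)
        if value is not None:
            return value
    return default


# Stand-in configs; a real run would pass a model config object instead.
new_style = SimpleNamespace(dtype="bfloat16")
old_style = SimpleNamespace(torch_dtype="float16")
bare = SimpleNamespace()

print(resolve_config_dtype(new_style))
print(resolve_config_dtype(old_style))
print(resolve_config_dtype(bare))
```

Resolving the value once and passing it on explicitly keeps the empty-weights init and the real load from disagreeing about precision, which appears to be the intent of the fix.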

coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.


Actionable comments posted: 4

🧹 Nitpick comments (2)

modelopt/torch/quantization/utils/core_utils.py (1)
36-39: 🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Cover the fnuz FP8 variants here too. PyTorch exposes torch.float8_e4m3fnuz and torch.float8_e5m2fnuz, and this codebase already treats them as FP8 inputs in modelopt/torch/quantization/plugins/diffusion/ltx2.py. Leaving them out keeps reduce_amax() exposed to the same unsupported-kernel failure on those weights.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/quantization/utils/core_utils.py` around lines 36 - 39, Extend
the FP8 dtype tuple in core_utils so reduce_amax() treats the fnuz variants as
FP8 too; update _FP8_DTYPES to include torch.float8_e4m3fnuz and
torch.float8_e5m2fnuz alongside the existing entries, matching the FP8 handling
already used in ltx2.py. This will ensure the upcast path is applied before amax
reduction for those dtypes and avoid unsupported-kernel failures.
modelopt/torch/quantization/backends/fp8_per_tensor_gemm.py (1)
114-116: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Lazy-import block duplicated across FP8/NVFP4 backends.

The same lazy RealQuantLinear import + comment is repeated verbatim in nvfp4_gemm.py's _nvfp4_availability_check. Consider extracting a small shared helper (e.g., _is_real_quant_linear(module)) in a common backends module to avoid the two call sites drifting out of sync if the import path or justification changes later.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modelopt/torch/quantization/backends/fp8_per_tensor_gemm.py` around lines 114
- 116, The lazy RealQuantLinear import block is duplicated across the FP8 and
NVFP4 backend availability checks, so factor it into a shared helper in the
common backends area instead of keeping two identical call sites. Create a small
utility such as a module-level helper that performs the lazy import and any
needed check, then update fp8_per_tensor_gemm.py and nvfp4_gemm.py to call that
helper from _fp8_availability_check and _nvfp4_availability_check so the import
path and comment stay in one place.



ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9e2b3ee0-212f-49a4-a3eb-01637f2de12b

📥 Commits

Reviewing files that changed from the base of the PR and between 56c1416 and 20a4a85.

📒 Files selected for processing (29)

CHANGELOG.rst
examples/llm_eval/README.md
examples/puzzletron/README.md
examples/puzzletron/configs/gptoss-20b_remove_experts_memory/validate_model_defaults.yaml
examples/puzzletron/configs/llama-3_2-3B_pruneffn_memory/validate_model_defaults.yaml
examples/puzzletron/configs/mistral-small-24b-instruct-2501_pruneffn_memory/validate_model_defaults.yaml
examples/puzzletron/configs/nemotron-nano-12b-v2/validate_model_defaults.yaml
examples/puzzletron/configs/qwen2_5_7b_instruct_pruneffn_memory/validate_model_defaults.yaml
examples/puzzletron/configs/qwen3-8b_pruneffn_memory/validate_model_defaults.yaml
examples/torch_onnx/README.md
modelopt/onnx/autocast/convert.py
modelopt/onnx/autocast/precisionconverter.py
modelopt/onnx/autocast/utils.py
modelopt/onnx/utils.py
modelopt/torch/export/quant_utils.py
modelopt/torch/opt/plugins/mcore_dist_checkpointing.py
modelopt/torch/puzzletron/dataset/prepare_dataset.py
modelopt/torch/quantization/backends/fp8_per_tensor_gemm.py
modelopt/torch/quantization/backends/nvfp4_gemm.py
modelopt/torch/quantization/utils/core_utils.py
modelopt_recipes/configs/ptq/units/default_disabled_quantizers.yaml
modelopt_recipes/general/ptq/nvfp4_mlp_only-novit-kv_fp8.yaml
modelopt_recipes/ptq.md
pyproject.toml
tests/unit/onnx/autocast/test_autocast.py
tests/unit/onnx/test_onnx_utils.py
tests/unit/recipe/test_loader.py
tests/unit/torch/export/test_get_quantization.py
tests/unit/torch/quantization/test_utils.py

coderabbitai · 2026-07-01T22:03:13Z

-```
-
-For a quick smoke test, add `--limit 10`.
+Evaluate AnyModel checkpoints using lm-eval. See the [LM-Eval-Harness section](../llm_eval/README.md#lm-eval-harness) in `examples/llm_eval/README.md` for full instructions, including multi-GPU and Slurm setup.


🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Keep the Puzzletron prerequisite here or in the linked docs.

The removed subsection was the only place that told readers to install the puzzletron extra. Without that note, Puzzletron users can miss the optional dependency and end up on the no-op import path.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/puzzletron/README.md` at line 259, The Puzzletron setup instructions are missing the prerequisite to install the optional `puzzletron` extra, which can leave users on the no-op import path. Restore that prerequisite note in this README or in the linked lm-eval docs, near the Puzzletron/AnyModel evaluation guidance, so readers know to install the extra before following the `lm-eval` workflow.

coderabbitai · 2026-07-01T22:03:13Z

+    def _clear_callback(g: onnx.GraphProto, parent: onnx.NodeProto, is_sub: bool) -> None:
+        logger.debug(f"Clearing types/shapes in {'subgraph' if is_sub else 'main graph'}: {g.name}")
+
+        if is_sub:
+            # Subgraph inputs are cleared so they propagate from the parent graph.
+            for inp in g.input:
+                if inp.type.HasField("tensor_type"):
+                    _clear(inp, clear_shapes)
+            # Only clear value_info for intermediates produced within this subgraph.
+            subgraph_outputs = {out for node in g.node for out in node.output}
+            for vi in g.value_info:
+                if vi.name in subgraph_outputs:
+                    _clear(vi, clear_shapes)
+        else:
+            for vi in g.value_info:
+                _clear(vi, clear_shape=True)
+
+        for out in g.output:
+            _clear(out, clear_shapes)


🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Main-graph value_info clearing ignores the clear_shapes parameter.

In the else (main graph) branch, _clear(vi, clear_shape=True) is hardcoded, while the subgraph branch and the output-clearing loop correctly use the clear_shapes parameter. This means when a caller passes clear_shapes=False (e.g. PrecisionConverter with use_standalone_type_inference=True, intending to preserve shapes for type-only inference), main-graph intermediate shapes still get clobbered to "unk" — contradicting the function's own docstring ("clear_shapes: If True, also clear shapes (False keeps shapes for type-only inference)").

🐛 Proposed fix

         else:
             for vi in g.value_info:
-                _clear(vi, clear_shape=True)
+                _clear(vi, clear_shapes)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

     def _clear_callback(g: onnx.GraphProto, parent: onnx.NodeProto, is_sub: bool) -> None:
         logger.debug(f"Clearing types/shapes in {'subgraph' if is_sub else 'main graph'}: {g.name}")

         if is_sub:
             # Subgraph inputs are cleared so they propagate from the parent graph.
             for inp in g.input:
                 if inp.type.HasField("tensor_type"):
                     _clear(inp, clear_shapes)
             # Only clear value_info for intermediates produced within this subgraph.
             subgraph_outputs = {out for node in g.node for out in node.output}
             for vi in g.value_info:
                 if vi.name in subgraph_outputs:
                     _clear(vi, clear_shapes)
         else:
             for vi in g.value_info:
-                _clear(vi, clear_shape=True)
+                _clear(vi, clear_shapes)

         for out in g.output:
             _clear(out, clear_shapes)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/onnx/autocast/utils.py` around lines 143 - 161, The main-graph handling in _clear_callback ignores the clear_shapes argument and always clears shapes for value_info entries. Update the else branch in _clear_callback so the _clear call for g.value_info uses the passed clear_shapes flag, matching the subgraph and output handling and preserving shapes when callers request type-only inference.
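The effect of forwarding the flag can be seen in a toy model of the callback. This is only a sketch: `FakeValueInfo`, `clear` and `clear_main_graph` are hypothetical stand-ins written for this example, not ONNX or ModelOpt APIs.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FakeValueInfo:
    """Stand-in for an ONNX value_info entry: a name, a dtype and a shape."""

    name: str
    elem_type: Optional[str]
    shape: Optional[list]


def clear(vi: FakeValueInfo, clear_shape: bool) -> None:
    """Always drop the element type; drop the shape only when asked."""
    vi.elem_type = None
    if clear_shape:
        vi.shape = None


def clear_main_graph(value_info, clear_shapes: bool) -> None:
    # Forward the caller's flag instead of hardcoding True.
    for vi in value_info:
        clear(vi, clear_shapes)


infos = [FakeValueInfo("x", "float32", [1, 8])]
clear_main_graph(infos, clear_shapes=False)
print(infos[0])
```

With `clear_shapes=False` the type is reset but the shape `[1, 8]` survives, which is the behavior the docstring quoted in the comment describes for type-only inference.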

coderabbitai · 2026-07-01T22:03:13Z

+def _reconcile_stale_output_shapes(model: onnx.ModelProto) -> int:
+    """Re-derive stale ``graph.output`` shapes from the operator graph.
+
+    Weakly-typed models (e.g. exported from TensorFlow) can declare an output rank
+    that conflicts with the graph topology -- most commonly a leftover rank-0
+    (scalar) annotation on a tensor that is really rank-2+. Such a stale rank poisons
+    downstream shape inference: ORT fails while augmenting the model for INT8
+    calibration (``axis must be in [-rank, rank-1]. Input rank was 0``), and
+    ``onnx.shape_inference`` with ``strict_mode=True`` raises ``Inferred shape and
+    existing shape differ in rank`` during fp16 autocast.
+
+    Strategy: snapshot the declared output shapes, clear them, and re-derive them from
+    the operator graph -- preferring ORT's symbolic shape inference (it resolves ops
+    such as ``TopK`` that ONNX's static inference gives up on) and falling back to the
+    size-aware ``infer_shapes`` wrapper. A declared shape is only overwritten when it is
+    genuinely stale -- a rank mismatch (the rank-0-vs-rank-N bug) or a conflicting
+    concrete dimension. Outputs that merely differ in symbolic ``dim_param`` names (e.g.
+    a re-derived ``unk__0`` vs a declared ``batch``) keep their original declaration, so
+    healthy models -- including dynamic batch/sequence dims -- are left untouched. A
+    graph output is never left without a shape (``onnx.checker`` requires the field).
+
+    Args:
+        model: Loaded in-memory onnx ModelProto, ideally with ``value_info`` already
+            cleared so re-inference derives shapes from the operator graph.
+
+    Returns:
+        Number of graph outputs whose shape was changed.
+    """
+    outputs = model.graph.output
+    if not outputs:
+        return 0
+
+    def _outputs_with_shapes(m: onnx.ModelProto) -> dict[str, onnx.TensorShapeProto]:
+        return {
+            o.name: o.type.tensor_type.shape
+            for o in m.graph.output
+            if o.type.tensor_type.HasField("shape")
+        }
+
+    def _is_stale(declared: onnx.TensorShapeProto | None, inferred: onnx.TensorShapeProto | None):
+        # Only treat a declaration as stale when inference contradicts it: a different
+        # rank, or a concrete dim that disagrees with an inferred concrete dim. A missing
+        # declaration is "stale" (adopt whatever was inferred); a missing inference is not
+        # (keep the declaration). Symbolic dim_param renames are intentionally ignored.
+        if inferred is None:
+            return False
+        if declared is None:
+            return True
+        if len(declared.dim) != len(inferred.dim):
+            return True
+        return any(
+            d.HasField("dim_value") and i.HasField("dim_value") and d.dim_value != i.dim_value
+            for d, i in zip(declared.dim, inferred.dim)
+        )
+
+    # Snapshot declared shapes, then clear them so re-inference starts from the
+    # topology instead of being biased by the stale annotations.
+    declared: dict[str, onnx.TensorShapeProto | None] = {}
+    for o in outputs:
+        tt = o.type.tensor_type
+        if tt.HasField("shape"):
+            snapshot = onnx.TensorShapeProto()
+            snapshot.CopyFrom(tt.shape)
+            declared[o.name] = snapshot
+        else:
+            declared[o.name] = None
+        tt.ClearField("shape")
+
+    # Re-derive output shapes from the cleared model (neither inference call mutates it):
+    # prefer ORT symbolic shape inference, then fall back to the size-aware infer_shapes
+    # wrapper if it is unavailable or yields nothing.
+    inferred: dict[str, onnx.TensorShapeProto] = {}
+    try:
+        from onnxruntime.tools.symbolic_shape_infer import SymbolicShapeInference
+
+        inferred = _outputs_with_shapes(SymbolicShapeInference.infer_shapes(model, auto_merge=True))
+    except Exception as e:
+        logger.debug("Symbolic shape inference unavailable/failed: %s", e)
+    if not inferred:
+        try:
+            inferred = _outputs_with_shapes(infer_shapes(model, strict_mode=False, data_prop=True))
+        except Exception as e:
+            logger.debug("ONNX shape inference for output reconciliation failed: %s", e)
+
+    changed = 0
+    for o in outputs:
+        decl = declared[o.name]
+        inf = inferred.get(o.name)
+        # Adopt the inferred shape only when the declaration is genuinely stale; otherwise
+        # restore the declared shape (never leaving a graph output shapeless).
+        if _is_stale(decl, inf):
+            o.type.tensor_type.shape.CopyFrom(inf)
+            changed += 1
+        elif decl is not None:
+            o.type.tensor_type.shape.CopyFrom(decl)
+    return changed
+
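The staleness rule in `_is_stale` above can be tried out on plain Python values. The sketch below is a simplified mirror, not the function from the diff: shapes are modeled as lists whose items are ints (concrete dims) or strings (symbolic dims), and the name `is_stale` and this list representation are assumptions made for the example.

```python
def is_stale(declared, inferred):
    """Simplified staleness rule on list-based shapes.

    ``None`` means "no shape available"; ints are concrete dims and
    strings are symbolic dim names.
    """
    if inferred is None:
        return False  # nothing inferred: keep the declaration
    if declared is None:
        return True  # nothing declared: adopt the inference
    if len(declared) != len(inferred):
        return True  # rank mismatch, e.g. a leftover rank-0 annotation
    return any(
        isinstance(d, int) and isinstance(i, int) and d != i
        for d, i in zip(declared, inferred)
    )


print(is_stale([], ["unk__0", 1000]))               # True: rank 0 vs rank 2
print(is_stale(["batch", 1000], ["unk__0", 1000]))  # False: symbolic rename only
print(is_stale([1, 1000], [1, 1001]))               # True: conflicting concrete dim
print(is_stale([1, 1000], None))                    # False: inference unavailable
```

Only contradictions count as stale, so a model with healthy dynamic dimensions keeps its declared names.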


🗄️ Data Integrity & Integration | 🟡 Minor | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

Does onnxruntime SymbolicShapeInference.infer_shapes mutate the input ModelProto in place or return a new copy?

💡 Result:

The onnxruntime.tools.symbolic_shape_infer.SymbolicShapeInference.infer_shapes function does not return a new copy of the ModelProto; it performs operations that can modify the input ModelProto object [1]. In the implementation of SymbolicShapeInference, the input ModelProto is used to initialize the state [2][3]. Specifically, within the preprocess method, a copy of the model is made using out_mp.CopyFrom(in_mp) for internal use, but the downstream processes and the way the tool is commonly invoked in practice often result in the original model object being mutated [2][1]. Evidence from user reports indicates that calling SymbolicShapeInference.infer_shapes on a model object can lead to changes being persisted in that same object, which is a known source of side effects [4][1]. If you need to preserve the original model, it is best practice to pass a deep copy of the ModelProto or explicitly load the model again if necessary [1].

Citations:

1: KeyError when calling onnxruntime.tools.symbolic_shape_infer.SymbolicShapeInference on model containing Loop microsoft/onnxruntime#24495

2: https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/symbolic_shape_infer.py

3: https://github.com/microsoft/onnxruntime/blob/eaf8d943/onnxruntime/python/tools/transformers/shape_infer_helper.py

4: Remove mutable arguments from symbolic_shape_infer microsoft/onnxruntime#5166

🌐 Web query:

onnxruntime.tools.symbolic_shape_infer SymbolicShapeInference.infer_shapes mutates input ModelProto in place source

💡 Result:

The SymbolicShapeInference.infer_shapes method in ONNX Runtime generally performs its operations in a way that modifies the provided ModelProto object [1][2]. Users typically pass a ModelProto, and the method performs symbolic shape inference, updating the graph's value information (shapes and types) within that model object [1][2]. While Python objects are passed by reference and the tool is designed to work on the model structure, it is common practice to treat the resulting object as the updated model [2]. Developers often use the returned model or save the object after the call to ensure the changes are captured [2]. It is worth noting that historical development within the ONNX Runtime repository has highlighted the risks of mutable arguments in such utility functions, with efforts made to manage state explicitly [3]. Therefore, while the operation often mutates the model in place, you should rely on the returned ModelProto object to be safe [1][2].

Citations:

1: https://github.com/microsoft/onnxruntime/blob/master/onnxruntime/test/python/onnxruntime_test_python_symbolic_shape_infer.py

2: KeyError when calling onnxruntime.tools.symbolic_shape_infer.SymbolicShapeInference on model containing Loop microsoft/onnxruntime#24495

3: Remove mutable arguments from symbolic_shape_infer microsoft/onnxruntime#5166

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Inspect the reconciliation helper and its call sites to see whether symbolic
# shape inference side effects are observable or already undone by restoration.
file="modelopt/onnx/utils.py"
sed -n '1840,2035p' "$file"
printf '\n--- call sites ---\n'
rg -n "_reconcile_stale_output_shapes|clear_stale_value_info|SymbolicShapeInference\.infer_shapes|infer_shapes\(model, strict_mode=False, data_prop=True\)" "$file"

Repository: NVIDIA/Model-Optimizer

Length of output: 8750

ORT symbolic shape inference mutates the input model in place.

The comment here is inaccurate. The output-shape restore logic still works, but if preserving the cleared value_info state matters, deep-copy model before calling ORT or clear it again afterward.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/onnx/utils.py` around lines 1878 - 1974, The comment in _reconcile_stale_output_shapes is inaccurate because SymbolicShapeInference.infer_shapes can mutate the input model in place. Update the implementation around the ORT symbolic shape inference call to avoid relying on the original model state being preserved: either operate on a deep copy of the model before calling SymbolicShapeInference.infer_shapes, or restore any cleared graph state afterward so the downstream output-shape reconciliation logic remains correct.
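The deep-copy option mentioned in this comment can be sketched generically. In the snippet below a dict stands in for the model and `mutating_infer` for an inference routine that edits its argument; both, along with the wrapper name `infer_on_copy`, are hypothetical and chosen only to show the pattern.

```python
import copy


def infer_on_copy(model, infer_fn):
    """Run ``infer_fn`` on a deep copy so ``model`` itself is never mutated."""
    return infer_fn(copy.deepcopy(model))


def mutating_infer(model):
    # Stand-in for an inference routine that edits its argument in place.
    model["value_info"].append("inferred_tensor")
    return model


original = {"value_info": []}
result = infer_on_copy(original, mutating_infer)
print(original["value_info"])  # []
print(result["value_info"])    # ['inferred_tensor']
```

The cost is one extra copy of the model in memory, which may matter for large graphs; clearing the affected state again after the call is the alternative the comment names.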

coderabbitai · 2026-07-01T22:03:13Z

+def test_moe_router_names_handle_root_module():
+    """When the MoE block itself is the root module, router names have no leading dot."""
+    from modelopt.torch.export.quant_utils import _get_unquantized_moe_router_names
+
+    block = _FakeMoEBlock(hidden=16)
+    # name == "" for the root module; the router must be "gate", not ".gate".
+    assert _get_unquantized_moe_router_names(block) == ["gate"]


📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Move import to top of file.

_get_unquantized_moe_router_names is imported inside the test function without justification (no circular-import or optional-dependency reason). Per path instructions, imports should be at module top so import errors surface at collection time.

As per path instructions: "Imports inside functions or test methods without explicit justification... Imports belong at the top of the file so import errors surface at collection time, not mid-test."

♻️ Proposed fix

-def test_moe_router_names_handle_root_module():
-    """When the MoE block itself is the root module, router names have no leading dot."""
-    from modelopt.torch.export.quant_utils import _get_unquantized_moe_router_names
-
-    block = _FakeMoEBlock(hidden=16)
+def test_moe_router_names_handle_root_module():
+    """When the MoE block itself is the root module, router names have no leading dot."""
+    block = _FakeMoEBlock(hidden=16)
     # name == "" for the root module; the router must be "gate", not ".gate".
     assert _get_unquantized_moe_router_names(block) == ["gate"]

(add _get_unquantized_moe_router_names to the existing top-level import from modelopt.torch.export.quant_utils in this file.)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/torch/export/test_get_quantization.py` around lines 163 - 169, The test test_moe_router_names_handle_root_module imports _get_unquantized_moe_router_names inside the function, which should be moved to the module-level import section. Update the existing top-level import from modelopt.torch.export.quant_utils in this test file so the helper is imported once alongside the other symbols, and remove the in-function import from the test body.

Source: Path instructions
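For context on what the test above guards against, the leading-dot case comes from joining an empty parent name with a child name. A minimal sketch of a join that handles the root module, using a hypothetical helper `join_module_name` rather than the actual code in `quant_utils.py`:

```python
def join_module_name(parent: str, child: str) -> str:
    """Join a parent module name and a child name without a leading dot."""
    return f"{parent}.{child}" if parent else child


print(join_module_name("", "gate"))                    # gate
print(join_module_name("model.layers.0.mlp", "gate"))  # model.layers.0.mlp.gate
```

A plain `f"{parent}.{child}"` would produce `.gate` for the root module, which would not match the real submodule name.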

codecov · 2026-07-01T22:09:08Z

Codecov Report

❌ Patch coverage is 85.00000% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 76.98%. Comparing base (faaf9f4) to head (86b500f).
⚠️ Report is 1 commits behind head on release/0.45.0.

Files with missing lines	Patch %	Lines
modelopt/onnx/utils.py	80.00%	10 Missing ⚠️
modelopt/onnx/autocast/utils.py	86.95%	3 Missing ⚠️
modelopt/torch/export/quant_utils.py	88.88%	2 Missing ⚠️

Additional details and impacted files

@@                Coverage Diff                 @@
##           release/0.45.0    #1880      +/-   ##
==================================================
- Coverage           77.33%   76.98%   -0.35%     
==================================================
  Files                 504      504              
  Lines               55420    55479      +59     
==================================================
- Hits                42861    42713     -148     
- Misses              12559    12766     +207

Flag	Coverage Δ
examples	`42.56% <24.00%> (-0.38%)`	⬇️
gpu	`57.76% <73.00%> (-1.35%)`	⬇️
regression	`14.63% <2.00%> (-0.19%)`	⬇️
unit	`54.47% <82.00%> (+0.08%)`	⬆️


mxinO and others added 10 commits July 1, 2026 14:33

kevalmorabia97 requested review from a team as code owners July 1, 2026 21:45

kevalmorabia97 requested review from Edwardf0t1, ajrasane and realAsma and removed request for a team and ajrasane July 1, 2026 21:45

kevalmorabia97 removed request for Edwardf0t1 and realAsma July 1, 2026 21:46

build(docs): pin sphinx-argparse<0.6.0 to fix doc build error

20a4a85

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

kevalmorabia97 requested a review from a team as a code owner July 1, 2026 21:47

kevalmorabia97 changed the title ~~[Cherry-pick] PRs #1801 #1808 #1629 #1627 #1824 #1826 #1830 #1760 #1831 #1858~~ [Cherry-pick] PRs #1801 #1808 #1629 #1627 #1824 #1826 #1830 #1760 #1831 #1858 #1839 #1857 #1869 Jul 1, 2026

coderabbitai Bot reviewed Jul 1, 2026

View reviewed changes

kevalmorabia97 merged commit 85d5201 into release/0.45.0 Jul 2, 2026
53 of 56 checks passed

kevalmorabia97 deleted the cherry-picks/release-0.45.0 branch July 2, 2026 04:56

kevalmorabia97 merged 12 commits into release/0.45.0 from cherry-picks/release-0.45.0
