NVIDIA · kevalmorabia97 · Jul 2, 2026 · Jun 23, 2026 · Jun 23, 2026 · Jun 24, 2026
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -81,13 +81,16 @@ Changelog
 
 **Bug Fixes**
 
+- Fix ``ShapeInferenceError`` during ONNX INT8 + FP16 quantization (``--high_precision_dtype fp16``) of weakly-typed models (e.g. TensorFlow exports) that carry stale rank-0 ``graph.output`` shapes or ops such as ``TopK`` that ONNX's static shape inference cannot resolve. ``clear_stale_value_info`` now reconciles stale output shapes via symbolic shape inference (keeping every output's shape field populated), and AutoCast runs ONNX shape inference in strict mode and falls back to schema-based standalone type inference when it fails, so unresolved ops no longer leave tensors untyped.
+- Always list unquantized MoE routers/gates in the exported ``exclude_modules`` (NVBug 5718750). ``get_quant_config`` only recorded modules that carry a quantizer, but on ``transformers>=5.0`` MoE routers are no longer ``nn.Linear`` (e.g. ``TopKRouter``) and never receive one, so the BF16 router weight was written to the checkpoint yet omitted from ``exclude_modules``. vLLM / SGLang then treated it as quantized and failed to load (e.g. Qwen3-30B-A3B NVFP4: ``AssertionError: Tried to load weights of size [128, 2048] to a parameter of size [128, 1024]``). Routers are now detected structurally (an MoE block with an ``experts`` container plus a weight-bearing ``gate`` / ``router`` / ``shared_expert_gate`` submodule) and recorded as unquantized regardless of quantizer attachment.
 - In Megatron-Core only do EP amax sync for routed expert weights if ``sync_expert_weight_amax=True``. Previously EP amax sync would sync routed expert weights across EP ranks even when ``sync_expert_weight_amax`` was False.
 - Fix Megatron-Core HF importer to load fused ``TELayerNormColumnParallelLinear.layer_norm_weight`` from HF for GPT-family models (Qwen3 etc.) under ``--export-default-te-spec``. Importer now prefers per-context keys ``fused_input_layernorm`` / ``fused_pre_mlp_layernorm`` (fallback ``fused_norm`` for Nemotron-H backward compatibility); ``mcore_qwen.py`` provides the new rules. Without this fix, post-prune MMLU sat at chance.
 - Fix ONNX AutoCast ``keep_io_types=True`` sanity-check failure (``Unexpected type in I/O tensor ...``) when a network input/output is an empty tensor (a dimension of size 0). Such tensors were "fake-cast" (retyped in place) to the low precision type; because the value-info map aliases the ``graph.input``/``graph.output`` ``ValueInfoProto``, this silently changed the model's I/O type. AutoCast now inserts a real ``Cast`` for protected I/O tensors instead.
 - Fix INT8 entropy calibration of fp16 ONNX models raising ``ValueError: Too many bins for data range`` on numpy >= 2.0. ``_collect_value`` in ``modelopt.onnx.quantization.ort_patching`` now casts the histogram range endpoints to Python float so bin edges are computed in float64, instead of inheriting the fp16 dtype of an activation tensor with a small range (which collapsed the 128-bin linspace under NEP-50 promotion).
 - Fix the GPT-OSS MXFP4 → NVFP4 PTQ path in ``examples/llm_ptq/hf_ptq.py`` (used with ``--cast_mxfp4_to_nvfp4``). ``get_model`` now loads native MXFP4 checkpoints (``openai/gpt-oss-*``) dequantized to BF16 ``GptOssExperts`` via ``Mxfp4Config(dequantize=True)`` on a sequential device map. This fixes a CUDA illegal-memory access during the multi-GPU dequant load and the ``NotImplementedError`` for experts type ``Mxfp4GptOssExperts`` during unified HF export (the packed-kernel experts wrapper, used when the optional ``kernels`` package is installed, is unsupported by export); ``kernels`` is no longer required. The ``--cast_mxfp4_to_nvfp4`` step now also resolves a HF Hub ID ``--pyt_ckpt_path`` to its local snapshot directory instead of failing with ``FileNotFoundError``.
 - Fix ``_QuantGptOssExperts`` / ``_QuantLlama4TextExperts`` static-block NVFP4 weight calibration raising ``ValueError: Input shape has changed`` during the calibration forward. These experts quantize their weights transposed (``_transposed_quantize``); ``iter_weights_for_calibration`` now yields the same transposed view so weight-only calibration and the forward agree on the block-quant shape (and the export ``_amax`` orientation).
 - Fix unified HF checkpoint export for Llama4 MoE models. The uncalibrated-experts input-quantizer ``amax`` fallback in ``_export_transformers_checkpoint`` special-cased only ``QuantGptOssExperts``; ``QuantLlama4TextExperts`` uses the same fused ``gate_up_proj`` / ``down_proj`` layout and is now handled by the same branch, fixing the export failure.
+- Fix ``NotImplementedError: "max_all_cuda" not implemented for 'Float8_e4m3fn'`` during quantization calibration of models with natively FP8 (``float8_e4m3fn`` / ``float8_e5m2``) weights, such as DeepSeek-V3. FP8 dtypes implement no reduction (``max``/``amax``), ``abs``, or elementwise ``maximum`` kernels, so ``reduce_amax`` now upcasts FP8 inputs to the default float dtype before reducing; the upcast is lossless and only affects the FP8 path.
 
 0.44 (2026-05-14)
 ^^^^^^^^^^^^^^^^^

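The structural router detection described in the MoE `exclude_modules` changelog entry can be sketched as follows. This is a minimal illustration, not the ModelOpt implementation: `find_unquantized_routers` and the `named_modules` mapping are hypothetical names, and plain `SimpleNamespace` objects stand in for real `nn.Module` instances.

```python
from types import SimpleNamespace

# Submodule names treated as routers/gates, taken from the changelog entry.
ROUTER_NAMES = ("gate", "router", "shared_expert_gate")


def find_unquantized_routers(named_modules):
    """Return dotted names of router submodules found inside MoE blocks.

    `named_modules` is a hypothetical mapping of module name -> module-like
    object, standing in for the pairs yielded by `model.named_modules()`.
    """
    routers = []
    for name, module in named_modules.items():
        # Structural test 1: an MoE block carries an `experts` container.
        if getattr(module, "experts", None) is None:
            continue
        for child_name in ROUTER_NAMES:
            child = getattr(module, child_name, None)
            # Structural test 2: the router must actually hold a weight.
            if child is not None and getattr(child, "weight", None) is not None:
                routers.append(f"{name}.{child_name}")
    return routers


# Toy stand-ins for modules; real code would walk an nn.Module tree.
moe_block = SimpleNamespace(experts=[object()], gate=SimpleNamespace(weight=[0.0]))
dense_mlp = SimpleNamespace(gate=SimpleNamespace(weight=[0.0]))  # no `experts`
modules = {"model.layers.0.mlp": moe_block, "model.layers.1.mlp": dense_mlp}

exclude_modules = find_unquantized_routers(modules)
print(exclude_modules)  # ['model.layers.0.mlp.gate']
```

Because the check never looks for an attached quantizer, a router that is not an `nn.Linear` is still reported, which is the behaviour the changelog entry describes.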
@@ -16,18 +16,24 @@ The supported eval tasks are [here](https://github.com/EleutherAI/lm-evaluation-
 
 ### Baseline
 
+Both standard Hugging Face models and heterogeneous pruned checkpoints produced by Puzzletron are supported.
+
 - For models which fit on a single GPU:
 
 ```sh
 python lm_eval_hf.py --model hf --model_args pretrained=<HF model folder or model card> --tasks <comma separated tasks> --batch_size 4
 ```
 
+For a quick smoke test, add `--limit 10` to any of the commands in this section to evaluate on only 10 samples.
+
 - With model-sharding (for models which require multiple GPUs):
 
 ```sh
 python lm_eval_hf.py --model hf --model_args pretrained=<HF model folder or model card>,parallelize=True --tasks <comma separated tasks> --batch_size 4
 ```
 
+> **Note (Slurm interactive nodes):** On Slurm interactive nodes, the shell environment sets `WORLD_SIZE` to the number of available GPUs. Running `python` directly then causes `lm_eval` to hang while it waits for peer ranks that were never spawned. To avoid this, prepend `WORLD_SIZE=1` to the `python` commands above. This does not limit GPU usage: `parallelize=True` independently enables model parallelism across all available GPUs within the single process. The `accelerate launch` command manages `WORLD_SIZE` itself and does not need this workaround.
+
 - For data-parallel evaluation with model-sharding:
 
 With the following command, the model will be sharded across `total_num_of_available_gpus/num_copies_of_your_model` with a data-parallelism of `num_copies_of_your_model`
@@ -40,22 +46,6 @@ accelerate launch --multi_gpu --num_processes <num_copies_of_your_model> \
     --batch_size 4
 ```
 
-### Heterogeneous Pruned Checkpoints (Puzzletron)
-
-Heterogeneous pruned checkpoints produced by Puzzletron are automatically detected and loaded with the appropriate model patcher. No additional flags are needed beyond specifying the checkpoint path:
-
-```sh
-python lm_eval_hf.py --model hf \
-    --model_args pretrained=path/to/anymodel/checkpoint,dtype=bfloat16,parallelize=True \
-    --tasks mmlu \
-    --num_fewshot 5 \
-    --batch_size 4
-```
-
-For a quick smoke test, add `--limit 10`.
-
-> **Note:** Requires the `puzzletron` extra to be installed (`pip install -e ".[puzzletron]"`).
-
 ### Quantized (simulated)
 
 - For simulated quantization with any of the default quantization formats:

@@ -552,6 +552,44 @@ def get_original_hf_quant_method(config) -> str | None:
     return None
 
 
+def _resolve_init_config(hf_config, auto_model_module, ckpt_path, config_kwargs):
+    """Re-derive a built-in config when a remote-code config is used with a built-in model
+    class, so it matches the model definition's version; fall back to hf_config otherwise.
+    """
+    if auto_model_module in [AutoModelForCausalLM, AutoModel]:
+        return hf_config
+    if not type(hf_config).__module__.startswith("transformers_modules"):
+        return hf_config
+    builtin_config_kwargs = {k: v for k, v in config_kwargs.items() if k != "trust_remote_code"}
+    try:
+        return AutoConfig.from_pretrained(ckpt_path, **builtin_config_kwargs)
+    except Exception as e:
+        warnings.warn(
+            f"Could not re-derive a built-in config for {ckpt_path} ({e}); using the "
+            "remote-code config for device-map inference."
+        )
+        return hf_config
+
+
+def _get_config_dtype(config):
+    config_dtype = (
+        getattr(config, "dtype", None) or getattr(config, "torch_dtype", None) or torch.bfloat16
+    )
+    if isinstance(config_dtype, str):
+        config_dtype = getattr(torch, config_dtype)
+    return config_dtype
+
+
+def _apply_dtype_to_config(model_kwargs, config_dtype, architecture, apply_config_dtype=False):
+    model_kwargs = model_kwargs.copy()
+    if "DeciLM" in architecture:
+        model_kwargs["torch_dtype"] = config_dtype
+        model_kwargs.pop("dtype", None)
+    elif apply_config_dtype:
+        model_kwargs["dtype"] = config_dtype
+    return model_kwargs
+
+
 def get_model(
     ckpt_path,
     device="cuda",
@@ -701,16 +739,21 @@ def has_pack_quantized_config(config):
                 auto_model_module = getattr(transformers, architecture)
                 from_config = auto_model_module._from_config
 
+            config_for_init = _resolve_init_config(
+                hf_config, auto_model_module, ckpt_path, config_kwargs
+            )
+
             with init_empty_weights(include_buffers=True):
                 # When computing the device_map, assuming bfloat16 precision by default,
                 # unless specified by the hf_config.
-                torch_dtype = getattr(hf_config, "torch_dtype", torch.bfloat16)
-                model_kwargs2 = model_kwargs.copy()
+                config_dtype = _get_config_dtype(config_for_init)
+                model_kwargs2 = _apply_dtype_to_config(
+                    model_kwargs, config_dtype, architecture, apply_config_dtype=True
+                )
                 if auto_model_module not in [AutoModelForCausalLM, AutoModel]:
                     model_kwargs2.pop("trust_remote_code", None)
-                model_kwargs2["dtype"] = torch_dtype
                 model_kwargs2.pop("max_memory", None)
-                model = from_config(hf_config, **model_kwargs2)
+                model = from_config(config_for_init, **model_kwargs2)
 
             max_memory = get_max_memory()
             inferred_device_map = infer_auto_device_map(model, max_memory=max_memory)
@@ -730,10 +773,11 @@ def has_pack_quantized_config(config):
                 )
                 model_kwargs["max_memory"] = max_memory
 
+            model_kwargs2 = _apply_dtype_to_config(model_kwargs, config_dtype, architecture)
             model = auto_model_module.from_pretrained(
                 ckpt_path,
                 device_map=device_map,
-                **model_kwargs,
+                **model_kwargs2,
             )
     model.eval()
     if has_pack_quantized_config(hf_config):

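To make the three dtype-handling cases of `_apply_dtype_to_config` concrete, the sketch below restates the function from the hunk above and calls it once per case. The string `"bfloat16"` is only a stand-in for `torch.bfloat16` so the example runs without PyTorch, and the `base` kwargs dict is a made-up input.

```python
def apply_dtype_to_config(model_kwargs, config_dtype, architecture, apply_config_dtype=False):
    """Mirror of `_apply_dtype_to_config` from the diff, restated for illustration."""
    model_kwargs = model_kwargs.copy()  # never mutate the caller's dict
    if "DeciLM" in architecture:
        # DeciLM receives `torch_dtype`, and any `dtype` key is removed.
        model_kwargs["torch_dtype"] = config_dtype
        model_kwargs.pop("dtype", None)
    elif apply_config_dtype:
        # Used on the empty-weights init path that computes the device map.
        model_kwargs["dtype"] = config_dtype
    return model_kwargs


base = {"trust_remote_code": True, "dtype": "auto"}  # hypothetical caller kwargs
BF16 = "bfloat16"  # string stand-in for torch.bfloat16

deci = apply_dtype_to_config(base, BF16, "DeciLMForCausalLM")
init = apply_dtype_to_config(base, BF16, "LlamaForCausalLM", apply_config_dtype=True)
load = apply_dtype_to_config(base, BF16, "LlamaForCausalLM")

print(deci)  # {'trust_remote_code': True, 'torch_dtype': 'bfloat16'}
print(init)  # {'trust_remote_code': True, 'dtype': 'bfloat16'}
print(load)  # {'trust_remote_code': True, 'dtype': 'auto'}
```

The third call corresponds to the final `from_pretrained` call in the diff: for non-DeciLM architectures the caller's kwargs pass through unchanged, so only the device-map init forces the config dtype.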
@@ -256,18 +256,7 @@ The plot shows how token accuracy changes with different compression rates. High
 
 ## Evaluation
 
-Evaluate AnyModel checkpoints using [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness) directly.
-
-```bash
-python examples/llm_eval/lm_eval_hf.py \
-   --model hf \
-   --model_args pretrained=path/to/checkpoint,dtype=bfloat16,parallelize=True \
-   --tasks mmlu \
-   --num_fewshot 5 \
-   --batch_size 4
-```
-
-For a quick smoke test, add `--limit 10`.
+Evaluate AnyModel checkpoints using lm-eval. See the [LM-Eval-Harness section](../llm_eval/README.md#lm-eval-harness) in `examples/llm_eval/README.md` for full instructions, including multi-GPU and Slurm setup.
 
 > **Alternative:** For server-based evaluation via an OpenAI-compatible endpoint,
 > see [evaluation/nemo_evaluator_instructions.md](./evaluation/nemo_evaluator_instructions.md).

@@ -3,7 +3,7 @@ autocast_dtype: torch.bfloat16 # dtype for torch.autocast for validate_model
 block_size: 8192
 bos_rate: 0.5
 data_column: messages
-val_dataset_name: valid
+val_dataset_name: validation
 shuffle_seed: 81436
 seed: 42
 fim_rate: 0

@@ -3,7 +3,7 @@ autocast_dtype: torch.bfloat16 # dtype for torch.autocast for validate_model
 block_size: 8192
 bos_rate: 0.5
 data_column: messages
-val_dataset_name: valid
+val_dataset_name: validation
 shuffle_seed: 81436
 seed: 42
 fim_rate: 0

@@ -3,7 +3,7 @@ autocast_dtype: torch.bfloat16 # dtype for torch.autocast for validate_model
 block_size: 8192
 bos_rate: 0.5
 data_column: messages
-val_dataset_name: valid
+val_dataset_name: validation
 shuffle_seed: 81436
 seed: 42
 fim_rate: 0

@@ -3,7 +3,7 @@ autocast_dtype: torch.bfloat16 # dtype for torch.autocast for validate_model
 block_size: 8192
 bos_rate: 0.5
 data_column: messages
-val_dataset_name: valid
+val_dataset_name: validation
 shuffle_seed: 81436
 seed: 42
 fim_rate: 0

@@ -3,7 +3,7 @@ autocast_dtype: torch.bfloat16 # dtype for torch.autocast for validate_model
 block_size: 8192
 bos_rate: 0.5
 data_column: messages
-val_dataset_name: valid
+val_dataset_name: validation
 shuffle_seed: 81436
 seed: 42
 fim_rate: 0

@@ -3,7 +3,7 @@ autocast_dtype: torch.bfloat16 # dtype for torch.autocast for validate_model
 block_size: 8192
 bos_rate: 0.5
 data_column: messages
-val_dataset_name: valid
+val_dataset_name: validation
 shuffle_seed: 81436
 seed: 42
 fim_rate: 0

@@ -122,8 +122,8 @@ source venv/bin/activate
 pip3 install .
 
 # Verify installation
-tensorrt-edgellm-quantize-llm --help
-tensorrt-edgellm-export-llm --help
+tensorrt-edgellm-quantize --help
+tensorrt-edgellm-export --help
 ```
 
 **System requirements:**
@@ -137,76 +137,67 @@ tensorrt-edgellm-export-llm --help
 
 | Tool | Purpose |
 | :--- | :--- |
-| `tensorrt-edgellm-quantize-llm` | Quantize LLM models using ModelOpt (FP8, INT4 AWQ, NVFP4) |
-| `tensorrt-edgellm-export-llm` | Export LLM to ONNX with precision-specific optimizations |
-| `tensorrt-edgellm-export-visual` | Export visual encoders for multimodal VLM models |
-| `tensorrt-edgellm-quantize-draft` | Quantize EAGLE draft models for speculative decoding |
-| `tensorrt-edgellm-export-draft` | Export EAGLE draft models to ONNX |
+| `tensorrt-edgellm-quantize` | Quantize models using ModelOpt (FP8, INT4 AWQ, NVFP4); subcommands: `llm`, `draft` |
+| `tensorrt-edgellm-export` | Export a quantized or FP16/BF16 checkpoint to ONNX; auto-detects VLM and audio components |
 | `tensorrt-edgellm-insert-lora` | Insert LoRA patterns into existing ONNX models |
 | `tensorrt-edgellm-process-lora` | Process LoRA adapter weights for runtime loading |
 
 ### Example: Quantize and Export an LLM
 
 ```bash
 # Step 1: Quantize with ModelOpt
-tensorrt-edgellm-quantize-llm \
+tensorrt-edgellm-quantize llm \
     --model_dir Qwen/Qwen2.5-3B-Instruct \
     --quantization fp8 \
     --output_dir quantized/qwen2.5-3b-fp8
 
 # Step 2: Export to ONNX
-tensorrt-edgellm-export-llm \
-    --model_dir quantized/qwen2.5-3b-fp8 \
-    --output_dir onnx_models/qwen2.5-3b
+tensorrt-edgellm-export \
+    quantized/qwen2.5-3b-fp8 \
+    onnx_models/qwen2.5-3b
 ```
 
 ### Example: Quantize and Export a VLM
 
 ```bash
-# Quantize the language model component
-tensorrt-edgellm-quantize-llm \
+# Quantize with ModelOpt (handles both LLM and visual components)
+tensorrt-edgellm-quantize llm \
     --model_dir Qwen/Qwen2.5-VL-3B-Instruct \
     --quantization fp8 \
     --output_dir quantized/qwen2.5-vl-3b
 
-# Export the language model
-tensorrt-edgellm-export-llm \
-    --model_dir quantized/qwen2.5-vl-3b \
-    --output_dir onnx_models/qwen2.5-vl-3b/llm
-
-# Export the visual encoder
-tensorrt-edgellm-export-visual \
-    --model_dir Qwen/Qwen2.5-VL-3B-Instruct \
-    --output_dir onnx_models/qwen2.5-vl-3b/visual
+# Export to ONNX (auto-detects VLM and exports LLM + visual encoder to separate subdirs)
+tensorrt-edgellm-export \
+    quantized/qwen2.5-vl-3b \
+    onnx_models/qwen2.5-vl-3b
 ```
 
 ### Example: EAGLE Speculative Decoding
 
 ```bash
 # Quantize base model
-tensorrt-edgellm-quantize-llm \
+tensorrt-edgellm-quantize llm \
     --model_dir meta-llama/Llama-3.1-8B-Instruct \
     --quantization fp8 \
     --output_dir quantized/llama3.1-8b-base
 
 # Export base model with EAGLE flag
-tensorrt-edgellm-export-llm \
-    --model_dir quantized/llama3.1-8b-base \
-    --output_dir onnx_models/llama3.1-8b/base \
-    --is_eagle_base
+tensorrt-edgellm-export \
+    quantized/llama3.1-8b-base \
+    onnx_models/llama3.1-8b/base \
+    --eagle-base
 
 # Quantize EAGLE draft model
-tensorrt-edgellm-quantize-draft \
+tensorrt-edgellm-quantize draft \
     --base_model_dir meta-llama/Llama-3.1-8B-Instruct \
     --draft_model_dir EAGLE3-LLaMA3.1-Instruct-8B \
     --quantization fp8 \
     --output_dir quantized/llama3.1-8b-draft
 
 # Export draft model
-tensorrt-edgellm-export-draft \
-    --draft_model_dir quantized/llama3.1-8b-draft \
-    --base_model_dir meta-llama/Llama-3.1-8B-Instruct \
-    --output_dir onnx_models/llama3.1-8b/draft
+tensorrt-edgellm-export \
+    quantized/llama3.1-8b-draft \
+    onnx_models/llama3.1-8b/draft
 ```
 
 ### Quantization Methods

@@ -136,8 +136,10 @@ def convert_to_mixed_precision(
     graph_sanitizer.sanitize()
     model = graph_sanitizer.model
 
-    # Setup internal mappings
-    model = onnx_utils.infer_types(model, use_standalone_type_inference)
+    # Setup internal mappings. Run shape inference in strict mode so that an op ONNX cannot
+    # resolve raises an exception (triggering infer_types' standalone type-inference fallback)
+    # instead of silently leaving tensors untyped, which would break later type lookups.
+    model = onnx_utils.infer_types(model, use_standalone_type_inference, strict_mode=True)
     value_info_map, initializer_map, node_to_init_map = utils.setup_mappings(model)
 
     # Automatically add 'trt' to list of providers if custom ops are detected
@@ -267,8 +269,10 @@ def convert_to_f16(
     sanitizer.convert_fp64_to_fp32()
     model = sanitizer.model
 
-    # Setup internal mappings
-    model = onnx_utils.infer_types(model, use_standalone_type_inference)
+    # Setup internal mappings. Run shape inference in strict mode so that an op ONNX cannot
+    # resolve raises an exception (triggering infer_types' standalone type-inference fallback)
+    # instead of silently leaving tensors untyped, which would break later type lookups.
+    model = onnx_utils.infer_types(model, use_standalone_type_inference, strict_mode=True)
     value_info_map, initializer_map, node_to_init_map = utils.setup_mappings(model)
 
     precision_converter = PrecisionConverter(
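
The strict-then-fallback flow that the comments in these two hunks rely on can be sketched generically. The diff does not show the body of `onnx_utils.infer_types`, so this is an assumption about the control flow only: `infer_with_fallback` and the toy callables are hypothetical, and in real use the strict step might be a call such as `onnx.shape_inference.infer_shapes(model, strict_mode=True)`.

```python
def infer_with_fallback(model, strict_infer, fallback_infer):
    """Try strict inference first; if it raises, use the fallback instead.

    Returns the inferred model together with the path that produced it.
    """
    try:
        return strict_infer(model), "strict"
    except Exception as exc:
        # Strict mode turns an unresolved op into an exception, so the failure
        # is visible here instead of leaving tensors silently untyped.
        print(f"strict inference failed ({exc}); using fallback")
        return fallback_infer(model), "fallback"


# Toy stand-ins: a dict plays the role of the ONNX model.
def strict_fails(model):
    raise RuntimeError("TopK: cannot infer shape")


def strict_ok(model):
    return {**model, "typed": True}


def standalone(model):
    return {**model, "typed": True, "shapes": "partial"}


result, path = infer_with_fallback({"name": "m"}, strict_fails, standalone)
print(path)  # fallback
```

Without strict mode the first step would return normally even when an op is unresolved, so the fallback would never run; that is the failure mode the new `strict_mode=True` argument is meant to avoid.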