Merged
3 changes: 3 additions & 0 deletions CHANGELOG.rst
@@ -81,13 +81,16 @@ Changelog

**Bug Fixes**

- Fix ``ShapeInferenceError`` during ONNX INT8 + FP16 quantization (``--high_precision_dtype fp16``) of weakly-typed models (e.g. TensorFlow exports) that carry stale rank-0 ``graph.output`` shapes or ops such as ``TopK`` that ONNX's static shape inference cannot resolve. ``clear_stale_value_info`` now reconciles stale output shapes via symbolic shape inference (keeping every output's shape field populated), and AutoCast runs ONNX shape inference in strict mode and falls back to schema-based standalone type inference when it fails, so unresolved ops no longer leave tensors untyped.
- Always list unquantized MoE routers/gates in the exported ``exclude_modules`` (NVBug 5718750). ``get_quant_config`` only recorded modules that carry a quantizer, but on ``transformers>=5.0`` MoE routers are no longer ``nn.Linear`` (e.g. ``TopKRouter``) and never receive one, so the BF16 router weight was written to the checkpoint yet omitted from ``exclude_modules``. vLLM / SGLang then treated it as quantized and failed to load (e.g. Qwen3-30B-A3B NVFP4: ``AssertionError: Tried to load weights of size [128, 2048] to a parameter of size [128, 1024]``). Routers are now detected structurally (an MoE block with an ``experts`` container plus a weight-bearing ``gate`` / ``router`` / ``shared_expert_gate`` submodule) and recorded as unquantized regardless of quantizer attachment.
- In Megatron-Core, sync amax for routed expert weights across EP ranks only when ``sync_expert_weight_amax=True``. Previously, EP amax sync covered routed expert weights even when ``sync_expert_weight_amax`` was ``False``.
- Fix Megatron-Core HF importer to load fused ``TELayerNormColumnParallelLinear.layer_norm_weight`` from HF for GPT-family models (Qwen3 etc.) under ``--export-default-te-spec``. Importer now prefers per-context keys ``fused_input_layernorm`` / ``fused_pre_mlp_layernorm`` (fallback ``fused_norm`` for Nemotron-H backward compatibility); ``mcore_qwen.py`` provides the new rules. Without this fix, post-prune MMLU sat at chance.
- Fix ONNX AutoCast ``keep_io_types=True`` sanity-check failure (``Unexpected type in I/O tensor ...``) when a network input/output is an empty tensor (a dimension of size 0). Such tensors were "fake-cast" (retyped in place) to the low precision type; because the value-info map aliases the ``graph.input``/``graph.output`` ``ValueInfoProto``, this silently changed the model's I/O type. AutoCast now inserts a real ``Cast`` for protected I/O tensors instead.
- Fix INT8 entropy calibration of fp16 ONNX models raising ``ValueError: Too many bins for data range`` on numpy >= 2.0. ``_collect_value`` in ``modelopt.onnx.quantization.ort_patching`` now casts the histogram range endpoints to Python float so bin edges are computed in float64, instead of inheriting the fp16 dtype of an activation tensor with a small range (which collapsed the 128-bin linspace under NEP-50 promotion).
- Fix the GPT-OSS MXFP4 → NVFP4 PTQ path in ``examples/llm_ptq/hf_ptq.py`` (used with ``--cast_mxfp4_to_nvfp4``). ``get_model`` now loads native MXFP4 checkpoints (``openai/gpt-oss-*``) dequantized to BF16 ``GptOssExperts`` via ``Mxfp4Config(dequantize=True)`` on a sequential device map. This fixes a CUDA illegal-memory access during the multi-GPU dequant load and the ``NotImplementedError`` for experts type ``Mxfp4GptOssExperts`` during unified HF export (the packed-kernel experts wrapper, used when the optional ``kernels`` package is installed, is unsupported by export); ``kernels`` is no longer required. The ``--cast_mxfp4_to_nvfp4`` step now also resolves a HF Hub ID ``--pyt_ckpt_path`` to its local snapshot directory instead of failing with ``FileNotFoundError``.
- Fix ``_QuantGptOssExperts`` / ``_QuantLlama4TextExperts`` static-block NVFP4 weight calibration raising ``ValueError: Input shape has changed`` during the calibration forward. These experts quantize their weights transposed (``_transposed_quantize``); ``iter_weights_for_calibration`` now yields the same transposed view so weight-only calibration and the forward agree on the block-quant shape (and the export ``_amax`` orientation).
- Fix unified HF checkpoint export for Llama4 MoE models. The uncalibrated-experts input-quantizer ``amax`` fallback in ``_export_transformers_checkpoint`` special-cased only ``QuantGptOssExperts``; ``QuantLlama4TextExperts`` uses the same fused ``gate_up_proj`` / ``down_proj`` layout and is now handled by the same branch, fixing the export failure.
- Fix ``NotImplementedError: "max_all_cuda" not implemented for 'Float8_e4m3fn'`` during quantization calibration of models with natively FP8 (``float8_e4m3fn`` / ``float8_e5m2``) weights, such as DeepSeek-V3. FP8 dtypes implement no reduction (``max``/``amax``), ``abs``, or elementwise ``maximum`` kernels, so ``reduce_amax`` now upcasts FP8 inputs to the default float dtype before reducing; the upcast is lossless and only affects the FP8 path.

0.44 (2026-05-14)
^^^^^^^^^^^^^^^^^
22 changes: 6 additions & 16 deletions examples/llm_eval/README.md
@@ -16,18 +16,24 @@ The supported eval tasks are [here](https://github.com/EleutherAI/lm-evaluation-

### Baseline

Both standard HuggingFace models and heterogeneous pruned checkpoints produced by Puzzletron are supported.

- For models which fit on a single GPU:

```sh
python lm_eval_hf.py --model hf --model_args pretrained=<HF model folder or model card> --tasks <comma separated tasks> --batch_size 4
```

For a quick smoke test, add `--limit 10` to any of the commands in this section to evaluate on only 10 samples.

- With model-sharding (for models which require multiple GPUs):

```sh
python lm_eval_hf.py --model hf --model_args pretrained=<HF model folder or model card>,parallelize=True --tasks <comma separated tasks> --batch_size 4
```

> **Note (Slurm interactive nodes):** On Slurm interactive nodes, `WORLD_SIZE` is set to the number of available GPUs in the shell environment. Running `python` directly causes `lm_eval` to hang waiting for peer ranks that were never spawned. Prepend `WORLD_SIZE=1` to the `python` commands above to fix this. This does not limit GPU usage — `parallelize=True` independently enables model parallelism across all available GPUs within the single process. The `accelerate launch` command manages `WORLD_SIZE` itself and does not require this workaround.

- For data-parallel evaluation with model-sharding:

With the following command, the model is sharded across `total_num_of_available_gpus/num_copies_of_your_model` GPUs, with a data parallelism of `num_copies_of_your_model`:
@@ -40,22 +46,6 @@ accelerate launch --multi_gpu --num_processes <num_copies_of_your_model> \
--batch_size 4
```
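
If the evaluation is started from a Python driver script rather than from the shell, the `WORLD_SIZE=1` workaround from the Slurm note can be applied by overriding the child process environment. The following is only a sketch under that assumption: `single_process_env` and `launch` are hypothetical helper names, and the command simply mirrors the model-sharding invocation above with its placeholders left in place.

```python
import os
import subprocess


def single_process_env(base_env=None):
    """Return a copy of the environment with WORLD_SIZE forced to "1"."""
    env = dict(os.environ if base_env is None else base_env)
    env["WORLD_SIZE"] = "1"
    return env


# Placeholders are kept as in the README; substitute a real model and task list.
CMD = [
    "python", "lm_eval_hf.py",
    "--model", "hf",
    "--model_args", "pretrained=<HF model folder or model card>,parallelize=True",
    "--tasks", "<comma separated tasks>",
    "--batch_size", "4",
]


def launch():
    """Run the evaluation as a single process; raises if the command fails."""
    subprocess.run(CMD, env=single_process_env(), check=True)
```

Calling `launch()` leaves the parent shell's `WORLD_SIZE` untouched, because only the copied environment passed to the child process is modified.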

### Heterogeneous Pruned Checkpoints (Puzzletron)

Heterogeneous pruned checkpoints produced by Puzzletron are automatically detected and loaded with the appropriate model patcher. No additional flags are needed beyond specifying the checkpoint path:

```sh
python lm_eval_hf.py --model hf \
--model_args pretrained=path/to/anymodel/checkpoint,dtype=bfloat16,parallelize=True \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 4
```

For a quick smoke test, add `--limit 10`.

> **Note:** Requires the `puzzletron` extra to be installed (`pip install -e ".[puzzletron]"`).

### Quantized (simulated)

- For simulated quantization with any of the default quantization formats:
54 changes: 49 additions & 5 deletions examples/llm_ptq/example_utils.py
@@ -552,6 +552,44 @@ def get_original_hf_quant_method(config) -> str | None:
return None


def _resolve_init_config(hf_config, auto_model_module, ckpt_path, config_kwargs):
    """Re-derive a built-in config when a remote-code config is used with a built-in model
    class, so it matches the model definition's version; fall back to hf_config otherwise.
    """
    if auto_model_module in [AutoModelForCausalLM, AutoModel]:
        return hf_config
    if not type(hf_config).__module__.startswith("transformers_modules"):
        return hf_config
    builtin_config_kwargs = {k: v for k, v in config_kwargs.items() if k != "trust_remote_code"}
    try:
        return AutoConfig.from_pretrained(ckpt_path, **builtin_config_kwargs)
    except Exception as e:
        warnings.warn(
            f"Could not re-derive a built-in config for {ckpt_path} ({e}); using the "
            "remote-code config for device-map inference."
        )
        return hf_config
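
The remote-code test in `_resolve_init_config` depends only on the module in which the config class was defined: classes loaded via `trust_remote_code` live under the dynamically created `transformers_modules` package. A stdlib-only sketch of that check follows. `is_remote_code_config`, `RemoteConfig`, and `BuiltinConfig` are hypothetical stand-ins invented for illustration, not part of the PR or of `transformers`.

```python
def is_remote_code_config(config) -> bool:
    """Same module-name test as in the diff, wrapped in a hypothetical helper."""
    return type(config).__module__.startswith("transformers_modules")


# Stand-in classes; in practice the config would come from AutoConfig.from_pretrained.
RemoteConfig = type(
    "RemoteConfig", (), {"__module__": "transformers_modules.acme.configuration_acme"}
)
BuiltinConfig = type(
    "BuiltinConfig", (), {"__module__": "transformers.models.llama.configuration_llama"}
)

print(is_remote_code_config(RemoteConfig()))   # True
print(is_remote_code_config(BuiltinConfig()))  # False
```

Note that `"transformers."` does not match the `"transformers_modules"` prefix, so built-in configs are returned unchanged.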


def _get_config_dtype(config):
    config_dtype = (
        getattr(config, "dtype", None) or getattr(config, "torch_dtype", None) or torch.bfloat16
    )
    if isinstance(config_dtype, str):
        config_dtype = getattr(torch, config_dtype)
    return config_dtype


def _apply_dtype_to_config(model_kwargs, config_dtype, architecture, apply_config_dtype=False):
    model_kwargs = model_kwargs.copy()
    if "DeciLM" in architecture:
        model_kwargs["torch_dtype"] = config_dtype
        model_kwargs.pop("dtype", None)
    elif apply_config_dtype:
        model_kwargs["dtype"] = config_dtype
    return model_kwargs
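
To see what `_apply_dtype_to_config` does to the keyword arguments, the sketch below copies its body from the diff and runs it on a made-up `base` dictionary. The dtype is represented by the plain string `"bfloat16"` so the example needs no `torch` import; the architecture names and the `"dtype": "auto"` entry are assumptions chosen only to exercise each branch.

```python
def apply_dtype_to_config(model_kwargs, config_dtype, architecture, apply_config_dtype=False):
    # Body copied from the diff; it never inspects the dtype object itself,
    # so a string can stand in for torch.bfloat16 here.
    model_kwargs = model_kwargs.copy()
    if "DeciLM" in architecture:
        model_kwargs["torch_dtype"] = config_dtype
        model_kwargs.pop("dtype", None)
    elif apply_config_dtype:
        model_kwargs["dtype"] = config_dtype
    return model_kwargs


base = {"trust_remote_code": True, "dtype": "auto"}  # hypothetical starting kwargs

deci = apply_dtype_to_config(base, "bfloat16", "DeciLMForCausalLM")
init = apply_dtype_to_config(base, "bfloat16", "LlamaForCausalLM", apply_config_dtype=True)
load = apply_dtype_to_config(base, "bfloat16", "LlamaForCausalLM")

print(deci)  # {'trust_remote_code': True, 'torch_dtype': 'bfloat16'}
print(init)  # {'trust_remote_code': True, 'dtype': 'bfloat16'}
print(load)  # {'trust_remote_code': True, 'dtype': 'auto'}
```

The input dictionary is never mutated, which is why the diff can call the helper once for the empty-weights `from_config` pass and again for `from_pretrained`.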


def get_model(
ckpt_path,
device="cuda",
@@ -701,16 +739,21 @@ def has_pack_quantized_config(config):
auto_model_module = getattr(transformers, architecture)
from_config = auto_model_module._from_config

config_for_init = _resolve_init_config(
hf_config, auto_model_module, ckpt_path, config_kwargs
)

with init_empty_weights(include_buffers=True):
# When computing the device_map, assuming bfloat16 precision by default,
# unless specified by the hf_config.
torch_dtype = getattr(hf_config, "torch_dtype", torch.bfloat16)
model_kwargs2 = model_kwargs.copy()
config_dtype = _get_config_dtype(config_for_init)
model_kwargs2 = _apply_dtype_to_config(
model_kwargs, config_dtype, architecture, apply_config_dtype=True
)
if auto_model_module not in [AutoModelForCausalLM, AutoModel]:
model_kwargs2.pop("trust_remote_code", None)
model_kwargs2["dtype"] = torch_dtype
model_kwargs2.pop("max_memory", None)
model = from_config(hf_config, **model_kwargs2)
model = from_config(config_for_init, **model_kwargs2)

max_memory = get_max_memory()
inferred_device_map = infer_auto_device_map(model, max_memory=max_memory)
@@ -730,10 +773,11 @@ def has_pack_quantized_config(config):
)
model_kwargs["max_memory"] = max_memory

model_kwargs2 = _apply_dtype_to_config(model_kwargs, config_dtype, architecture)
model = auto_model_module.from_pretrained(
ckpt_path,
device_map=device_map,
**model_kwargs,
**model_kwargs2,
)
model.eval()
if has_pack_quantized_config(hf_config):
13 changes: 1 addition & 12 deletions examples/puzzletron/README.md
@@ -256,18 +256,7 @@ The plot shows how token accuracy changes with different compression rates. High

## Evaluation

Evaluate AnyModel checkpoints using [lm-eval](https://github.com/EleutherAI/lm-evaluation-harness) directly.

```bash
python examples/llm_eval/lm_eval_hf.py \
--model hf \
--model_args pretrained=path/to/checkpoint,dtype=bfloat16,parallelize=True \
--tasks mmlu \
--num_fewshot 5 \
--batch_size 4
```

For a quick smoke test, add `--limit 10`.
Evaluate AnyModel checkpoints using lm-eval. See the [LM-Eval-Harness section](../llm_eval/README.md#lm-eval-harness) in `examples/llm_eval/README.md` for full instructions, including multi-GPU and Slurm setup.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Keep the Puzzletron prerequisite here or in the linked docs.

The removed subsection was the only place that told readers to install the puzzletron extra. Without that note, Puzzletron users can miss the optional dependency and end up on the no-op import path.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/puzzletron/README.md` at line 259, The Puzzletron setup instructions
are missing the prerequisite to install the optional `puzzletron` extra, which
can leave users on the no-op import path. Restore that prerequisite note in this
README or in the linked lm-eval docs, near the Puzzletron/AnyModel evaluation
guidance, so readers know to install the extra before following the `lm-eval`
workflow.


> **Alternative:** For server-based evaluation via an OpenAI-compatible endpoint,
> see [evaluation/nemo_evaluator_instructions.md](./evaluation/nemo_evaluator_instructions.md).
@@ -3,7 +3,7 @@ autocast_dtype: torch.bfloat16 # dtype for torch.autocast for validate_model
block_size: 8192
bos_rate: 0.5
data_column: messages
val_dataset_name: valid
val_dataset_name: validation
shuffle_seed: 81436
seed: 42
fim_rate: 0
@@ -3,7 +3,7 @@ autocast_dtype: torch.bfloat16 # dtype for torch.autocast for validate_model
block_size: 8192
bos_rate: 0.5
data_column: messages
val_dataset_name: valid
val_dataset_name: validation
shuffle_seed: 81436
seed: 42
fim_rate: 0
@@ -3,7 +3,7 @@ autocast_dtype: torch.bfloat16 # dtype for torch.autocast for validate_model
block_size: 8192
bos_rate: 0.5
data_column: messages
val_dataset_name: valid
val_dataset_name: validation
shuffle_seed: 81436
seed: 42
fim_rate: 0
@@ -3,7 +3,7 @@ autocast_dtype: torch.bfloat16 # dtype for torch.autocast for validate_model
block_size: 8192
bos_rate: 0.5
data_column: messages
val_dataset_name: valid
val_dataset_name: validation
shuffle_seed: 81436
seed: 42
fim_rate: 0
@@ -3,7 +3,7 @@ autocast_dtype: torch.bfloat16 # dtype for torch.autocast for validate_model
block_size: 8192
bos_rate: 0.5
data_column: messages
val_dataset_name: valid
val_dataset_name: validation
shuffle_seed: 81436
seed: 42
fim_rate: 0
@@ -3,7 +3,7 @@ autocast_dtype: torch.bfloat16 # dtype for torch.autocast for validate_model
block_size: 8192
bos_rate: 0.5
data_column: messages
val_dataset_name: valid
val_dataset_name: validation
shuffle_seed: 81436
seed: 42
fim_rate: 0
55 changes: 23 additions & 32 deletions examples/torch_onnx/README.md
@@ -122,8 +122,8 @@ source venv/bin/activate
pip3 install .

# Verify installation
tensorrt-edgellm-quantize-llm --help
tensorrt-edgellm-export-llm --help
tensorrt-edgellm-quantize --help
tensorrt-edgellm-export --help
```

**System requirements:**
@@ -137,76 +137,67 @@ tensorrt-edgellm-export-llm --help

| Tool | Purpose |
| :--- | :--- |
| `tensorrt-edgellm-quantize-llm` | Quantize LLM models using ModelOpt (FP8, INT4 AWQ, NVFP4) |
| `tensorrt-edgellm-export-llm` | Export LLM to ONNX with precision-specific optimizations |
| `tensorrt-edgellm-export-visual` | Export visual encoders for multimodal VLM models |
| `tensorrt-edgellm-quantize-draft` | Quantize EAGLE draft models for speculative decoding |
| `tensorrt-edgellm-export-draft` | Export EAGLE draft models to ONNX |
| `tensorrt-edgellm-quantize` | Quantize models using ModelOpt (FP8, INT4 AWQ, NVFP4); subcommands: `llm`, `draft` |
| `tensorrt-edgellm-export` | Export quantized or FP16/BF16 checkpoint to ONNX; auto-detects VLM and audio components |
| `tensorrt-edgellm-insert-lora` | Insert LoRA patterns into existing ONNX models |
| `tensorrt-edgellm-process-lora` | Process LoRA adapter weights for runtime loading |

### Example: Quantize and Export an LLM

```bash
# Step 1: Quantize with ModelOpt
tensorrt-edgellm-quantize-llm \
tensorrt-edgellm-quantize llm \
--model_dir Qwen/Qwen2.5-3B-Instruct \
--quantization fp8 \
--output_dir quantized/qwen2.5-3b-fp8

# Step 2: Export to ONNX
tensorrt-edgellm-export-llm \
--model_dir quantized/qwen2.5-3b-fp8 \
--output_dir onnx_models/qwen2.5-3b
tensorrt-edgellm-export \
quantized/qwen2.5-3b-fp8 \
onnx_models/qwen2.5-3b
```

### Example: Quantize and Export a VLM

```bash
# Quantize the language model component
tensorrt-edgellm-quantize-llm \
# Quantize with ModelOpt (handles both LLM and visual components)
tensorrt-edgellm-quantize llm \
--model_dir Qwen/Qwen2.5-VL-3B-Instruct \
--quantization fp8 \
--output_dir quantized/qwen2.5-vl-3b

# Export the language model
tensorrt-edgellm-export-llm \
--model_dir quantized/qwen2.5-vl-3b \
--output_dir onnx_models/qwen2.5-vl-3b/llm

# Export the visual encoder
tensorrt-edgellm-export-visual \
--model_dir Qwen/Qwen2.5-VL-3B-Instruct \
--output_dir onnx_models/qwen2.5-vl-3b/visual
# Export to ONNX (auto-detects VLM and exports LLM + visual encoder to separate subdirs)
tensorrt-edgellm-export \
quantized/qwen2.5-vl-3b \
onnx_models/qwen2.5-vl-3b
```

### Example: EAGLE Speculative Decoding

```bash
# Quantize base model
tensorrt-edgellm-quantize-llm \
tensorrt-edgellm-quantize llm \
--model_dir meta-llama/Llama-3.1-8B-Instruct \
--quantization fp8 \
--output_dir quantized/llama3.1-8b-base

# Export base model with EAGLE flag
tensorrt-edgellm-export-llm \
--model_dir quantized/llama3.1-8b-base \
--output_dir onnx_models/llama3.1-8b/base \
--is_eagle_base
tensorrt-edgellm-export \
quantized/llama3.1-8b-base \
onnx_models/llama3.1-8b/base \
--eagle-base

# Quantize EAGLE draft model
tensorrt-edgellm-quantize-draft \
tensorrt-edgellm-quantize draft \
--base_model_dir meta-llama/Llama-3.1-8B-Instruct \
--draft_model_dir EAGLE3-LLaMA3.1-Instruct-8B \
--quantization fp8 \
--output_dir quantized/llama3.1-8b-draft

# Export draft model
tensorrt-edgellm-export-draft \
--draft_model_dir quantized/llama3.1-8b-draft \
--base_model_dir meta-llama/Llama-3.1-8B-Instruct \
--output_dir onnx_models/llama3.1-8b/draft
tensorrt-edgellm-export \
quantized/llama3.1-8b-draft \
onnx_models/llama3.1-8b/draft
```

### Quantization Methods
12 changes: 8 additions & 4 deletions modelopt/onnx/autocast/convert.py
@@ -136,8 +136,10 @@ def convert_to_mixed_precision(
graph_sanitizer.sanitize()
model = graph_sanitizer.model

# Setup internal mappings
model = onnx_utils.infer_types(model, use_standalone_type_inference)
# Setup internal mappings. Use strict shape inference so an op ONNX cannot resolve surfaces
# as an exception (triggering infer_types' standalone type-inference fallback) instead of
# silently leaving tensors untyped, which would break later type lookups.
model = onnx_utils.infer_types(model, use_standalone_type_inference, strict_mode=True)
value_info_map, initializer_map, node_to_init_map = utils.setup_mappings(model)

# Automatically add 'trt' to list of providers if custom ops are detected
@@ -267,8 +269,10 @@ def convert_to_f16(
sanitizer.convert_fp64_to_fp32()
model = sanitizer.model

# Setup internal mappings
model = onnx_utils.infer_types(model, use_standalone_type_inference)
# Setup internal mappings. Use strict shape inference so an op ONNX cannot resolve surfaces
# as an exception (triggering infer_types' standalone type-inference fallback) instead of
# silently leaving tensors untyped, which would break later type lookups.
model = onnx_utils.infer_types(model, use_standalone_type_inference, strict_mode=True)
value_info_map, initializer_map, node_to_init_map = utils.setup_mappings(model)

precision_converter = PrecisionConverter(
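
The comments added in both `convert.py` hunks describe the same control flow: run the strict pass so that an unresolvable op raises, then let the exception trigger a fallback instead of leaving tensors silently untyped. The body of `infer_types` is not part of this diff, so the following is only a generic, stdlib-only sketch of that pattern. `infer_with_fallback`, `strict_infer`, and `fallback_infer` are hypothetical stand-ins, and a plain dict of tensor name to type stands in for an ONNX model.

```python
import warnings


def infer_with_fallback(model, strict_infer, fallback_infer):
    """Try the strict pass first; use the fallback only if it raises."""
    try:
        return strict_infer(model)
    except Exception as exc:
        warnings.warn(f"strict inference failed ({exc}); using fallback")
        return fallback_infer(model)


def strict_infer(model):
    # Stand-in for a strict pass: raise rather than leave a tensor untyped.
    untyped = [name for name, dtype in model.items() if dtype is None]
    if untyped:
        raise ValueError(f"cannot resolve: {untyped}")
    return dict(model)


def fallback_infer(model):
    # Stand-in for schema-based typing: fill every gap with a default type.
    return {name: dtype or "float32" for name, dtype in model.items()}


typed = infer_with_fallback(
    {"x": "float16", "topk_out": None}, strict_infer, fallback_infer
)
print(typed)  # {'x': 'float16', 'topk_out': 'float32'}
```

With a non-strict pass, `topk_out` would simply stay `None` and fail later at the first type lookup; raising early is what makes the fallback reachable.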