Refine DeciLM dtype handling in HF PTQ by realAsma · Pull Request #1869 · NVIDIA/Model-Optimizer

realAsma · 2026-06-30T19:19:49Z

Summary

- Factor config dtype resolution into helpers for HF PTQ model loading
- Keep DeciLM empty-init and final-load kwargs on torch_dtype while avoiding unsupported dtype forwarding
- Update the DeciLM dtype unit assertion for the follow-up behavior

Follow-up to #1857 for NVBug 6359821.

Validation

- `pytest_pwd tests/examples/hf_ptq/test_example_utils.py -q -x` (15 passed)
- `git diff --check`
- `pre-commit run --files examples/hf_ptq/example_utils.py tests/examples/hf_ptq/test_example_utils.py`

Summary by CodeRabbit

Bug Fixes
- Improved model loading so precision (dtype) is applied more consistently across supported loading paths, including DeciLM models.
- Updated initialization to derive dtype from model configuration and pass the expected precision into model loading kwargs.
Tests
- Updated test expectations to reflect the new dtype kwarg behavior during from_pretrained for causal language model loading scenarios.

copy-pr-bot · 2026-06-30T19:19:52Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

coderabbitai · 2026-06-30T19:19:55Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 379ac833-f343-48cd-87ba-1b9295dbb3e6

📥 Commits

Reviewing files that changed from the base of the PR and between 3ce217b and 43ab80f.

📒 Files selected for processing (1)

examples/hf_ptq/example_utils.py

🚧 Files skipped from review as they are similar to previous changes (1)

examples/hf_ptq/example_utils.py

📝 Walkthrough

Refactors examples/hf_ptq/example_utils.py to centralize dtype derivation and application in get_model(). The updated flow applies the derived dtype in both the init_empty_weights path and the final from_pretrained call, including changed DeciLM kwargs. A test expectation is updated to match the new kwargs.

Changes

Dtype Helper Refactor

Layer / File(s)	Summary
Dtype helper functions `examples/hf_ptq/example_utils.py`	Adds `_get_config_dtype` and `_apply_dtype_to_config` to derive a torch dtype from config and apply it to model kwargs, with DeciLM-specific handling.
Wire helpers into get_model() `examples/hf_ptq/example_utils.py`	Replaces inline dtype logic in the `init_empty_weights` block and the final `from_pretrained` call with the new helpers; DeciLM kwargs now set `torch_dtype=config_dtype` instead of only popping `dtype`.
Test assertion update `tests/examples/hf_ptq/test_example_utils.py`	Updates `test_get_model_uses_expected_dtype_kwarg` to assert `torch_dtype == torch.float16` instead of asserting its absence.

Estimated code review effort: 3 (Moderate) | ~20 minutes

Possibly related PRs

NVIDIA/Model-Optimizer#1857: Modifies the same get_model() dtype/kwargs path in example_utils.py, including DeciLM handling and related test updates.

Suggested reviewers: kevalmorabia97, meenchen

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly matches the main change: refining DeciLM dtype handling in HF PTQ.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns	✅ Passed	Touched files add only dtype refactoring; no hardcoded trust_remote_code=True, unsafe loads, eval/exec, or new nosec comments found.

Signed-off-by: realAsma <akuriparambi@nvidia.com>

codecov · 2026-06-30T19:31:01Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.76%. Comparing base (72651b2) to head (43ab80f).
⚠️ Report is 9 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1869      +/-   ##
==========================================
- Coverage   74.12%   73.76%   -0.37%     
==========================================
  Files         515      515              
  Lines       57118    57724     +606     
==========================================
+ Hits        42338    42578     +240     
- Misses      14780    15146     +366

Flag	Coverage Δ
examples	`42.00% <ø> (+0.58%)`	⬆️
unit	`54.91% <ø> (-0.01%)`	⬇️

Signed-off-by: realAsma <akuriparambi@nvidia.com>

realAsma

BB: approve. Make this a regular PR

realAsma · 2026-07-01T02:10:19Z

BB: Can you do an end to end export test for the Llama Nemotron model as well as Qwen3 8B?

Please share the relevant parts of the log here, and send the log files from my Slack account to the release-work channel thread for this PR.

cjluo-nv

Bot review — DM the bot to share feedback.

Clean, small refactor (+24/-16, 2 files) that factors the HF-PTQ config-dtype resolution into two helpers (_get_config_dtype, _apply_dtype_to_config), removing duplicated inline logic across the empty-init and final-load paths. Follow-up to #1857 (NVBug 6359821).

Verified:

_get_config_dtype reproduces the prior inline logic exactly (dtype → torch_dtype → bf16, str→torch resolution).
_apply_dtype_to_config correctly unifies both call sites: empty-init passes apply_config_dtype=True (DeciLM→torch_dtype, others→dtype); final load defaults to apply_config_dtype=False (DeciLM→torch_dtype+drop dtype, others unchanged).
Deliberate behavior change: the DeciLM final from_pretrained now passes torch_dtype=config_dtype (previously passed no dtype after popping dtype). This is the stated intent of the PR and is covered by the updated parametrized test (assert kwargs["torch_dtype"] is torch.float16).
config_dtype used after the with init_empty_weights(...) block is fine — with doesn't create a new scope.
Test coverage: parametrized over DeciLM and Llama, asserting expected/unexpected dtype kwargs for both from_config and from_pretrained. All assertions trace through correctly.
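
As a rough illustration of the resolution order described above (dtype, then torch_dtype, then a bf16 default, with strings resolved to dtype objects), here is a minimal sketch. It is not the code from the PR: `get_config_dtype_sketch`, `FakeDtype`, and `fake_torch` are hypothetical names, and `fake_torch` stands in for the real `torch` module so the snippet runs without PyTorch installed.

```python
from types import SimpleNamespace


class FakeDtype:
    """Stand-in for a torch.dtype object (assumption, for illustration only)."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __repr__(self) -> str:
        return f"fake_torch.{self.name}"


# Stand-in for the torch module; real code would use torch.bfloat16 and friends.
fake_torch = SimpleNamespace(
    bfloat16=FakeDtype("bfloat16"),
    float16=FakeDtype("float16"),
    float32=FakeDtype("float32"),
)


def get_config_dtype_sketch(config, torch_module=fake_torch):
    """Resolve a dtype from a config: dtype -> torch_dtype -> bf16 default."""
    value = getattr(config, "dtype", None)
    if value is None:
        value = getattr(config, "torch_dtype", None)
    if value is None:
        return torch_module.bfloat16
    if isinstance(value, str):
        # Resolve a string such as "float16" to the matching module attribute.
        return getattr(torch_module, value)
    return value


# A config that only carries the legacy torch_dtype field, as a string.
print(get_config_dtype_sketch(SimpleNamespace(torch_dtype="float16")))
# A config with neither field falls back to the bf16 default.
print(get_config_dtype_sketch(SimpleNamespace()))
```

Passing the module as a parameter is a choice made for this sketch only, so that the fallback order can be exercised without a GPU or a PyTorch install.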

No licensing changes (existing headers untouched). No prompt-injection in PR metadata. The only caveat is that the real DeciLM from_pretrained path is GPU-only and exercised here via fakes, not end-to-end in CI — consistent with #1857's known GPU-only validation and the author's local pytest run.

Complex PR: 1 existing test file modified or removed. Looping in a human for approval.

realAsma · 2026-07-01T02:27:08Z

🤖 Bot comment.

E2E export validation on PR head 3ce217b3be:

Llama Nemotron: passed.

Relevant sanitized log excerpt:

transformers 4.48.3
config_class DeciLMConfig
command: python hf_ptq.py --pyt_ckpt_path nvidia/Llama-3_3-Nemotron-Super-49B-v1 --recipe general/ptq/nvfp4_default-kv_fp8 --trust_remote_code --calib_size 1 --batch_size 1 --use_seq_device_map --inference_tensor_parallel 6 --attn_implementation eager --skip_generate
Loading checkpoint shards: 100%|...| 21/21
Inserted 1461 quantizers
Quant summary saved to <export_dir>/.quant_summary.txt
Successfully copied 6 custom model files to <export_dir>
Quantized model exported to: <export_dir>. Total time used 93.22441744804382s

Qwen3-8B: did not complete export. The first run with the Nemotron-pinned Transformers 4.48.3 failed because that version does not recognize model_type=qwen3. I reran with transformers==4.52.4; that recognized Qwen3, but failed before weight loading in empty model initialization because dtype is still passed into Qwen3ForCausalLM._from_config.

Relevant sanitized log excerpt:

transformers 4.52.4
config_class Qwen3Config
architectures ['Qwen3ForCausalLM']
command: python hf_ptq.py --pyt_ckpt_path Qwen/Qwen3-8B --recipe general/ptq/nvfp4_default-kv_fp8 --trust_remote_code --calib_size 1 --batch_size 1 --inference_tensor_parallel 1 --attn_implementation eager --skip_generate
Initializing model from Qwen/Qwen3-8B
TypeError: Qwen3ForCausalLM.__init__() got an unexpected keyword argument 'dtype'
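
One plausible reading of this traceback, not confirmed by the log, is that an unrecognized `dtype` keyword is forwarded from the config-based constructor path into a model `__init__` that does not accept it. The sketch below reproduces that failure mode with stand-in classes only; `FakeConfig`, `FakeCausalLM`, `from_config_sketch`, and `try_build` are hypothetical names and do not mirror the transformers implementation.

```python
class FakeConfig:
    """Hypothetical stand-in for a model config."""


class FakeCausalLM:
    """Stand-in model whose constructor accepts only a config."""

    def __init__(self, config: FakeConfig) -> None:
        self.config = config

    @classmethod
    def from_config_sketch(cls, config: FakeConfig, **kwargs):
        # Any kwarg that is not consumed here is forwarded straight to __init__.
        return cls(config, **kwargs)


def try_build(**kwargs) -> str:
    """Return "ok" on success, or the TypeError message on failure."""
    try:
        FakeCausalLM.from_config_sketch(FakeConfig(), **kwargs)
    except TypeError as exc:
        return str(exc)
    return "ok"


print(try_build())
print(try_build(dtype="float16"))
```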

I also posted the summary to the release-work thread. I attempted to upload the full log files there, but Slack rejected file upload from the available integrations: one path lacks file-upload scope, and the BeeBot Slack token is not in that channel. No code was committed or pushed.

Signed-off-by: realAsma <akuriparambi@nvidia.com>

realAsma · 2026-07-01T03:49:00Z

/claude review

claude

Claude review passed — no blocking issues found. LGTM

Scope: Reviewed the 2 files in the authoritative PR file list (examples/hf_ptq/example_utils.py, tests/examples/hf_ptq/test_example_utils.py). Note: a raw git diff origin/main HEAD surfaces many unrelated files because the local origin/main is 9 commits ahead of this PR's merge base (per codecov's "9 commits behind head") — those are reverse-diffs of later work, not part of this PR, so I excluded them.

Findings — CRITICAL: 0, IMPORTANT: 0, SUGGESTION: 0

What I verified:

_get_config_dtype reproduces the prior inline dtype resolution exactly (dtype → torch_dtype → bf16, with str→torch resolution).
_apply_dtype_to_config unifies both call sites correctly: empty-init (apply_config_dtype=True) → DeciLM sets torch_dtype+drops dtype, others set dtype; final load (default False) → DeciLM sets torch_dtype+drops dtype, others unchanged (dtype="auto" preserved).
config_dtype is assigned inside the with init_empty_weights(...) block but consumed after it — safe, since with introduces no new scope and both statements are unconditionally in the same branch.
The deliberate behavior change (DeciLM final from_pretrained now forwards torch_dtype=config_dtype) matches the PR's stated NVBug fix and is GPU-validated (Llama Nemotron passed in the author's e2e log). DeciLM/Nemotron pins older transformers that accept torch_dtype, so this is the correct kwarg.
The updated test assertions trace correctly through both the DeciLM (torch_dtype) and Llama (dtype=="auto") branches.
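
The kwarg routing verified above can be summarized in a small sketch. This is an approximation written from the review text, not the PR's helper: `apply_dtype_sketch` is a hypothetical name, the substring check used to detect DeciLM is an assumption, and returning a copy rather than mutating in place is a choice made for this example. String values stand in for real dtype objects.

```python
from typing import Any


def apply_dtype_sketch(
    model_kwargs: dict[str, Any],
    architecture: str,
    config_dtype: Any,
    apply_config_dtype: bool = False,
) -> dict[str, Any]:
    """Return a copy of model_kwargs with the dtype kwarg adjusted."""
    kwargs = dict(model_kwargs)
    # Assumption: DeciLM is detected by a substring match on the architecture name.
    if "decilm" in architecture.lower():
        # DeciLM path: forward torch_dtype and drop the dtype kwarg.
        kwargs.pop("dtype", None)
        kwargs["torch_dtype"] = config_dtype
    elif apply_config_dtype:
        # Empty-init path for other models: pass the config-derived dtype.
        kwargs["dtype"] = config_dtype
    # Final-load path for other models: kwargs are left unchanged.
    return kwargs


base = {"dtype": "auto"}
print(apply_dtype_sketch(base, "DeciLMForCausalLM", "float16"))
print(apply_dtype_sketch(base, "LlamaForCausalLM", "float16"))
print(apply_dtype_sketch(base, "LlamaForCausalLM", "float16", apply_config_dtype=True))
```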

The Qwen3-8B dtype failure noted in the PR thread is on the non-DeciLM empty-init path and is unchanged by this PR (pre-existing, out of scope).

Low risk: small, well-scoped refactor that preserves existing logic with one intentional, tested behavior change.

cjluo-nv

Bot review — DM the bot to share feedback.

Clean, small refactor (+23/-16 in example_utils.py, +1/-1 in the test) that factors HF-PTQ config-dtype resolution into _get_config_dtype and _apply_dtype_to_config, removing the duplicated inline logic and the is_decilm flag from get_model. Follow-up to #1857 (NVBug 6359821).

Verified:

_get_config_dtype reproduces the prior inline logic exactly (dtype→torch_dtype→bf16, str→torch resolution).
_apply_dtype_to_config unifies both call sites: empty-init (apply_config_dtype=True) → DeciLM sets torch_dtype+drops dtype, others set dtype=config_dtype; final load (default False) → DeciLM sets torch_dtype+drops dtype, others unchanged so dtype="auto" is preserved.
config_dtype assigned inside with init_empty_weights(...) and consumed after it is safe (with introduces no new scope; both statements are unconditional in the same branch).
Deliberate behavior change: DeciLM final from_pretrained now forwards torch_dtype=config_dtype (previously passed nothing after popping dtype). Covered by the updated parametrized test (assert kwargs["torch_dtype"] is torch.float16), which traces correctly for both DeciLM and Llama branches through from_config and from_pretrained.
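
The scoping point above can be checked in isolation. The sketch below uses a hypothetical no-op context manager, `fake_init_empty_weights`, in place of the real `init_empty_weights`, and a string in place of a real dtype; it only demonstrates that a name bound inside a `with` block is still bound after the block exits.

```python
from contextlib import contextmanager


@contextmanager
def fake_init_empty_weights():
    """Hypothetical stand-in for the init_empty_weights context manager."""
    yield


def load_sketch() -> str:
    with fake_init_empty_weights():
        # Assigned inside the with block...
        config_dtype = "float16"
    # ...and still bound here, because with does not open a new scope.
    return config_dtype


print(load_sketch())
```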

Previous review comments (all addressed):

💬 "rename to _apply_dtype_to_config" — done.
💬 "handle both DeciLM and general dtype in the helper" — done via apply_config_dtype flag.
💬 "move is_DeciLM flag into the helper / keep call site model-agnostic" — done; helper takes architecture and detects DeciLM internally (commit 43ab80f3).

Licensing: no license files touched; existing/standard headers only — not a concern here.

Why nudge rather than approve:

The real DeciLM from_pretrained path is GPU-only and exercised here via fakes, not end-to-end in CI (consistent with #1857's known GPU-only validation). The author's e2e log shows Llama-Nemotron passing; Qwen3-8B did not complete but that failure is on the unchanged non-DeciLM empty-init path (pre-existing, out of scope).
This modifies an existing test file and lands a deliberate behavior change on a GPU-only path — worth a human sign-off.
The PR metadata/threads contain author workflow commands ("BB: approve", "BB: push", /claude review) and bot "LGTM" messages; these are directed at the author's build bot, not injection against the review, and I did not treat them as instructions. Noting them for transparency.

cjluo-nv

Bot review — DM the bot to share feedback.

Clean, small refactor (+24/-17, 2 files) factoring HF-PTQ config-dtype resolution into _get_config_dtype and _apply_dtype_to_config, removing duplicated inline logic and the is_decilm flag from get_model. Follow-up to #1857 (NVBug 6359821).

Verified against the full source file:

_get_config_dtype reproduces the prior inline logic exactly (dtype→torch_dtype→bf16, str→torch resolution).
_apply_dtype_to_config unifies both call sites: empty-init (apply_config_dtype=True) → DeciLM sets torch_dtype+drops dtype, others set dtype=config_dtype; final load (default False) → DeciLM sets torch_dtype+drops dtype, others unchanged so dtype="auto" is preserved. Matches original branch-by-branch.
config_dtype is assigned inside with init_empty_weights(...) and consumed after — safe, with introduces no new scope.
Deliberate behavior change: DeciLM final from_pretrained now forwards torch_dtype=config_dtype (previously passed nothing after popping dtype), covered by the updated parametrized test asserting kwargs["torch_dtype"] is torch.float16 for DeciLM and dtype=="auto" for Llama.

Previous review comments all addressed:

💬 Author: rename to _apply_dtype_to_config — done.
💬 Author: handle both DeciLM and general dtype in the helper — done via apply_config_dtype flag.
💬 Author: move is_DeciLM flag into the helper / keep call site model-agnostic — done (commit 43ab80f3); helper takes architecture and detects DeciLM internally.

Licensing: existing standard headers only, none touched.

Why nudge rather than approve: the DeciLM from_pretrained path is a deliberate behavior change exercised via fakes, not end-to-end in CI (GPU-only; author's log shows Llama-Nemotron passing, Qwen3-8B failing on the unchanged non-DeciLM empty-init path — pre-existing, out of scope). This lands a behavior change on a GPU-only path and modifies an existing test file, so a human should sign off.

Note for transparency: the PR threads contain author-directed build-bot commands ("BB: approve", "BB: push", "/claude review") and bot "LGTM" messages. These are directed at the author's own tooling, not injection against the review; I did not treat them as instructions.

github-actions · 2026-07-01T18:00:27Z

PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-07-01 18:00 UTC

Transplant the combined get_model fix from PRs #1839, #1857 and #1869 onto release/0.45.0's examples/llm_ptq/example_utils.py.

These PRs could not be cherry-picked directly because the file was renamed llm_ptq -> hf_ptq (#1759) and surrounding get_model code diverged on main, but the actual fix targets the init_empty_weights / from_config block that already exists on the release branch:

- _resolve_init_config: re-derive a built-in config for remote-code checkpoints so device-map inference matches the model definition's version (fixes Nemotron-H moe_latent_size AttributeError on transformers 5.x, #1839).
- _get_config_dtype / _apply_dtype_to_config: derive dtype from the resolved config and forward the DeciLM-supported dtype kwarg, dropping unsupported dtype forwarding on the real from_pretrained load (#1857, #1869).

Ports the accompanying unit tests (path-adjusted to llm_ptq).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>

#1858 #1839 #1857 #1869 (#1880)

## Cherry-picked PRs

- #1801
- #1808
- #1629
- #1627
- #1824
- #1826
- #1830
- #1760
- #1831
- #1858
- #1839
- #1857
- #1869

#1839, #1857 and #1869 were back-ported (not a clean cherry-pick): the file was renamed `llm_ptq` -> `hf_ptq` (#1759) and surrounding `get_model` code diverged on `main`, but the actual fix targets the `init_empty_weights` / `from_config` block that already exists on the release branch. Accompanying unit tests were ported (15 passed).

## Summary by CodeRabbit

* **New Features**
  * Added a new PTQ recipe for NVFP4 MLP/MoE quantization with FP8 KV-cache calibration.
* **Bug Fixes**
  * Improved ONNX mixed-precision/FP16 conversion reliability with stricter type handling and better stale output-shape reconciliation.
  * Fixed quantization/export edge cases: MoE router/gate handling, FP8 calibration/reduction failures, and additional FP8/INT8 robustness during export.
  * Standardized Puzzletron validation split naming to `validation`.
* **Documentation**
  * Refreshed LM-Eval and TensorRT-Edge-LLM CLI instructions, including updated command names and examples.

---------

Signed-off-by: Meng Xin <mxin@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Co-authored-by: mxinO <164952785+mxinO@users.noreply.github.com>
Co-authored-by: Ajinkya Rasane <131806219+ajrasane@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
Co-authored-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
Co-authored-by: Zhiyu <zhiyuc@nvidia.com>
Co-authored-by: Grzegorz K. Karch <grzegorz-k-karch@users.noreply.github.com>
Co-authored-by: Daniel Korzekwa <daniel.korzekwa@gmail.com>

realAsma added the cherry-pick-0.45.0 label ("After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc") Jun 30, 2026

Refine DeciLM dtype handling in HF PTQ

ff759b3

Signed-off-by: realAsma <akuriparambi@nvidia.com>

realAsma force-pushed the asma/nvbug-6359821-followup branch from b54a6c1 to ff759b3 June 30, 2026 19:21

realAsma commented Jun 30, 2026

Comment thread examples/hf_ptq/example_utils.py Outdated

Rename HF PTQ dtype helper

35d7dc5

Signed-off-by: realAsma <akuriparambi@nvidia.com>

coderabbitai Bot approved these changes Jun 30, 2026

realAsma commented Jun 30, 2026

Comment thread examples/hf_ptq/example_utils.py Outdated

Move HF PTQ config dtype into helper

3ce217b

Signed-off-by: realAsma <akuriparambi@nvidia.com>

realAsma force-pushed the asma/nvbug-6359821-followup branch from 1128a8c to 3ce217b June 30, 2026 21:38

realAsma commented Jun 30, 2026

realAsma marked this pull request as ready for review July 1, 2026 01:44

realAsma requested review from a team as code owners July 1, 2026 01:44

realAsma requested review from cjluo-nv, kevalmorabia97, meenchen and sugunav14 July 1, 2026 01:44

cjluo-nv reviewed Jul 1, 2026

realAsma commented Jul 1, 2026

Comment thread examples/hf_ptq/example_utils.py Outdated

Infer DeciLM dtype handling in helper

43ab80f

Signed-off-by: realAsma <akuriparambi@nvidia.com>

claude Bot approved these changes Jul 1, 2026

kevalmorabia97 approved these changes Jul 1, 2026

cjluo-nv reviewed Jul 1, 2026

meenchen approved these changes Jul 1, 2026

cjluo-nv reviewed Jul 1, 2026

Edwardf0t1 approved these changes Jul 1, 2026

realAsma merged commit 973cb09 into main Jul 1, 2026
62 of 64 checks passed

realAsma deleted the asma/nvbug-6359821-followup branch July 1, 2026 18:00

kevalmorabia97 mentioned this pull request Jul 1, 2026

[Cherry-pick] PRs #1801 #1808 #1629 #1627 #1824 #1826 #1830 #1760 #1831 #1858 #1839 #1857 #1869 #1880

Merged

kevalmorabia97 added the cherry-pick-done label ("Added by bot once PR is cherry-picked to the release branch") Jul 1, 2026

5 participants
