Refine DeciLM dtype handling in HF PTQ #1869

Merged
realAsma merged 4 commits into main from asma/nvbug-6359821-followup
Jul 1, 2026
Conversation

@realAsma

@realAsma realAsma commented Jun 30, 2026

Contributor

Summary

  • factor config dtype resolution into helpers for HF PTQ model loading
  • keep DeciLM empty-init and final-load kwargs on torch_dtype while avoiding unsupported dtype forwarding
  • update the DeciLM dtype unit assertion for the follow-up behavior

Follow-up to #1857 for NVBug 6359821.
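
A minimal sketch of what config dtype resolution of this kind could look like, assuming the fallback order the reviews in this thread describe (config `dtype`, then `torch_dtype`, then bfloat16) plus string-to-torch resolution. The function name `resolve_config_dtype`, the `fake_torch` stand-in, and the string parsing are illustrative assumptions, not the code in `example_utils.py`.

```python
from types import SimpleNamespace

# Stand-in for the torch module so the sketch runs without PyTorch installed.
# In real code these attributes would be torch.bfloat16, torch.float16, etc.
fake_torch = SimpleNamespace(
    bfloat16="torch.bfloat16",
    float16="torch.float16",
    float32="torch.float32",
)


def resolve_config_dtype(config, torch_mod=fake_torch):
    """Hypothetical helper: prefer config.dtype, then config.torch_dtype, else bf16."""
    value = getattr(config, "dtype", None)
    if value is None:
        value = getattr(config, "torch_dtype", None)
    if value is None:
        # Neither attribute is set: fall back to bfloat16.
        return torch_mod.bfloat16
    if isinstance(value, str):
        # Accept both "float16" and "torch.float16" spellings.
        return getattr(torch_mod, value.split(".")[-1])
    return value


decilm_like = SimpleNamespace(torch_dtype="float16")
resolved = resolve_config_dtype(decilm_like)
```

Keeping the resolution in one function means the empty-init path and the final load cannot drift apart on which attribute wins.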

Validation

  • pytest_pwd tests/examples/hf_ptq/test_example_utils.py -q -x (15 passed)
  • git diff --check
  • pre-commit run --files examples/hf_ptq/example_utils.py tests/examples/hf_ptq/test_example_utils.py

Summary by CodeRabbit

  • Bug Fixes

    • Improved model loading so precision (dtype) is applied more consistently across supported loading paths, including DeciLM models.
    • Updated initialization to derive dtype from model configuration and pass the expected precision into model loading kwargs.
  • Tests

    • Updated test expectations to reflect the new dtype kwarg behavior during from_pretrained for causal language model loading scenarios.

@copy-pr-bot

copy-pr-bot Bot commented Jun 30, 2026


Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 30, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 379ac833-f343-48cd-87ba-1b9295dbb3e6

📥 Commits

Reviewing files that changed from the base of the PR and between 3ce217b and 43ab80f.

📒 Files selected for processing (1)
  • examples/hf_ptq/example_utils.py
🚧 Files skipped from review as they are similar to previous changes (1)
  • examples/hf_ptq/example_utils.py

Walkthrough

Refactors examples/hf_ptq/example_utils.py to centralize dtype derivation and application in get_model(). The updated flow applies the derived dtype in both the init_empty_weights path and the final from_pretrained call, including changed DeciLM kwargs. A test expectation is updated to match the new kwargs.

Changes

Dtype Helper Refactor

| Layer / File(s) | Summary |
|---|---|
| Dtype helper functions (examples/hf_ptq/example_utils.py) | Adds _get_config_dtype and _apply_dtype_to_config to derive a torch dtype from config and apply it to model kwargs, with DeciLM-specific handling. |
| Wire helpers into get_model() (examples/hf_ptq/example_utils.py) | Replaces inline dtype logic in the init_empty_weights block and the final from_pretrained call with the new helpers; DeciLM kwargs now set torch_dtype=config_dtype instead of only popping dtype. |
| Test assertion update (tests/examples/hf_ptq/test_example_utils.py) | Updates test_get_model_uses_expected_dtype_kwarg to assert torch_dtype == torch.float16 instead of asserting its absence. |

Estimated code review effort: 3 (Moderate) | ~20 minutes

Possibly related PRs

  • NVIDIA/Model-Optimizer#1857: Modifies the same get_model() dtype/kwargs path in example_utils.py, including DeciLM handling and related test updates.

Suggested reviewers: kevalmorabia97, meenchen

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 16.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly matches the main change: refining DeciLM dtype handling in HF PTQ. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | Touched files add only dtype refactoring; no hardcoded trust_remote_code=True, unsafe loads, eval/exec, or new nosec comments found. |

@realAsma realAsma added the cherry-pick-0.45.0 label ("After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc") Jun 30, 2026
Signed-off-by: realAsma <akuriparambi@nvidia.com>
@realAsma realAsma force-pushed the asma/nvbug-6359821-followup branch from b54a6c1 to ff759b3 Compare June 30, 2026 19:21
Comment thread examples/hf_ptq/example_utils.py Outdated
@codecov

codecov Bot commented Jun 30, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 73.76%. Comparing base (72651b2) to head (43ab80f).
⚠️ Report is 9 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1869      +/-   ##
==========================================
- Coverage   74.12%   73.76%   -0.37%     
==========================================
  Files         515      515              
  Lines       57118    57724     +606     
==========================================
+ Hits        42338    42578     +240     
- Misses      14780    15146     +366     
| Flag | Coverage Δ |
|---|---|
| examples | 42.00% <ø> (+0.58%) ⬆️ |
| unit | 54.91% <ø> (-0.01%) ⬇️ |


Signed-off-by: realAsma <akuriparambi@nvidia.com>
Comment thread examples/hf_ptq/example_utils.py Outdated
Signed-off-by: realAsma <akuriparambi@nvidia.com>
@realAsma realAsma force-pushed the asma/nvbug-6359821-followup branch from 1128a8c to 3ce217b Compare June 30, 2026 21:38

@realAsma realAsma left a comment

Contributor Author

BB: approve. Make this a regular PR

@realAsma realAsma marked this pull request as ready for review July 1, 2026 01:44
@realAsma realAsma requested review from a team as code owners July 1, 2026 01:44
@realAsma

realAsma commented Jul 1, 2026

Contributor Author

BB: Can you do an end to end export test for the Llama Nemotron model as well as Qwen3 8B?

Please share the relevant parts of the log here. Please also send the log files from my Slack account to the release-work channel thread for this PR.

@cjluo-nv cjluo-nv left a comment

Collaborator

Bot review — DM the bot to share feedback.

Clean, small refactor (+24/-16, 2 files) that factors the HF-PTQ config-dtype resolution into two helpers (_get_config_dtype, _apply_dtype_to_config), removing duplicated inline logic across the empty-init and final-load paths. Follow-up to #1857 (NVBug 6359821).

Verified:

  • _get_config_dtype reproduces the prior inline logic exactly (dtype → torch_dtype → bf16, str→torch resolution).
  • _apply_dtype_to_config correctly unifies both call sites: empty-init passes apply_config_dtype=True (DeciLM→torch_dtype, others→dtype); final load defaults to apply_config_dtype=False (DeciLM→torch_dtype+drop dtype, others unchanged).
  • Deliberate behavior change: the DeciLM final from_pretrained now passes torch_dtype=config_dtype (previously passed no dtype after popping dtype). This is the stated intent of the PR and is covered by the updated parametrized test (assert kwargs["torch_dtype"] is torch.float16).
  • config_dtype used after the with init_empty_weights(...) block is fine — with doesn't create a new scope.
  • Test coverage: parametrized over DeciLM and Llama, asserting expected/unexpected dtype kwargs for both from_config and from_pretrained. All assertions trace through correctly.
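
The two call-site behaviors listed above can be sketched as a single function. This is an illustration under stated assumptions, not the PR's `_apply_dtype_to_config`: the name `apply_dtype_sketch`, its argument order, the copy-and-return style, and the substring check used to detect DeciLM are all hypothetical.

```python
def apply_dtype_sketch(model_kwargs, architecture, config_dtype, apply_config_dtype=False):
    """Hypothetical helper mirroring the reviewed behavior for both call sites."""
    kwargs = dict(model_kwargs)  # copy so the caller's dict is left untouched
    if "DeciLM" in architecture:  # assumption: detection by architecture name
        # DeciLM: drop the unsupported `dtype` kwarg and forward `torch_dtype`.
        kwargs.pop("dtype", None)
        kwargs["torch_dtype"] = config_dtype
    elif apply_config_dtype:
        # Empty-init path for other models: pass the config-derived dtype.
        kwargs["dtype"] = config_dtype
    # Final load for other models: kwargs unchanged, so dtype="auto" survives.
    return kwargs


final_decilm = apply_dtype_sketch({"dtype": "auto"}, "DeciLMForCausalLM", "float16")
final_llama = apply_dtype_sketch({"dtype": "auto"}, "LlamaForCausalLM", "float16")
init_llama = apply_dtype_sketch({}, "LlamaForCausalLM", "float16", apply_config_dtype=True)
```

Here the strings stand in for torch dtypes. The flag only matters for non-DeciLM models, since DeciLM gets the same treatment on both paths.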

No licensing changes (existing headers untouched). No prompt-injection in PR metadata. The only caveat is that the real DeciLM from_pretrained path is GPU-only and exercised here via fakes, not end-to-end in CI — consistent with #1857's known GPU-only validation and the author's local pytest run.
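
For readers unfamiliar with the fake-based approach mentioned above, the following sketch shows the general pattern: a stand-in class records the kwargs it receives so a test can assert on them without loading weights or needing a GPU. `FakeAutoModel`, `load_with_dtype`, and the `FLOAT16` sentinel are hypothetical names; the actual test in tests/examples/hf_ptq/test_example_utils.py may be structured differently.

```python
class FakeAutoModel:
    """Stand-in for a model auto class: records kwargs instead of loading weights."""

    calls = []

    @classmethod
    def from_pretrained(cls, path, **kwargs):
        cls.calls.append(kwargs)
        return object()


def load_with_dtype(auto_cls, path, architecture, config_dtype):
    """Hypothetical loader reproducing only the final-load kwarg handling."""
    kwargs = {"dtype": "auto"}
    if "DeciLM" in architecture:
        kwargs.pop("dtype")
        kwargs["torch_dtype"] = config_dtype
    return auto_cls.from_pretrained(path, **kwargs)


FLOAT16 = object()  # sentinel standing in for torch.float16

load_with_dtype(FakeAutoModel, "ckpt", "DeciLMForCausalLM", FLOAT16)
decilm_kwargs = FakeAutoModel.calls[-1]

load_with_dtype(FakeAutoModel, "ckpt", "LlamaForCausalLM", FLOAT16)
llama_kwargs = FakeAutoModel.calls[-1]
```

An identity check (`is`) on the sentinel mirrors the `is torch.float16` assertion quoted in the review: it verifies that the exact dtype object was forwarded, not merely an equal one.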

Complex PR: 1 existing test file modified or removed. Looping in a human for approval.

Comment thread examples/hf_ptq/example_utils.py Outdated
@realAsma

realAsma commented Jul 1, 2026

Contributor Author

🤖 Bot comment.

E2E export validation on PR head 3ce217b3be:

Llama Nemotron: passed.

Relevant sanitized log excerpt:

transformers 4.48.3
config_class DeciLMConfig
command: python hf_ptq.py --pyt_ckpt_path nvidia/Llama-3_3-Nemotron-Super-49B-v1 --recipe general/ptq/nvfp4_default-kv_fp8 --trust_remote_code --calib_size 1 --batch_size 1 --use_seq_device_map --inference_tensor_parallel 6 --attn_implementation eager --skip_generate
Loading checkpoint shards: 100%|...| 21/21
Inserted 1461 quantizers
Quant summary saved to <export_dir>/.quant_summary.txt
Successfully copied 6 custom model files to <export_dir>
Quantized model exported to: <export_dir>. Total time used 93.22441744804382s

Qwen3-8B: did not complete export. The first run with the Nemotron-pinned Transformers 4.48.3 failed because that version does not recognize model_type=qwen3. I reran with transformers==4.52.4; that recognized Qwen3, but failed before weight loading in empty model initialization because dtype is still passed into Qwen3ForCausalLM._from_config.

Relevant sanitized log excerpt:

transformers 4.52.4
config_class Qwen3Config
architectures ['Qwen3ForCausalLM']
command: python hf_ptq.py --pyt_ckpt_path Qwen/Qwen3-8B --recipe general/ptq/nvfp4_default-kv_fp8 --trust_remote_code --calib_size 1 --batch_size 1 --inference_tensor_parallel 1 --attn_implementation eager --skip_generate
Initializing model from Qwen/Qwen3-8B
TypeError: Qwen3ForCausalLM.__init__() got an unexpected keyword argument 'dtype'

I also posted the summary to the release-work thread. I attempted to upload the full log files there, but Slack rejected file upload from the available integrations: one path lacks file-upload scope, and the BeeBot Slack token is not in that channel. No code was committed or pushed.
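
The Qwen3 failure mode can be reproduced in miniature. The sketch below assumes, as the traceback suggests, that extra keyword arguments given to `_from_config` are forwarded unchanged to the model constructor. `FakeCausalLM` is a hypothetical stand-in and does not use transformers.

```python
class FakeCausalLM:
    """Stand-in model whose constructor, like the one in the log, accepts no dtype."""

    def __init__(self, config):
        self.config = config

    @classmethod
    def _from_config(cls, config, **kwargs):
        # Extra kwargs go straight to __init__, so an unsupported one fails there.
        return cls(config, **kwargs)


try:
    FakeCausalLM._from_config({"model_type": "qwen3"}, dtype="bfloat16")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Without the unsupported kwarg the same call succeeds.
model = FakeCausalLM._from_config({"model_type": "qwen3"})
```

Because the error is raised while constructing the empty model, it occurs before any checkpoint shard is read, which matches the log stopping right after "Initializing model".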

Signed-off-by: realAsma <akuriparambi@nvidia.com>
@realAsma

realAsma commented Jul 1, 2026

Contributor Author

/claude review

@claude claude Bot left a comment


Claude review passed — no blocking issues found. LGTM

Scope: Reviewed the 2 files in the authoritative PR file list (examples/hf_ptq/example_utils.py, tests/examples/hf_ptq/test_example_utils.py). Note: a raw git diff origin/main HEAD surfaces many unrelated files because the local origin/main is 9 commits ahead of this PR's merge base (per codecov's "9 commits behind head") — those are reverse-diffs of later work, not part of this PR, so I excluded them.

Findings — CRITICAL: 0, IMPORTANT: 0, SUGGESTION: 0

What I verified:

  • _get_config_dtype reproduces the prior inline dtype resolution exactly (dtype → torch_dtype → bf16, with str→torch resolution).
  • _apply_dtype_to_config unifies both call sites correctly: empty-init (apply_config_dtype=True) → DeciLM sets torch_dtype+drops dtype, others set dtype; final load (default False) → DeciLM sets torch_dtype+drops dtype, others unchanged (dtype="auto" preserved).
  • config_dtype is assigned inside the with init_empty_weights(...) block but consumed after it — safe, since with introduces no new scope and both statements are unconditionally in the same branch.
  • The deliberate behavior change (DeciLM final from_pretrained now forwards torch_dtype=config_dtype) matches the PR's stated NVBug fix and is GPU-validated (Llama Nemotron passed in the author's e2e log). DeciLM/Nemotron pins older transformers that accept torch_dtype, so this is the correct kwarg.
  • The updated test assertions trace correctly through both the DeciLM (torch_dtype) and Llama (dtype=="auto") branches.
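
The scoping point in the list above is plain Python behavior and can be checked in isolation. In this sketch `fake_init_empty_weights` is a hypothetical no-op context manager standing in for `init_empty_weights`; only the scoping is being demonstrated.

```python
from contextlib import contextmanager


@contextmanager
def fake_init_empty_weights():
    """No-op stand-in for a context manager such as init_empty_weights."""
    yield


def load_sketch():
    with fake_init_empty_weights():
        # Assigned inside the with block...
        config_dtype = "float16"
    # ...and still visible here: `with` does not open a new scope,
    # only functions, classes, and comprehensions do.
    return config_dtype


result = load_sketch()
```

The name would be unbound after the block only if the assignment were skipped, for example by an exception or a conditional, which is why the reviews note that both statements sit unconditionally in the same branch.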

The Qwen3-8B dtype failure noted in the PR thread is on the non-DeciLM empty-init path and is unchanged by this PR (pre-existing, out of scope).

Low risk: small, well-scoped refactor that preserves existing logic with one intentional, tested behavior change.

@cjluo-nv cjluo-nv left a comment

Collaborator

Bot review — DM the bot to share feedback.

Clean, small refactor (+23/-16 in example_utils.py, +1/-1 in the test) that factors HF-PTQ config-dtype resolution into _get_config_dtype and _apply_dtype_to_config, removing the duplicated inline logic and the is_decilm flag from get_model. Follow-up to #1857 (NVBug 6359821).

Verified:

  • _get_config_dtype reproduces the prior inline logic exactly (dtype → torch_dtype → bf16, str→torch resolution).
  • _apply_dtype_to_config unifies both call sites: empty-init (apply_config_dtype=True) → DeciLM sets torch_dtype+drops dtype, others set dtype=config_dtype; final load (default False) → DeciLM sets torch_dtype+drops dtype, others unchanged so dtype="auto" is preserved.
  • config_dtype assigned inside with init_empty_weights(...) and consumed after it is safe (with introduces no new scope; both statements are unconditional in the same branch).
  • Deliberate behavior change: DeciLM final from_pretrained now forwards torch_dtype=config_dtype (previously passed nothing after popping dtype). Covered by the updated parametrized test (assert kwargs["torch_dtype"] is torch.float16), which traces correctly for both DeciLM and Llama branches through from_config and from_pretrained.

Previous review comments (all addressed):

  • 💬 "rename to _apply_dtype_to_config" — done.
  • 💬 "handle both DeciLM and general dtype in the helper" — done via apply_config_dtype flag.
  • 💬 "move is_DeciLM flag into the helper / keep call site model-agnostic" — done; helper takes architecture and detects DeciLM internally (commit 43ab80f3).

Licensing: no license files touched; existing/standard headers only — not a concern here.

Why nudge rather than approve:

  • The real DeciLM from_pretrained path is GPU-only and exercised here via fakes, not end-to-end in CI (consistent with #1857's known GPU-only validation). The author's e2e log shows Llama-Nemotron passing; Qwen3-8B did not complete but that failure is on the unchanged non-DeciLM empty-init path (pre-existing, out of scope).
  • This modifies an existing test file and lands a deliberate behavior change on a GPU-only path — worth a human sign-off.
  • The PR metadata/threads contain author workflow commands ("BB: approve", "BB: push", /claude review) and bot "LGTM" messages; these are directed at the author's build bot, not injection against the review, and I did not treat them as instructions. Noting them for transparency.

@cjluo-nv cjluo-nv left a comment

Collaborator

Bot review — DM the bot to share feedback.

Clean, small refactor (+24/-17, 2 files) factoring HF-PTQ config-dtype resolution into _get_config_dtype and _apply_dtype_to_config, removing duplicated inline logic and the is_decilm flag from get_model. Follow-up to #1857 (NVBug 6359821).

Verified against the full source file:

  • _get_config_dtype reproduces the prior inline logic exactly (dtype → torch_dtype → bf16, str→torch resolution).
  • _apply_dtype_to_config unifies both call sites: empty-init (apply_config_dtype=True) → DeciLM sets torch_dtype+drops dtype, others set dtype=config_dtype; final load (default False) → DeciLM sets torch_dtype+drops dtype, others unchanged so dtype="auto" is preserved. Matches original branch-by-branch.
  • config_dtype is assigned inside with init_empty_weights(...) and consumed after — safe, with introduces no new scope.
  • Deliberate behavior change: DeciLM final from_pretrained now forwards torch_dtype=config_dtype (previously passed nothing after popping dtype), covered by the updated parametrized test asserting kwargs["torch_dtype"] is torch.float16 for DeciLM and dtype=="auto" for Llama.

Previous review comments all addressed:

  • 💬 Author: rename to _apply_dtype_to_config — done.
  • 💬 Author: handle both DeciLM and general dtype in the helper — done via apply_config_dtype flag.
  • 💬 Author: move is_DeciLM flag into the helper / keep call site model-agnostic — done (commit 43ab80f3); helper takes architecture and detects DeciLM internally.

Licensing: existing standard headers only, none touched.

Why nudge rather than approve: the DeciLM from_pretrained path is a deliberate behavior change exercised via fakes, not end-to-end in CI (GPU-only; author's log shows Llama-Nemotron passing, Qwen3-8B failing on the unchanged non-DeciLM empty-init path — pre-existing, out of scope). This lands a behavior change on a GPU-only path and modifies an existing test file, so a human should sign off.

Note for transparency: the PR threads contain author-directed build-bot commands ("BB: approve", "BB: push", "/claude review") and bot "LGTM" messages. These are directed at the author's own tooling, not injection against the review; I did not treat them as instructions.

@realAsma realAsma merged commit 973cb09 into main Jul 1, 2026
62 of 64 checks passed
@realAsma realAsma deleted the asma/nvbug-6359821-followup branch July 1, 2026 18:00
@github-actions

github-actions Bot commented Jul 1, 2026

Contributor
PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-07-01 18:00 UTC

kevalmorabia97 added a commit that referenced this pull request Jul 1, 2026
Transplant the combined get_model fix from PRs #1839, #1857 and #1869
onto release/0.45.0's examples/llm_ptq/example_utils.py. These PRs could
not be cherry-picked directly because the file was renamed
llm_ptq -> hf_ptq (#1759) and surrounding get_model code diverged on main,
but the actual fix targets the init_empty_weights / from_config block that
already exists on the release branch:

- _resolve_init_config: re-derive a built-in config for remote-code
  checkpoints so device-map inference matches the model definition's
  version (fixes Nemotron-H moe_latent_size AttributeError on transformers
  5.x, #1839).
- _get_config_dtype / _apply_dtype_to_config: derive dtype from the
  resolved config and forward the DeciLM-supported dtype kwarg, dropping
  unsupported dtype forwarding on the real from_pretrained load
  (#1857, #1869).

Ports the accompanying unit tests (path-adjusted to llm_ptq).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 added the cherry-pick-done label ("Added by bot once PR is cherry-picked to the release branch") Jul 1, 2026
kevalmorabia97 added a commit that referenced this pull request Jul 2, 2026
#1858 #1839 #1857 #1869 (#1880)

## Cherry-picked PRs

- #1801
- #1808
- #1629
- #1627
- #1824
- #1826
- #1830
- #1760
- #1831
- #1858
- #1839
- #1857
- #1869

#1839, #1857 and #1869 were back-ported (not a clean cherry-pick): the
file was
renamed `llm_ptq` -> `hf_ptq` (#1759) and surrounding `get_model` code
diverged on
`main`, but the actual fix targets the `init_empty_weights` /
`from_config` block that
already exists on the release branch. Accompanying unit tests were
ported (15 passed).

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
  * Added a new PTQ recipe for NVFP4 MLP/MoE quantization with FP8 KV-cache calibration.
* **Bug Fixes**
  * Improved ONNX mixed-precision/FP16 conversion reliability with stricter type handling and better stale output-shape reconciliation.
  * Fixed quantization/export edge cases: MoE router/gate handling, FP8 calibration/reduction failures, and additional FP8/INT8 robustness during export.
  * Standardized Puzzletron validation split naming to `validation`.
* **Documentation**
  * Refreshed LM-Eval and TensorRT-Edge-LLM CLI instructions, including updated command names and examples.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Meng Xin <mxin@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Co-authored-by: mxinO <164952785+mxinO@users.noreply.github.com>
Co-authored-by: Ajinkya Rasane <131806219+ajrasane@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
Co-authored-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
Co-authored-by: Zhiyu <zhiyuc@nvidia.com>
Co-authored-by: Grzegorz K. Karch <grzegorz-k-karch@users.noreply.github.com>
Co-authored-by: Daniel Korzekwa <daniel.korzekwa@gmail.com>

Labels

  • cherry-pick-0.45.0: After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc
  • cherry-pick-done: Added by bot once PR is cherry-picked to the release branch

5 participants