Fix HF PTQ empty-init dtype kwargs #1857

Merged
realAsma merged 10 commits into main from asma/nvbug-6359821
Jun 30, 2026

Conversation

@realAsma

@realAsma realAsma commented Jun 29, 2026

Contributor

Summary

Fixes NVBug 6359821: hf_ptq.py can fail for remote/custom architectures like DeciLMForCausalLM when dtype-related kwargs are forwarded into model construction paths that do not accept them.

This change keeps the fix scoped to the observed DeciLM/Llama Nemotron path. It resolves the init config used for empty-weight construction, derives dtype consistently from the resolved config, forwards the supported dtype kwarg for the DeciLM empty-weight probe, and drops unsupported dtype forwarding from the DeciLM real from_pretrained() load.
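A minimal sketch of the failure mode described above, using a hypothetical stand-in class rather than the real DeciLM code: a constructor that only knows `torch_dtype` rejects a forwarded `dtype` kwarg. The `TypeError` here is simply how plain Python reports an unexpected keyword; the actual error in the NVBug may look different.

```python
class LegacyRemoteModel:
    """Hypothetical stand-in for an older remote-code model class."""

    def __init__(self, config, torch_dtype=None):
        # Only the older `torch_dtype` keyword is accepted here.
        self.config = config
        self.torch_dtype = torch_dtype


def build(model_cls, config, **kwargs):
    """Forward kwargs verbatim, as a generic loader might."""
    return model_cls(config, **kwargs)


# Forwarding the newer keyword fails for the legacy constructor...
try:
    build(LegacyRemoteModel, config={}, dtype="bfloat16")
    failed = False
except TypeError:
    failed = True

# ...while the keyword it understands is accepted.
model = build(LegacyRemoteModel, config={}, torch_dtype="bfloat16")
```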

NVBug: https://nvbugspro.nvidia.com/bug/6359821

Validation

  • pre-commit run --files examples/hf_ptq/example_utils.py tests/examples/hf_ptq/test_example_utils.py
  • pytest_pwd tests/examples/hf_ptq/test_example_utils.py -q -x (15 passed)
  • Actual Llama-3_3-Nemotron-Super-49B-v1 end-to-end hf_ptq.py export on one node with 6 GPUs, Transformers 4.48.3; see #1857 (comment)

Summary by CodeRabbit

  • Bug Fixes

    • Improved model loading for Hugging Face remote-code scenarios by safely re-deriving the initialization configuration when needed, with a warning-based fallback.
    • Ensured precision is derived consistently from the resolved config (including dtype name handling) with a safe default when unspecified.
    • Tightened forwarding of precision-related kwargs and trust_remote_code, and avoided passing max_memory during config loading.
  • Tests

    • Added unit coverage for initialization config resolution (including failure fallback).
    • Extended integration-style coverage to validate dtype/kwarg forwarding, trust_remote_code behavior, and eval-mode initialization.
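
The dtype derivation summarized above (config `dtype`, then `torch_dtype`, then a default) might look roughly like the following. This is a sketch under assumptions: `resolve_dtype_name` is a hypothetical name, and it returns a dtype name as a string so that it runs without PyTorch installed; the real helper presumably returns a `torch.dtype`.

```python
from types import SimpleNamespace

DEFAULT_DTYPE_NAME = "bfloat16"  # the bf16 default mentioned in the reviews


def resolve_dtype_name(config):
    """Return a dtype name from config.dtype, then config.torch_dtype, else the default."""
    for attr in ("dtype", "torch_dtype"):
        value = getattr(config, attr, None)
        if value is not None:
            # Accept "bfloat16", "torch.bfloat16", or an object whose
            # str() looks like "torch.bfloat16".
            return str(value).split(".")[-1]
    return DEFAULT_DTYPE_NAME


# Example configs (plain namespaces standing in for Hugging Face configs).
name_from_dtype = resolve_dtype_name(SimpleNamespace(dtype="float16"))
name_from_legacy = resolve_dtype_name(SimpleNamespace(dtype=None, torch_dtype="torch.float32"))
name_default = resolve_dtype_name(SimpleNamespace())
```

With PyTorch available, the returned name could then be mapped to a dtype object via `getattr(torch, name)`.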

@realAsma realAsma added the cherry-pick-0.45.0 After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc label Jun 29, 2026
@copy-pr-bot

copy-pr-bot Bot commented Jun 29, 2026


Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 29, 2026

Contributor

Review Change Stack

Walkthrough

get_model now re-derives init configs for remote-code cases, computes dtype kwargs from the resolved config, adjusts architecture-specific loading kwargs, and updates tests for the new config and loading paths.

Changes

dtype kwarg fix in get_model

| Layer / File(s) | Summary |
| --- | --- |
| Init config resolution (examples/hf_ptq/example_utils.py) | _resolve_init_config(...) re-derives a built-in AutoConfig for remote-code configs used with built-in model classes, and falls back to the original config on failure. |
| Architecture init and pretrained kwargs (examples/hf_ptq/example_utils.py) | get_model uses config_for_init, derives config_dtype from dtype or torch_dtype, strips trust_remote_code for non-AutoModelForCausalLM/AutoModel loaders, applies DeciLM-specific torch_dtype handling, and passes a copied kwargs set to from_pretrained. |
| Tests for config resolution and kwargs (tests/examples/hf_ptq/test_example_utils.py) | Adds tests for _resolve_init_config and a parametrized get_model test covering dtype kwarg forwarding, trust_remote_code, max_memory, eval(), and model-class-specific loading behavior. |
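
The "re-derive, then fall back with a warning" behavior described for _resolve_init_config could be sketched as below. This is not the PR's implementation: `resolve_init_config` and the `rederive` callable are hypothetical stand-ins, with `rederive` representing whatever loads the built-in config.

```python
import warnings


def resolve_init_config(config, rederive):
    """Try to re-derive an init config; fall back to the original on failure.

    `rederive` is a hypothetical zero-argument callable standing in for the
    real config loader (for example, an AutoConfig call).
    """
    try:
        return rederive()
    except Exception as exc:  # broad on purpose: any failure triggers the fallback
        warnings.warn(f"Could not re-derive init config, using original: {exc}")
        return config
```

Taking the loader as a callable keeps the fallback logic testable without network access or a real checkpoint.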

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

  • NVIDIA/Model-Optimizer#1839: Shares the same get_model loading-path changes around config resolution, dtype selection, and trust_remote_code handling.

Suggested reviewers

  • claude
  • meenchen
🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 15.38% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly matches the main change: fixing HF PTQ dtype kwargs handling during empty-init and model loading. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | Touched HF PTQ code adds no hardcoded trust_remote_code=True, no bare eval/exec, no unsafe torch.load/numpy.load, and no # nosec; the lone True is only in a test call. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@realAsma realAsma left a comment

Contributor Author

BB: Did you do an end to end export test for the DeciLM model?

@realAsma

Contributor Author

BB:

did you see this: #1857 (review)

@realAsma

Contributor Author

🤖 Bot comment.

Replying to #1857 (review)

I saw it. Short answer: not a clean end-to-end export pass on the current PR head yet.

What I verified so far:

  • The actual DeciLM model run got past the original empty-init dtype constructor failure.
  • The current PR head still hit the same unsupported dtype kwarg later in the real from_pretrained path.
  • With a local DeciLM dtype-strip patch for that real-load path, the run loaded the model, selected general/ptq/nvfp4_default-kv_fp8, inserted quantizers, completed the calibration loop, and wrote the quant summary.
  • It then failed later during HF export/save with a separate DeciLM/Transformers tied-weight issue: _get_tied_weight_keys saw a list where it expected .keys().

So the dtype fix is not complete as currently pushed. I have local evidence for the extra real-load dtype patch, but export still has a separate follow-up failure after calibration.

@realAsma

Contributor Author

BB: okay ping me in slack when this is done.

realAsma added 4 commits June 30, 2026 02:48
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Signed-off-by: realAsma <akuriparambi@nvidia.com>
@realAsma realAsma force-pushed the asma/nvbug-6359821 branch from 0f50049 to 2641497 Compare June 30, 2026 02:48
@realAsma

Contributor Author

🤖 Bot comment.

Pushed the follow-up fix as 2641497a19.

This keeps the empty-init device-map path unchanged, but for the final real from_pretrained load it now drops dtype without synthesizing deprecated torch_dtype="auto". I also updated the focused test expectation.

Validation already completed on omniml-a9.nvidia.com with Transformers 4.48.3 against the actual Llama-3_3-Nemotron-Super-49B-v1 hf_ptq flow: all 21 shards loaded, 1461 quantizers inserted, calibration completed, .quant_summary.txt written, custom DeciLM files copied, and export succeeded. DCO is green on the updated PR head.

Comment thread examples/hf_ptq/example_utils.py Outdated
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Comment thread tests/examples/hf_ptq/test_example_utils.py
Signed-off-by: realAsma <akuriparambi@nvidia.com>

@realAsma realAsma left a comment

Contributor Author

BB: Approve. Make this a regular PR.

@realAsma

realAsma commented Jun 30, 2026

Contributor Author

Llama Nemotron3 end-to-end export validation

Fresh validation passed on the current PR head.

  • PR head: acb6e702e0877f08bbb4c76564be219601d3092c (origin/asma/nvbug-6359821)
  • Hardware: one node with 6 GPUs
  • Container: nvcr.io/nvidia/pytorch:26.03-py3
  • Python: 3.12.3
  • Torch: 2.11.0a0+a6c236b9fd.nv26.03.46836102
  • Transformers: 4.48.3
  • Model: nvidia/Llama-3_3-Nemotron-Super-49B-v1
  • Recipe: general/ptq/nvfp4_default-kv_fp8

hf_ptq.py command:

python hf_ptq.py \
  --pyt_ckpt_path nvidia/Llama-3_3-Nemotron-Super-49B-v1 \
  --recipe general/ptq/nvfp4_default-kv_fp8 \
  --export_path <export_dir> \
  --trust_remote_code \
  --calib_size 1 \
  --batch_size 1 \
  --use_seq_device_map \
  --inference_tensor_parallel 6 \
  --attn_implementation eager \
  --skip_generate

Result from the log:

HEAD is now at acb6e702e0 Fold HF PTQ dtype test cases
transformers 4.48.3
Loading checkpoint shards: 100%|██████████| 21/21
Inserted 1461 quantizers
Quant summary saved to <export_dir>/.quant_summary.txt
Successfully copied 6 custom model files to <export_dir>
Quantized model exported to: <export_dir>. Total time used 99.43220281600952s

Comment thread tests/examples/hf_ptq/test_example_utils.py Outdated
@realAsma realAsma marked this pull request as ready for review June 30, 2026 16:46
@realAsma realAsma requested review from a team as code owners June 30, 2026 16:46
@realAsma realAsma requested a review from cjluo-nv June 30, 2026 16:46
Signed-off-by: realAsma <akuriparambi@nvidia.com>

@coderabbitai coderabbitai Bot left a comment

Contributor

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/examples/hf_ptq/test_example_utils.py`:
- Around line 209-210: The DeciLM test is still branching on whether
`transformers.DeciLMForCausalLM` exists, so the behavior can drift with package
versions. In the DeciLM parametrization in `test_example_utils.py`, make
`get_model()` take the `AutoModelForCausalLM` path explicitly by stubbing or
deleting `example_utils.transformers.DeciLMForCausalLM` before the call, while
leaving the `LlamaForCausalLM` case unchanged. Use the `get_model` and
`example_utils.transformers` symbols to locate the affected test setup.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9e8119c2-de2e-4ede-9cd8-f91eb5b59553

📥 Commits

Reviewing files that changed from the base of the PR and between 72651b2 and acb6e70.

📒 Files selected for processing (2)
  • examples/hf_ptq/example_utils.py
  • tests/examples/hf_ptq/test_example_utils.py

Comment thread tests/examples/hf_ptq/test_example_utils.py
@realAsma realAsma requested a review from kevalmorabia97 June 30, 2026 17:01
@codecov

codecov Bot commented Jun 30, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.40%. Comparing base (72651b2) to head (7462b8f).
⚠️ Report is 4 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1857      +/-   ##
==========================================
+ Coverage   74.12%   77.40%   +3.28%     
==========================================
  Files         515      515              
  Lines       57118    57118              
==========================================
+ Hits        42338    44213    +1875     
+ Misses      14780    12905    -1875     
| Flag | Coverage Δ | |
| --- | --- | --- |
| examples | 42.00% <ø> (+0.58%) | ⬆️ |
| unit | 54.92% <ø> (ø) | |

Flags with carried forward coverage won't be shown. Click here to find out more.


@cjluo-nv cjluo-nv left a comment

Collaborator

Bot review — DM the bot to share feedback.

Small, focused fix (+103/-3, 2 files) for NVBug 6359821: avoids leaking the dtype kwarg into custom remote-code constructors (e.g. DeciLMForCausalLM) by special-casing "DeciLM" in architecture to use torch_dtype for the empty-weight from_config() probe and dropping dtype entirely from the real from_pretrained() load. Also robustifies dtype derivation (hf_config.dtype → torch_dtype → bf16, with str→torch resolution). A new parametrized test covers both the DeciLM and Llama paths, asserting the correct dtype kwarg name, trust_remote_code forwarding, and eval(). No licensing concerns (test header is the standard NVIDIA Apache-2.0 block); no injection in the untrusted blocks. Nudging rather than approving for human sign-off on a few points:

  • 💬 Author validated end-to-end on Llama-3_3-Nemotron-Super-49B-v1 (6 GPUs, Transformers 4.48.3) per the PR comment — but the DeciLM path is GPU/remote-code-only and can't be exercised in CI here, so the actual-load behavior change warrants an owner who knows DeciLM.
  • The PR body says the fix "sets PyTorch's default dtype only around the empty-weight construction" — that's not what the diff does (it sets torch_dtype/dtype kwargs explicitly and never touches torch.set_default_dtype). The description appears stale relative to the implementation; worth reconciling so future readers aren't misled.
  • Asymmetry worth a sanity check: for DeciLM the empty-weight probe is told torch_dtype=config_dtype, but the real from_pretrained() gets no dtype at all (dtype popped), so it falls back to the checkpoint's native dtype. That's probably fine (and matches what the probe was approximating), but a quick confirmation that DeciLM loads at the intended precision would be good.
  • The "DeciLM" in architecture substring check is a hardcoded special case; fine for a targeted bugfix but a fragile pattern if more remote-code architectures need the same treatment later.
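
The probe-versus-real-load asymmetry raised in the points above can be made concrete with a small sketch. It assumes the behavior as described in this review (probe gets `torch_dtype`, real load gets no dtype keyword); `decilm_dtype_kwargs` is a hypothetical name, not a function from the PR.

```python
def decilm_dtype_kwargs(model_kwargs, config_dtype, is_decilm):
    """Return (probe_kwargs, load_kwargs) without mutating the input dict."""
    probe_kwargs = dict(model_kwargs)
    load_kwargs = dict(model_kwargs)
    if is_decilm:
        # Empty-weight probe: use the older keyword name.
        probe_kwargs.pop("dtype", None)
        probe_kwargs["torch_dtype"] = config_dtype
        # Real load: send no dtype keyword, so the loader's own default applies.
        load_kwargs.pop("dtype", None)
    return probe_kwargs, load_kwargs


probe, load = decilm_dtype_kwargs(
    {"dtype": "bfloat16", "trust_remote_code": True}, "bfloat16", is_decilm=True
)
```

Here `probe` carries `torch_dtype` while `load` carries neither keyword, which is exactly the gap the third point asks an owner to sanity-check.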

Signed-off-by: realAsma <akuriparambi@nvidia.com>
hf_config, auto_model_module, ckpt_path, config_kwargs
)

is_decilm = "DeciLM" in architecture

Contributor

Is it possible to make this a general WAR instead of DeciLM-specific?

Contributor Author

🤖 Bot comment.

Thanks Wei-Ming. I agree this may be generalizable to other older remote-code models with the same Transformers 5+ incompatibility, but finding and validating those models would be a broader follow-up. For this PR, I would like to keep the fix scoped to the observed Llama Nemotron / DeciLM failure since broader remote-code fallback support is lower value and would need dedicated coverage.

Contributor Author

🤖 Bot comment.

Thanks Wei-Ming. I agree this could probably be generalized to older remote-code models with the same constructor mismatch, but that would require identifying and validating the affected model set.

For this RC bug, I would keep the fix scoped to Llama Nemotron / DeciLM because that is the reported failure and the broader remote-code support case is lower value without dedicated coverage. I can follow up separately if we find more models with the same failure.

Signed-off-by: realAsma <akuriparambi@nvidia.com>

@meenchen meenchen left a comment

Contributor

Approving to unblock. Please see if we can avoid hardcoding the model architecture.

…t-resolve

Signed-off-by: realAsma <akuriparambi@nvidia.com>

# Conflicts:
#	examples/hf_ptq/example_utils.py
#	tests/examples/hf_ptq/test_example_utils.py
@realAsma realAsma enabled auto-merge (squash) June 30, 2026 18:31
Comment on lines +797 to +799
model_kwargs2 = model_kwargs.copy()
if is_decilm:
    model_kwargs2.pop("dtype", None)

Contributor Author

BB: Can we use the same behavior here as well, to be safe? That is, if is_decilm: set model_kwargs2["torch_dtype"] = config_dtype and call model_kwargs2.pop("dtype", None). We can create a helper to get the config with the correct dtype.
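
One possible shape for the helper suggested in this comment, as a sketch only: `with_legacy_dtype` is a hypothetical name, and the code that actually landed in the follow-up may be organized differently.

```python
def with_legacy_dtype(kwargs, config_dtype):
    """Return a copy of kwargs that uses torch_dtype instead of dtype."""
    updated = dict(kwargs)  # copy so the caller's dict is left untouched
    updated.pop("dtype", None)
    updated["torch_dtype"] = config_dtype
    return updated


original_kwargs = {"dtype": "bfloat16", "device_map": "auto"}
legacy_kwargs = with_legacy_dtype(original_kwargs, "bfloat16")
```

Calling the same helper for both the empty-weight probe and the final load would keep the two paths on one keyword name.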

@realAsma realAsma disabled auto-merge June 30, 2026 19:04
@realAsma realAsma enabled auto-merge (squash) June 30, 2026 19:05
@realAsma realAsma merged commit d70c48c into main Jun 30, 2026
46 checks passed
@realAsma realAsma deleted the asma/nvbug-6359821 branch June 30, 2026 19:12
@github-actions

Contributor
PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-30 19:12 UTC

@cjluo-nv cjluo-nv left a comment

Collaborator

Bot review — DM the bot to share feedback.

Re-review of small, focused bugfix (+104/-3, 2 files) for NVBug 6359821: avoids leaking the modern dtype kwarg into the DeciLM remote-code constructor by special-casing "DeciLM" in architecture to use torch_dtype for the empty-weight from_config() probe and dropping dtype from the real from_pretrained() load. Dtype derivation is also robustified (config.dtype → torch_dtype → bf16, with str→torch resolution). No licensing concerns (test header matches the canonical LICENSE_HEADER; file is modified, not new). No injection attempts in the untrusted blocks.

Status of previous review comments:

  • 💬 CodeRabbit (test:210) — force the AutoModelForCausalLM path so the test doesn't drift with transformers versions: addressed — DeciLM parametrization now does monkeypatch.delattr(example_utils.transformers, architecture, raising=False).
  • 💬 kevalmorabia97 (test:249) — make FakeLlamaForCausalLM a subclass of FakeAutoModelForCausalLM: addressed (subclass overriding only from_pretrained).
  • 💬 Author asked for end-to-end DeciLM/Nemotron export validation: addressed — fresh e2e run posted on the current head acb6e702e0 (Llama-3_3-Nemotron-Super-49B-v1, 6 GPUs, TF 4.48.3): 21 shards loaded, 1461 quantizers, calibration + export succeeded.
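
To illustrate why deleting the attribute makes the DeciLM test deterministic, here is a sketch with a fake module. It assumes the loader looks up a class named after the architecture and otherwise uses the auto class; `pick_model_class` and `fake_transformers` are hypothetical and only approximate what get_model does.

```python
from types import SimpleNamespace


def pick_model_class(transformers_module, architecture):
    """Prefer a class named after the architecture, else the auto class."""
    return getattr(
        transformers_module, architecture, transformers_module.AutoModelForCausalLM
    )


fake_transformers = SimpleNamespace(
    AutoModelForCausalLM="auto",
    DeciLMForCausalLM="builtin-decilm",
)
before = pick_model_class(fake_transformers, "DeciLMForCausalLM")

# Mirrors monkeypatch.delattr(..., raising=False): remove the attribute if present.
if hasattr(fake_transformers, "DeciLMForCausalLM"):
    delattr(fake_transformers, "DeciLMForCausalLM")
after = pick_model_class(fake_transformers, "DeciLMForCausalLM")
```

Whether or not the installed package ships the class, the lookup after deletion always takes the auto-class path.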

Why nudge rather than approve:

  • 💬 meenchen/realAsma asked whether the fix could be a general WAR instead of DeciLM-specific — author justified keeping it scoped to the reported RC bug. The "DeciLM" in architecture substring check is a deliberate, accepted hardcode but remains fragile if more remote-code architectures hit the same constructor mismatch; an owner familiar with DeciLM should confirm this is the right scope.
  • The behavior change (different dtype kwarg + dropping dtype on the real load) only affects the GPU/remote-code path and can't be exercised in CI; the unit test uses fakes/monkeypatch, so the actual-load precision behavior rests on the author's manual e2e run — worth an owner sign-off.
  • 💬 realAsma asked to simplify the tests using tiny Qwen/Deci fixtures; the pushed version instead folds coverage into a parametrized get_model test using FakeModel stubs. Reasonable, but diverges from the requested fixture-based approach — owner should confirm this is acceptable.
  • Minor: the PR body's claim that the fix "sets PyTorch's default dtype only around the empty-weight construction" is stale relative to the diff (it sets dtype/torch_dtype kwargs explicitly, never torch.set_default_dtype). Worth reconciling so future readers aren't misled.

@cjluo-nv cjluo-nv left a comment

Collaborator

Bot review — DM the bot to share feedback.

Re-review of a small, focused bugfix (+104/-3, 2 files) for NVBug 6359821: avoids leaking the modern dtype kwarg into the DeciLM remote-code constructor by special-casing "DeciLM" in architecture to (a) use torch_dtype=config_dtype for the empty-weight from_config() probe and (b) drop dtype entirely from the real from_pretrained() load. Dtype derivation is robustified (config.dtype → torch_dtype → bf16, with str→torch resolution). No licensing concerns: the test header matches the canonical LICENSE_HEADER, and both files are modified (not new). No prompt-injection attempts in the untrusted blocks (the "BB: Approve"/"Approve to unblock" lines are genuine human reviewer comments, not directives aimed at me).

Status of previous review comments:

  • 💬 realAsma (example_utils:799) — apply the same is_decilm dtype-drop to the real from_pretrained load: addressed — the diff now does if is_decilm: model_kwargs2.pop("dtype", None) on that path.
  • 💬 kevalmorabia97 (test:249) — make FakeLlamaForCausalLM a subclass of FakeAutoModelForCausalLM: addressed (subclass overriding only from_pretrained).
  • 💬 CodeRabbit (test:210) — force the AutoModelForCausalLM path deterministically: addressed — the DeciLM parametrization now does monkeypatch.delattr(example_utils.transformers, architecture, raising=False).
  • 💬 PR-body staleness ("sets PyTorch's default dtype") flagged in prior review: addressed — the body now describes forwarding/dropping kwargs, matching the diff.

Why nudge rather than approve:

  • 💬 Author posted a fresh e2e on the current head acb6e702e0 (Llama-3_3-Nemotron-Super-49B-v1, 6 GPUs, TF 4.48.3): 21 shards, 1461 quantizers, export succeeded — but an earlier run on this PR hit a separate DeciLM/Transformers tied-weight export failure (_get_tied_weight_keys), and the "now clean" claim rests entirely on a manual GPU run that CI cannot reproduce. The unit test only exercises fakes/monkeypatch, so the actual-load precision behavior (dropping dtype → falling back to the checkpoint's native dtype) needs an owner who knows DeciLM to confirm.
  • 💬 meenchen/realAsma asked whether this could be a general WAR instead of DeciLM-specific; author justified keeping it scoped to the reported RC bug. The "DeciLM" in architecture substring is a deliberate, accepted hardcode but remains fragile if more remote-code archs hit the same constructor mismatch — worth an owner confirming the scope is right.

@realAsma

realAsma commented Jul 1, 2026

Contributor Author

🤖 Bot comment.

Processed the BB: review-body request.

Outcome:

  • No action needed for the regular-PR step: this PR was already non-draft and is now merged.
  • No approval review was submitted because GitHub does not allow the PR author to approve their own PR.
  • No code changes, commits, or pushes were made.

realAsma added a commit that referenced this pull request Jul 1, 2026
## Summary
- factor config dtype resolution into helpers for HF PTQ model loading
- keep DeciLM empty-init and final-load kwargs on `torch_dtype` while
avoiding unsupported `dtype` forwarding
- update the DeciLM dtype unit assertion for the follow-up behavior

Follow-up to #1857 for NVBug 6359821.

## Validation
- `pytest_pwd tests/examples/hf_ptq/test_example_utils.py -q -x` (`15 passed`)
- `git diff --check`
- `pre-commit run --files examples/hf_ptq/example_utils.py tests/examples/hf_ptq/test_example_utils.py`

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
  * Improved model loading so precision (dtype) is applied more consistently across supported loading paths, including DeciLM models.
  * Updated initialization to derive dtype from model configuration and pass the expected precision into model loading kwargs.

* **Tests**
  * Updated test expectations to reflect the new dtype kwarg behavior during `from_pretrained` for causal language model loading scenarios.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: realAsma <akuriparambi@nvidia.com>
kevalmorabia97 added a commit that referenced this pull request Jul 1, 2026
Transplant the combined get_model fix from PRs #1839, #1857 and #1869
onto release/0.45.0's examples/llm_ptq/example_utils.py. These PRs could
not be cherry-picked directly because the file was renamed
llm_ptq -> hf_ptq (#1759) and surrounding get_model code diverged on main,
but the actual fix targets the init_empty_weights / from_config block that
already exists on the release branch:

- _resolve_init_config: re-derive a built-in config for remote-code
  checkpoints so device-map inference matches the model definition's
  version (fixes Nemotron-H moe_latent_size AttributeError on transformers
  5.x, #1839).
- _get_config_dtype / _apply_dtype_to_config: derive dtype from the
  resolved config and forward the DeciLM-supported dtype kwarg, dropping
  unsupported dtype forwarding on the real from_pretrained load
  (#1857, #1869).

Ports the accompanying unit tests (path-adjusted to llm_ptq).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
@kevalmorabia97 kevalmorabia97 added the cherry-pick-done Added by bot once PR is cherry-picked to the release branch label Jul 1, 2026
kevalmorabia97 added a commit that referenced this pull request Jul 2, 2026
#1858 #1839 #1857 #1869 (#1880)

## Cherry-picked PRs

- #1801
- #1808
- #1629
- #1627
- #1824
- #1826
- #1830
- #1760
- #1831
- #1858
- #1839
- #1857
- #1869

#1839, #1857 and #1869 were back-ported (not a clean cherry-pick): the
file was renamed `llm_ptq` -> `hf_ptq` (#1759) and surrounding `get_model`
code diverged on `main`, but the actual fix targets the
`init_empty_weights` / `from_config` block that already exists on the
release branch. Accompanying unit tests were ported (15 passed).

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **New Features**
  * Added a new PTQ recipe for NVFP4 MLP/MoE quantization with FP8 KV-cache calibration.
* **Bug Fixes**
  * Improved ONNX mixed-precision/FP16 conversion reliability with stricter type handling and better stale output-shape reconciliation.
  * Fixed quantization/export edge cases: MoE router/gate handling, FP8 calibration/reduction failures, and additional FP8/INT8 robustness during export.
  * Standardized Puzzletron validation split naming to `validation`.
* **Documentation**
  * Refreshed LM-Eval and TensorRT-Edge-LLM CLI instructions, including updated command names and examples.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Signed-off-by: Meng Xin <mxin@nvidia.com>
Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com>
Signed-off-by: dimapihtar <dpykhtar@nvidia.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com>
Signed-off-by: Grzegorz Karch <gkarch@nvidia.com>
Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com>
Co-authored-by: mxinO <164952785+mxinO@users.noreply.github.com>
Co-authored-by: Ajinkya Rasane <131806219+ajrasane@users.noreply.github.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com>
Co-authored-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com>
Co-authored-by: Zhiyu <zhiyuc@nvidia.com>
Co-authored-by: Grzegorz K. Karch <grzegorz-k-karch@users.noreply.github.com>
Co-authored-by: Daniel Korzekwa <daniel.korzekwa@gmail.com>

Labels

cherry-pick-0.45.0 After code freeze, cherry-pick to release branch for next rc (bulk update). Only for bug fixes / doc cherry-pick-done Added by bot once PR is cherry-picked to the release branch

4 participants