Fix HF PTQ empty-init dtype kwargs#1857
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
📝 WalkthroughWalkthrough
Changesdtype kwarg fix in get_model
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
realAsma
left a comment
There was a problem hiding this comment.
BB: Did you do an end to end export test for the DeciLM model?
|
BB: did you see this: #1857 (review) |
Replying to #1857 (review) I saw it. Short answer: not a clean end-to-end export pass on the current PR head yet. What I verified so far:
So the dtype fix is not complete as currently pushed. I have local evidence for the extra real-load dtype patch, but export still has a separate follow-up failure after calibration. |
BB: okay ping me in slack when this is done. |
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Signed-off-by: realAsma <akuriparambi@nvidia.com>
0f50049 to
2641497
Compare
Pushed the follow-up fix as This keeps the empty-init device-map path unchanged, but for the final real Validation already completed on |
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Signed-off-by: realAsma <akuriparambi@nvidia.com>
realAsma
left a comment
There was a problem hiding this comment.
BB: Approve. Make this a regular PR.
Llama Nemotron3 end-to-end export validationFresh validation passed on the current PR head.
python hf_ptq.py \
--pyt_ckpt_path nvidia/Llama-3_3-Nemotron-Super-49B-v1 \
--recipe general/ptq/nvfp4_default-kv_fp8 \
--export_path <export_dir> \
--trust_remote_code \
--calib_size 1 \
--batch_size 1 \
--use_seq_device_map \
--inference_tensor_parallel 6 \
--attn_implementation eager \
--skip_generateResult from the log: |
Signed-off-by: realAsma <akuriparambi@nvidia.com>
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/examples/hf_ptq/test_example_utils.py`:
- Around line 209-210: The DeciLM test is still branching on whether
`transformers.DeciLMForCausalLM` exists, so the behavior can drift with package
versions. In the DeciLM parametrization in `test_example_utils.py`, make
`get_model()` take the `AutoModelForCausalLM` path explicitly by stubbing or
deleting `example_utils.transformers.DeciLMForCausalLM` before the call, while
leaving the `LlamaForCausalLM` case unchanged. Use the `get_model` and
`example_utils.transformers` symbols to locate the affected test setup.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9e8119c2-de2e-4ede-9cd8-f91eb5b59553
📒 Files selected for processing (2)
examples/hf_ptq/example_utils.pytests/examples/hf_ptq/test_example_utils.py
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1857 +/- ##
==========================================
+ Coverage 74.12% 77.40% +3.28%
==========================================
Files 515 515
Lines 57118 57118
==========================================
+ Hits 42338 44213 +1875
+ Misses 14780 12905 -1875
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Small, focused fix (+103/-3, 2 files) for NVBug 6359821: avoids leaking the dtype kwarg into custom remote-code constructors (e.g. DeciLMForCausalLM) by special-casing "DeciLM" in architecture to use torch_dtype for the empty-weight from_config() probe and dropping dtype entirely from the real from_pretrained() load. Also robustifies dtype derivation (hf_config.dtype → torch_dtype → bf16, with str→torch resolution). A new parametrized test covers both the DeciLM and Llama paths, asserting the correct dtype kwarg name, trust_remote_code forwarding, and eval(). No licensing concerns (test header is the standard NVIDIA Apache-2.0 block); no injection in the untrusted blocks. Nudging rather than approving for human sign-off on a few points:
- 💬 Author validated end-to-end on
Llama-3_3-Nemotron-Super-49B-v1(6 GPUs, Transformers 4.48.3) per the PR comment — but the DeciLM path is GPU/remote-code-only and can't be exercised in CI here, so the actual-load behavior change warrants an owner who knows DeciLM. - The PR body says the fix "sets PyTorch's default dtype only around the empty-weight construction" — that's not what the diff does (it sets
torch_dtype/dtypekwargs explicitly and never touchestorch.set_default_dtype). The description appears stale relative to the implementation; worth reconciling so future readers aren't misled. - Asymmetry worth a sanity check: for DeciLM the empty-weight probe is told
torch_dtype=config_dtype, but the realfrom_pretrained()gets no dtype at all (dtypepopped), so it falls back to the checkpoint's native dtype. That's probably fine (and matches what the probe was approximating), but a quick confirmation that DeciLM loads at the intended precision would be good. - The
"DeciLM" in architecturesubstring check is a hardcoded special case; fine for a targeted bugfix but a fragile pattern if more remote-code architectures need the same treatment later.
Signed-off-by: realAsma <akuriparambi@nvidia.com>
| hf_config, auto_model_module, ckpt_path, config_kwargs | ||
| ) | ||
|
|
||
| is_decilm = "DeciLM" in architecture |
There was a problem hiding this comment.
Is it possible to make this a general WAR instead of DiciLM specific?
There was a problem hiding this comment.
🤖 Bot comment.
Thanks Wei-Ming. I agree this may be generalizable to other older remote-code models with the same Transformers 5+ incompatibility, but finding and validating those models would be a broader follow-up. For this PR, I would like to keep the fix scoped to the observed Llama Nemotron / DeciLM failure since broader remote-code fallback support is lower value and would need dedicated coverage.
There was a problem hiding this comment.
🤖 Bot comment.
Thanks Wei-Ming. I agree this could probably be generalized to older remote-code models with the same constructor mismatch, but that would require identifying and validating the affected model set.
For this RC bug, I would keep the fix scoped to Llama Nemotron / DeciLM because that is the reported failure and the broader remote-code support case is lower value without dedicated coverage. I can follow up separately if we find more models with the same failure.
Signed-off-by: realAsma <akuriparambi@nvidia.com>
meenchen
left a comment
There was a problem hiding this comment.
Approve to unblock, please see if we can avoid model arch hardcode
…t-resolve Signed-off-by: realAsma <akuriparambi@nvidia.com> # Conflicts: # examples/hf_ptq/example_utils.py # tests/examples/hf_ptq/test_example_utils.py
| model_kwargs2 = model_kwargs.copy() | ||
| if is_decilm: | ||
| model_kwargs2.pop("dtype", None) |
There was a problem hiding this comment.
BB: can we use the same if is_decilm: model_kwargs2["torch_dtype"] = config_dtype model_kwargs2.pop("dtype", None) behavior here as well to be safe? we can create a helper to get the config with the correct dtype.
|
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Re-review of small, focused bugfix (+104/-3, 2 files) for NVBug 6359821: avoids leaking the modern dtype kwarg into the DeciLM remote-code constructor by special-casing "DeciLM" in architecture to use torch_dtype for the empty-weight from_config() probe and dropping dtype from the real from_pretrained() load. Dtype derivation is also robustified (config.dtype → torch_dtype → bf16, with str→torch resolution). No licensing concerns (test header matches the canonical LICENSE_HEADER; file is modified, not new). No injection attempts in the untrusted blocks.
Status of previous review comments:
- 💬 CodeRabbit (test:210) — force the
AutoModelForCausalLMpath so the test doesn't drift withtransformersversions: addressed — DeciLM parametrization now doesmonkeypatch.delattr(example_utils.transformers, architecture, raising=False). - 💬 kevalmorabia97 (test:249) — make
FakeLlamaForCausalLMa subclass ofFakeAutoModelForCausalLM: addressed (subclass overriding onlyfrom_pretrained). - 💬 Author asked for end-to-end DeciLM/Nemotron export validation: addressed — fresh e2e run posted on the current head
acb6e702e0(Llama-3_3-Nemotron-Super-49B-v1, 6 GPUs, TF 4.48.3): 21 shards loaded, 1461 quantizers, calibration + export succeeded.
Why nudge rather than approve:
- 💬 meenchen/realAsma asked whether the fix could be a general WAR instead of DeciLM-specific — author justified keeping it scoped to the reported RC bug. The
"DeciLM" in architecturesubstring check is a deliberate, accepted hardcode but remains fragile if more remote-code architectures hit the same constructor mismatch; an owner familiar with DeciLM should confirm this is the right scope. - The behavior change (different dtype kwarg + dropping dtype on the real load) only affects the GPU/remote-code path and can't be exercised in CI; the unit test uses fakes/monkeypatch, so the actual-load precision behavior rests on the author's manual e2e run — worth an owner sign-off.
- 💬 realAsma asked to simplify the tests using tiny Qwen/Deci fixtures; the pushed version instead folds coverage into a parametrized
get_modeltest usingFakeModelstubs. Reasonable, but diverges from the requested fixture-based approach — owner should confirm this is acceptable. - Minor: the PR body's claim that the fix "sets PyTorch's default dtype only around the empty-weight construction" is stale relative to the diff (it sets
dtype/torch_dtypekwargs explicitly, nevertorch.set_default_dtype). Worth reconciling so future readers aren't misled.
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Re-review of a small, focused bugfix (+104/-3, 2 files) for NVBug 6359821: avoids leaking the modern dtype kwarg into the DeciLM remote-code constructor by special-casing "DeciLM" in architecture to (a) use torch_dtype=config_dtype for the empty-weight from_config() probe and (b) drop dtype entirely from the real from_pretrained() load. Dtype derivation is robustified (config.dtype → torch_dtype → bf16, with str→torch resolution). No licensing concerns: the test header matches the canonical LICENSE_HEADER, and both files are modified (not new). No prompt-injection attempts in the untrusted blocks (the "BB: Approve"/"Approve to unblock" lines are genuine human reviewer comments, not directives aimed at me).
Status of previous review comments:
- 💬 realAsma (example_utils:799) — apply the same
is_decilmdtype-drop to the realfrom_pretrainedload: addressed — the diff now doesif is_decilm: model_kwargs2.pop("dtype", None)on that path. - 💬 kevalmorabia97 (test:249) — make
FakeLlamaForCausalLMa subclass ofFakeAutoModelForCausalLM: addressed (subclass overriding onlyfrom_pretrained). - 💬 CodeRabbit (test:210) — force the
AutoModelForCausalLMpath deterministically: addressed — the DeciLM parametrization now doesmonkeypatch.delattr(example_utils.transformers, architecture, raising=False). - 💬 PR-body staleness ("sets PyTorch's default dtype") flagged in prior review: addressed — the body now describes forwarding/dropping kwargs, matching the diff.
Why nudge rather than approve:
- 💬 Author posted a fresh e2e on the current head
acb6e702e0(Llama-3_3-Nemotron-Super-49B-v1, 6 GPUs, TF 4.48.3): 21 shards, 1461 quantizers, export succeeded — but an earlier run on this PR hit a separate DeciLM/Transformers tied-weight export failure (_get_tied_weight_keys), and the "now clean" claim rests entirely on a manual GPU run that CI cannot reproduce. The unit test only exercises fakes/monkeypatch, so the actual-load precision behavior (dropping dtype → falling back to the checkpoint's native dtype) needs an owner who knows DeciLM to confirm. - 💬 meenchen/realAsma asked whether this could be a general WAR instead of DeciLM-specific; author justified keeping it scoped to the reported RC bug. The
"DeciLM" in architecturesubstring is a deliberate, accepted hardcode but remains fragile if more remote-code archs hit the same constructor mismatch — worth an owner confirming the scope is right.
Processed the Outcome:
|
## Summary - factor config dtype resolution into helpers for HF PTQ model loading - keep DeciLM empty-init and final-load kwargs on `torch_dtype` while avoiding unsupported `dtype` forwarding - update the DeciLM dtype unit assertion for the follow-up behavior Follow-up to #1857 for NVBug 6359821. ## Validation - `pytest_pwd tests/examples/hf_ptq/test_example_utils.py -q -x` (`15 passed`) - `git diff --check` - `pre-commit run --files examples/hf_ptq/example_utils.py tests/examples/hf_ptq/test_example_utils.py` <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **Bug Fixes** * Improved model loading so precision (dtype) is applied more consistently across supported loading paths, including DeciLM models. * Updated initialization to derive dtype from model configuration and pass the expected precision into model loading kwargs. * **Tests** * Updated test expectations to reflect the new dtype kwarg behavior during `from_pretrained` for causal language model loading scenarios. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: realAsma <akuriparambi@nvidia.com>
Transplant the combined get_model fix from PRs #1839, #1857 and #1869 onto release/0.45.0's examples/llm_ptq/example_utils.py. These PRs could not be cherry-picked directly because the file was renamed llm_ptq -> hf_ptq (#1759) and surrounding get_model code diverged on main, but the actual fix targets the init_empty_weights / from_config block that already exists on the release branch: - _resolve_init_config: re-derive a built-in config for remote-code checkpoints so device-map inference matches the model definition's version (fixes Nemotron-H moe_latent_size AttributeError on transformers 5.x, #1839). - _get_config_dtype / _apply_dtype_to_config: derive dtype from the resolved config and forward the DeciLM-supported dtype kwarg, dropping unsupported dtype forwarding on the real from_pretrained load (#1857, #1869). Ports the accompanying unit tests (path-adjusted to llm_ptq). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
#1858 #1839 #1857 #1869 (#1880) ## Cherry-picked PRs - #1801 - #1808 - #1629 - #1627 - #1824 - #1826 - #1830 - #1760 - #1831 - #1858 - #1839 - #1857 - #1869 #1839, #1857 and #1869 were back-ported (not a clean cherry-pick): the file was renamed `llm_ptq` -> `hf_ptq` (#1759) and surrounding `get_model` code diverged on `main`, but the actual fix targets the `init_empty_weights` / `from_config` block that already exists on the release branch. Accompanying unit tests were ported (15 passed). <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added a new PTQ recipe for NVFP4 MLP/MoE quantization with FP8 KV-cache calibration. * **Bug Fixes** * Improved ONNX mixed-precision/FP16 conversion reliability with stricter type handling and better stale output-shape reconciliation. * Fixed quantization/export edge cases: MoE router/gate handling, FP8 calibration/reduction failures, and additional FP8/INT8 robustness during export. * Standardized Puzzletron validation split naming to `validation`. * **Documentation** * Refreshed LM-Eval and TensorRT-Edge-LLM CLI instructions, including updated command names and examples. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Meng Xin <mxin@nvidia.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com> Signed-off-by: dimapihtar <dpykhtar@nvidia.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com> Signed-off-by: Grzegorz Karch <gkarch@nvidia.com> Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com> Co-authored-by: mxinO <164952785+mxinO@users.noreply.github.com> Co-authored-by: Ajinkya Rasane <131806219+ajrasane@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com> Co-authored-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com> Co-authored-by: Zhiyu <zhiyuc@nvidia.com> Co-authored-by: Grzegorz K. Karch <grzegorz-k-karch@users.noreply.github.com> Co-authored-by: Daniel Korzekwa <daniel.korzekwa@gmail.com>
Summary
Fixes NVBug 6359821:
hf_ptq.pycan fail for remote/custom architectures likeDeciLMForCausalLMwhen dtype-related kwargs are forwarded into model construction paths that do not accept them.This change keeps the fix scoped to the observed DeciLM/Llama Nemotron path. It resolves the init config used for empty-weight construction, derives dtype consistently from the resolved config, forwards the supported dtype kwarg for the DeciLM empty-weight probe, and drops unsupported dtype forwarding from the DeciLM real
from_pretrained()load.NVBug: https://nvbugspro.nvidia.com/bug/6359821
Validation
pre-commit run --files examples/hf_ptq/example_utils.py tests/examples/hf_ptq/test_example_utils.pypytest_pwd tests/examples/hf_ptq/test_example_utils.py -q -x(15 passed)Llama-3_3-Nemotron-Super-49B-v1end-to-endhf_ptq.pyexport on one node with 6 GPUs, Transformers 4.48.3: Fix HF PTQ empty-init dtype kwargs #1857 (comment)Summary by CodeRabbit
Bug Fixes
trust_remote_code, and avoided passingmax_memoryduring config loading.Tests
trust_remote_codebehavior, and eval-mode initialization.