Refine DeciLM dtype handling in HF PTQ#1869
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughRefactors ChangesDtype Helper Refactor
Estimated code review effort: 3 (Moderate) | ~20 minutes Possibly related PRs
Suggested reviewers: 🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Signed-off-by: realAsma <akuriparambi@nvidia.com>
b54a6c1 to
ff759b3
Compare
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #1869 +/- ##
==========================================
- Coverage 74.12% 73.76% -0.37%
==========================================
Files 515 515
Lines 57118 57724 +606
==========================================
+ Hits 42338 42578 +240
- Misses 14780 15146 +366
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Signed-off-by: realAsma <akuriparambi@nvidia.com>
Signed-off-by: realAsma <akuriparambi@nvidia.com>
1128a8c to
3ce217b
Compare
realAsma
left a comment
There was a problem hiding this comment.
BB: approve. Make this a regular PR
|
BB: Can you do an end to end export test for the Llama Nemotron model as well as Qwen3 8B? Please share the relevant parts of the log here? Please send the log files from my slack account to the release work channel thread for this PR. |
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Clean, small refactor (+24/-16, 2 files) that factors the HF-PTQ config-dtype resolution into two helpers (_get_config_dtype, _apply_dtype_to_config), removing duplicated inline logic across the empty-init and final-load paths. Follow-up to #1857 (NVBug 6359821).
Verified:
_get_config_dtypereproduces the prior inline logic exactly (dtype→torch_dtype→ bf16, str→torch resolution)._apply_dtype_to_configcorrectly unifies both call sites: empty-init passesapply_config_dtype=True(DeciLM→torch_dtype, others→dtype); final load defaults toapply_config_dtype=False(DeciLM→torch_dtype+dropdtype, others unchanged).- Deliberate behavior change: the DeciLM final
from_pretrainednow passestorch_dtype=config_dtype(previously passed no dtype after poppingdtype). This is the stated intent of the PR and is covered by the updated parametrized test (assert kwargs["torch_dtype"] is torch.float16). config_dtypeused after thewith init_empty_weights(...)block is fine —withdoesn't create a new scope.- Test coverage: parametrized over DeciLM and Llama, asserting expected/unexpected dtype kwargs for both
from_configandfrom_pretrained. All assertions trace through correctly.
No licensing changes (existing headers untouched). No prompt-injection in PR metadata. The only caveat is that the real DeciLM from_pretrained path is GPU-only and exercised here via fakes, not end-to-end in CI — consistent with #1857's known GPU-only validation and the author's local pytest run.
Complex PR: 1 existing test file modified or removed. Looping in a human for approval.
E2E export validation on PR head Llama Nemotron: passed. Relevant sanitized log excerpt: Qwen3-8B: did not complete export. The first run with the Nemotron-pinned Transformers 4.48.3 failed because that version does not recognize Relevant sanitized log excerpt: I also posted the summary to the release-work thread. I attempted to upload the full log files there, but Slack rejected file upload from the available integrations: one path lacks file-upload scope, and the BeeBot Slack token is not in that channel. No code was committed or pushed. |
Signed-off-by: realAsma <akuriparambi@nvidia.com>
|
/claude review |
There was a problem hiding this comment.
Claude review passed — no blocking issues found. LGTM
Scope: Reviewed the 2 files in the authoritative PR file list (examples/hf_ptq/example_utils.py, tests/examples/hf_ptq/test_example_utils.py). Note: a raw git diff origin/main HEAD surfaces many unrelated files because the local origin/main is 9 commits ahead of this PR's merge base (per codecov's "9 commits behind head") — those are reverse-diffs of later work, not part of this PR, so I excluded them.
Findings — CRITICAL: 0, IMPORTANT: 0, SUGGESTION: 0
What I verified:
_get_config_dtypereproduces the prior inline dtype resolution exactly (dtype→torch_dtype→ bf16, with str→torch resolution)._apply_dtype_to_configunifies both call sites correctly: empty-init (apply_config_dtype=True) → DeciLM setstorch_dtype+dropsdtype, others setdtype; final load (defaultFalse) → DeciLM setstorch_dtype+dropsdtype, others unchanged (dtype="auto"preserved).config_dtypeis assigned inside thewith init_empty_weights(...)block but consumed after it — safe, sincewithintroduces no new scope and both statements are unconditionally in the same branch.- The deliberate behavior change (DeciLM final
from_pretrainednow forwardstorch_dtype=config_dtype) matches the PR's stated NVBug fix and is GPU-validated (Llama Nemotron passed in the author's e2e log). DeciLM/Nemotron pins older transformers that accepttorch_dtype, so this is the correct kwarg. - The updated test assertions trace correctly through both the DeciLM (
torch_dtype) and Llama (dtype=="auto") branches.
The Qwen3-8B dtype failure noted in the PR thread is on the non-DeciLM empty-init path and is unchanged by this PR (pre-existing, out of scope).
Low risk: small, well-scoped refactor that preserves existing logic with one intentional, tested behavior change.
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Clean, small refactor (+23/-16 in example_utils.py, +1/-1 in the test) that factors HF-PTQ config-dtype resolution into _get_config_dtype and _apply_dtype_to_config, removing the duplicated inline logic and the is_decilm flag from get_model. Follow-up to #1857 (NVBug 6359821).
Verified:
_get_config_dtypereproduces the prior inline logic exactly (dtype→torch_dtype→bf16, str→torch resolution)._apply_dtype_to_configunifies both call sites: empty-init (apply_config_dtype=True) → DeciLM setstorch_dtype+dropsdtype, others setdtype=config_dtype; final load (defaultFalse) → DeciLM setstorch_dtype+dropsdtype, others unchanged sodtype="auto"is preserved.config_dtypeassigned insidewith init_empty_weights(...)and consumed after it is safe (withintroduces no new scope; both statements are unconditional in the same branch).- Deliberate behavior change: DeciLM final
from_pretrainednow forwardstorch_dtype=config_dtype(previously passed nothing after poppingdtype). Covered by the updated parametrized test (assert kwargs["torch_dtype"] is torch.float16), which traces correctly for both DeciLM and Llama branches throughfrom_configandfrom_pretrained.
Previous review comments (all addressed):
- 💬 "rename to
_apply_dtype_to_config" — done. - 💬 "handle both DeciLM and general dtype in the helper" — done via
apply_config_dtypeflag. - 💬 "move
is_DeciLMflag into the helper / keep call site model-agnostic" — done; helper takesarchitectureand detects DeciLM internally (commit43ab80f3).
Licensing: no license files touched; existing/standard headers only — not a concern here.
Why nudge rather than approve:
- The real DeciLM
from_pretrainedpath is GPU-only and exercised here via fakes, not end-to-end in CI (consistent with #1857's known GPU-only validation). The author's e2e log shows Llama-Nemotron passing; Qwen3-8B did not complete but that failure is on the unchanged non-DeciLM empty-init path (pre-existing, out of scope). - This modifies an existing test file and lands a deliberate behavior change on a GPU-only path — worth a human sign-off.
- The PR metadata/threads contain author workflow commands ("BB: approve", "BB: push",
/claude review) and bot "LGTM" messages; these are directed at the author's build bot, not injection against the review, and I did not treat them as instructions. Noting them for transparency.
cjluo-nv
left a comment
There was a problem hiding this comment.
Bot review — DM the bot to share feedback.
Clean, small refactor (+24/-17, 2 files) factoring HF-PTQ config-dtype resolution into _get_config_dtype and _apply_dtype_to_config, removing duplicated inline logic and the is_decilm flag from get_model. Follow-up to #1857 (NVBug 6359821).
Verified against the full source file:
_get_config_dtypereproduces the prior inline logic exactly (dtype→torch_dtype→bf16, str→torch resolution)._apply_dtype_to_configunifies both call sites: empty-init (apply_config_dtype=True) → DeciLM setstorch_dtype+dropsdtype, others setdtype=config_dtype; final load (defaultFalse) → DeciLM setstorch_dtype+dropsdtype, others unchanged sodtype="auto"is preserved. Matches original branch-by-branch.config_dtypeis assigned insidewith init_empty_weights(...)and consumed after — safe,withintroduces no new scope.- Deliberate behavior change: DeciLM final
from_pretrainednow forwardstorch_dtype=config_dtype(previously passed nothing after poppingdtype), covered by the updated parametrized test assertingkwargs["torch_dtype"] is torch.float16for DeciLM anddtype=="auto"for Llama.
Previous review comments all addressed:
- 💬 Author: rename to
_apply_dtype_to_config— done. - 💬 Author: handle both DeciLM and general dtype in the helper — done via
apply_config_dtypeflag. - 💬 Author: move
is_DeciLMflag into the helper / keep call site model-agnostic — done (commit43ab80f3); helper takesarchitectureand detects DeciLM internally.
Licensing: existing standard headers only, none touched.
Why nudge rather than approve: the DeciLM from_pretrained path is a deliberate behavior change exercised via fakes, not end-to-end in CI (GPU-only; author's log shows Llama-Nemotron passing, Qwen3-8B failing on the unchanged non-DeciLM empty-init path — pre-existing, out of scope). This lands a behavior change on a GPU-only path and modifies an existing test file, so a human should sign off.
Note for transparency: the PR threads contain author-directed build-bot commands ("BB: approve", "BB: push", "/claude review") and bot "LGTM" messages. These are directed at the author's own tooling, not injection against the review; I did not treat them as instructions.
|
Transplant the combined get_model fix from PRs #1839, #1857 and #1869 onto release/0.45.0's examples/llm_ptq/example_utils.py. These PRs could not be cherry-picked directly because the file was renamed llm_ptq -> hf_ptq (#1759) and surrounding get_model code diverged on main, but the actual fix targets the init_empty_weights / from_config block that already exists on the release branch: - _resolve_init_config: re-derive a built-in config for remote-code checkpoints so device-map inference matches the model definition's version (fixes Nemotron-H moe_latent_size AttributeError on transformers 5.x, #1839). - _get_config_dtype / _apply_dtype_to_config: derive dtype from the resolved config and forward the DeciLM-supported dtype kwarg, dropping unsupported dtype forwarding on the real from_pretrained load (#1857, #1869). Ports the accompanying unit tests (path-adjusted to llm_ptq). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com>
#1858 #1839 #1857 #1869 (#1880) ## Cherry-picked PRs - #1801 - #1808 - #1629 - #1627 - #1824 - #1826 - #1830 - #1760 - #1831 - #1858 - #1839 - #1857 - #1869 #1839, #1857 and #1869 were back-ported (not a clean cherry-pick): the file was renamed `llm_ptq` -> `hf_ptq` (#1759) and surrounding `get_model` code diverged on `main`, but the actual fix targets the `init_empty_weights` / `from_config` block that already exists on the release branch. Accompanying unit tests were ported (15 passed). <!-- This is an auto-generated comment: release notes by coderabbit.ai --> ## Summary by CodeRabbit * **New Features** * Added a new PTQ recipe for NVFP4 MLP/MoE quantization with FP8 KV-cache calibration. * **Bug Fixes** * Improved ONNX mixed-precision/FP16 conversion reliability with stricter type handling and better stale output-shape reconciliation. * Fixed quantization/export edge cases: MoE router/gate handling, FP8 calibration/reduction failures, and additional FP8/INT8 robustness during export. * Standardized Puzzletron validation split naming to `validation`. * **Documentation** * Refreshed LM-Eval and TensorRT-Edge-LLM CLI instructions, including updated command names and examples. <!-- end of auto-generated comment: release notes by coderabbit.ai --> --------- Signed-off-by: Meng Xin <mxin@nvidia.com> Signed-off-by: Keval Morabia <28916987+kevalmorabia97@users.noreply.github.com> Signed-off-by: ajrasane <131806219+ajrasane@users.noreply.github.com> Signed-off-by: dimapihtar <dpykhtar@nvidia.com> Signed-off-by: Chenjie Luo <chenjiel@nvidia.com> Signed-off-by: Zhiyu Cheng <zhiyuc@nvidia.com> Signed-off-by: Grzegorz Karch <gkarch@nvidia.com> Signed-off-by: Daniel Korzekwa <dkorzekwa@nvidia.com> Co-authored-by: mxinO <164952785+mxinO@users.noreply.github.com> Co-authored-by: Ajinkya Rasane <131806219+ajrasane@users.noreply.github.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Co-authored-by: Dmytro Pykhtar <37850217+dimapihtar@users.noreply.github.com> Co-authored-by: Chenjie Luo <108829653+cjluo-nv@users.noreply.github.com> Co-authored-by: Zhiyu <zhiyuc@nvidia.com> Co-authored-by: Grzegorz K. Karch <grzegorz-k-karch@users.noreply.github.com> Co-authored-by: Daniel Korzekwa <daniel.korzekwa@gmail.com>
Summary
torch_dtypewhile avoiding unsupporteddtypeforwardingFollow-up to #1857 for NVBug 6359821.
Validation
pytest_pwd tests/examples/hf_ptq/test_example_utils.py -q -x(15 passed)git diff --checkpre-commit run --files examples/hf_ptq/example_utils.py tests/examples/hf_ptq/test_example_utils.pySummary by CodeRabbit
Bug Fixes
Tests
from_pretrainedfor causal language model loading scenarios.