Fix AutoQuantize causal LM score scaling #1810

Draft
realAsma wants to merge 3 commits into main from akuriparambi/fix-aq-score-scaling-20260623

@realAsma

Summary

  • use a sum-reduced shifted-token causal LM loss for HF AutoQuantize gradient scoring when logits and labels are available
  • preserve the existing output.loss fallback for non-causal or non-dict outputs
  • apply the same correction to the HF PTQ, eval, and autodeploy AutoQuantize entrypoints
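
As a dependency-free illustration of the first two bullets, the sketch below computes a sum-reduced, shifted-token cross-entropy over nested Python lists and falls back to a precomputed loss when logits or labels are missing. The names `causal_lm_sum_loss` and `scoring_loss`, the dict-shaped `output`, and the `-100` ignore value are assumptions for illustration; the helper in this PR works on tensors and may differ in detail.

```python
import math

IGNORE_INDEX = -100  # assumed label value marking positions excluded from the loss


def causal_lm_sum_loss(logits, labels, ignore_index=IGNORE_INDEX):
    """Sum next-token cross-entropy terms over all valid labels.

    logits: [batch][seq][vocab] nested lists of floats
    labels: [batch][seq] nested lists of ints
    """
    total = 0.0
    for seq_logits, seq_labels in zip(logits, labels):
        # Position t predicts token t + 1: drop the last logit and the first label.
        for token_logits, target in zip(seq_logits[:-1], seq_labels[1:]):
            if target == ignore_index:
                continue
            # Numerically stable log-sum-exp over the vocabulary.
            peak = max(token_logits)
            log_z = peak + math.log(sum(math.exp(v - peak) for v in token_logits))
            total += log_z - token_logits[target]
    return total


def scoring_loss(output, labels=None):
    """Hypothetical wrapper: prefer the summed loss, else fall back to the model's loss."""
    if isinstance(output, dict) and labels is not None and output.get("logits") is not None:
        return causal_lm_sum_loss(output["logits"], labels)
    return output["loss"] if isinstance(output, dict) else output.loss


# Toy data: 2 samples, 3 positions, vocabulary of 3 tokens.
logits = [
    [[2.0, 0.5, -1.0], [0.1, 0.2, 0.3], [1.0, 1.0, 1.0]],
    [[0.0, 1.0, 2.0], [-0.5, 0.5, 0.0], [0.3, 0.2, 0.1]],
]
labels = [[0, 1, 2], [2, 0, IGNORE_INDEX]]

combined = causal_lm_sum_loss(logits, labels)
split = sum(causal_lm_sum_loss([lg], [lb]) for lg, lb in zip(logits, labels))
print(math.isclose(combined, split))
```

Because every valid label contributes one independent term, the combined-batch loss equals the sum of the single-sample losses, which is the property the validation probe below checks.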

Motivation

A Qwen3 8B bs=1 vs bs=8 AutoQuantize run over the same 128 scoring samples showed nearly identical normalized layer sensitivity shapes but very different absolute totals:

  • NVFP4/W4A4 raw score total: bs1 0.5883269825, bs8 0.008391587796, bs1/bs8 70.109x
  • FP8 raw score total: bs1 0.05273269751, bs8 0.0007778984065, bs1/bs8 67.789x
  • Layer Pearson correlation after per-run normalization: NVFP4/W4A4 0.9619, FP8 0.9864

The existing HF callbacks used Transformers output.loss, which is mean-reduced for causal LM labels. AutoQuantize squares gradients and accumulates scores, so grouping samples into larger batches changes the absolute score scale. A sum-reduced next-token loss makes the score additive over valid labels and stable across batch grouping.
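
The scaling effect can be sketched with a toy model. The example assumes a score that simply accumulates squared per-token gradients, which simplifies whatever AutoQuantize computes internally; `accumulate_score` and the gradient values are hypothetical. Under mean reduction each token's gradient is divided by the number of tokens in its batch, so the squared terms shrink with the square of the batch size.

```python
def accumulate_score(per_token_grads, batch_size, reduction):
    """Accumulate squared gradients over batches of `batch_size` samples.

    per_token_grads: one list per sample, holding the gradient of that
    token's own loss term with respect to its activation (toy values).
    """
    score = 0.0
    for start in range(0, len(per_token_grads), batch_size):
        batch = per_token_grads[start:start + batch_size]
        n_tokens = sum(len(sample) for sample in batch)
        # Mean reduction divides every token's gradient by the batch token count;
        # sum reduction leaves each token's gradient untouched.
        scale = 1.0 / n_tokens if reduction == "mean" else 1.0
        for sample in batch:
            score += sum((scale * g) ** 2 for g in sample)
    return score


# 8 samples with 4 tokens each and deterministic toy gradients.
grads = [[0.1 * (s + 1) + 0.01 * t for t in range(4)] for s in range(8)]

mean_bs1 = accumulate_score(grads, 1, "mean")
mean_bs8 = accumulate_score(grads, 8, "mean")
sum_bs1 = accumulate_score(grads, 1, "sum")
sum_bs8 = accumulate_score(grads, 8, "sum")

print("mean reduction, bs1/bs8:", mean_bs1 / mean_bs8)
print("sum reduction,  bs1/bs8:", sum_bs1 / sum_bs8)
```

With equal-length samples the mean-reduced ratio in this toy setup is 8² = 64, the same order of magnitude as the 70.1x and 67.8x ratios reported above; the real run's token counts and padding are not known here, so the match is only suggestive. The sum-reduced ratio is 1.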

Validation

  • python_pwd -m py_compile examples/llm_ptq/hf_ptq.py examples/llm_eval/quantization_utils.py examples/llm_autodeploy/run_auto_quantize.py
  • targeted CPU probe extracted all three _causal_lm_sum_loss helpers and verified the summed loss is differentiable and identical for one combined batch vs split single-sample batches, including -100 ignored labels
  • git diff --check
  • pre-commit run --files examples/llm_ptq/hf_ptq.py examples/llm_eval/quantization_utils.py examples/llm_autodeploy/run_auto_quantize.py passes ruff/format/bandit/license checks; mypy currently fails on existing accelerate attribute errors in modelopt/torch/quantization/plugins/{accelerate,huggingface}.py outside this diff

@copy-pr-bot

copy-pr-bot Bot commented Jun 23, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 23, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1449bee6-d600-4394-af59-83391d42fe5f


@realAsma realAsma left a comment

Can you add a unit test for a tinyqwen transformers model to make sure that, after your fix, bs = 1 and, say, bs = 2 produce similar scores?

Create a 2-sample input calibration dataset for this; sample 1 length: 16, sample 2 length: 32 (so basically bs = 2 means padding is activated).

Comment thread examples/llm_ptq/hf_ptq.py
Signed-off-by: realAsma <akuriparambi@nvidia.com>
@realAsma realAsma force-pushed the akuriparambi/fix-aq-score-scaling-20260623 branch from ccdc827 to 71257f9 Compare June 23, 2026 22:33
@realAsma

🤖 Bot comment.

Addressed the TinyQwen/Tiny Qwen3 regression test request in 71257f9ffa.

What changed:

  • Added test_autoquantize_huggingface_scores_are_batch_size_invariant_with_padding in tests/unit/torch/quantization/plugins/test_huggingface.py.
  • The test uses two calibration samples with lengths 16 and 32.
  • It compares the same samples as bs=1 singleton batches versus one bs=2 padded batch and asserts AutoQuantize candidate scores match within tight tolerance.
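
The batching setup described in these bullets can be sketched without any model. The example assumes right padding, a pad token id of 0, and labels that copy the input ids with padding marked as `-100`; `pad_batch` and `count_valid_targets` are hypothetical names, and the actual test may build its inputs differently.

```python
PAD_TOKEN_ID = 0      # assumed pad id; a real tokenizer defines its own
IGNORE_INDEX = -100   # assumed label value for positions excluded from the loss


def pad_batch(samples, pad_token_id=PAD_TOKEN_ID, ignore_index=IGNORE_INDEX):
    """Right-pad token id lists to a common length and mask padding in the labels."""
    max_len = max(len(ids) for ids in samples)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for ids in samples:
        n_pad = max_len - len(ids)
        batch["input_ids"].append(list(ids) + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        batch["labels"].append(list(ids) + [ignore_index] * n_pad)
    return batch


def count_valid_targets(batch, ignore_index=IGNORE_INDEX):
    """Count next-token targets that survive the one-position label shift."""
    return sum(1 for row in batch["labels"] for t in row[1:] if t != ignore_index)


sample_16 = list(range(1, 17))   # 16 token ids
sample_32 = list(range(1, 33))   # 32 token ids

singleton_batches = [pad_batch([sample_16]), pad_batch([sample_32])]  # bs=1, no padding
padded_batch = pad_batch([sample_16, sample_32])                      # bs=2, 16 pad tokens

print(sum(count_valid_targets(b) for b in singleton_batches))
print(count_valid_targets(padded_batch))
```

Both groupings expose the same set of valid next-token targets, so a sum-reduced loss sees identical terms either way; only the padded positions differ, and those are ignored.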

Validation run locally:

  • git diff --check
  • python_pwd PYTHONPATH=tests -m py_compile tests/unit/torch/quantization/plugins/test_huggingface.py
  • ruff format --check tests/unit/torch/quantization/plugins/test_huggingface.py
  • ruff check tests/unit/torch/quantization/plugins/test_huggingface.py
  • pytest_pwd PYTHONPATH=tests tests/unit/torch/quantization/plugins/test_huggingface.py::test_autoquantize_huggingface_scores_are_batch_size_invariant_with_padding -q

Signed-off-by: realAsma <akuriparambi@nvidia.com>
@codecov

codecov Bot commented Jun 23, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.77%. Comparing base (54fb87e) to head (2a2163e).
⚠️ Report is 130 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1810      +/-   ##
==========================================
- Coverage   73.22%   68.77%   -4.46%     
==========================================
  Files         478      511      +33     
  Lines       52421    64138   +11717     
==========================================
+ Hits        38387    44112    +5725     
- Misses      14034    20026    +5992     
Flag Coverage Δ
unit 54.66% <ø> (+1.04%) ⬆️

Signed-off-by: realAsma <akuriparambi@nvidia.com>