fix: TP config for nemotron-flash-1b + super-49B vllm_deploy cascade by adil-a · Pull Request #2593 · NVIDIA-NeMo/Automodel

adil-a · 2026-06-16T14:15:30Z

What

Fixes two *_vllm_deploy CI tests that cascade from a missing checkpoint:

nemotron-flash-1b PEFT (job 337980668)
llama-3.3-nemotron-super-49B SFT (job 337980592)

Root cause

Both deploys fail with "No checkpoint found under .../robustness_checkpoint" because the upstream finetune/robustness producer dies first: a custom-code (trust_remote_code) architecture with no registered TP plan now hard-errors at tp_size>1 (newer torch DTensor shard_order assert; the #2244 fail-fast in parallelizer.py). On older torch, these models previously rode AutoModel's default base plan.
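The fail-fast behavior described above can be pictured with a minimal sketch. This is not the code in parallelizer.py: the function name `resolve_tp_plan`, the registry layout, the model-type strings, and the plan entry are all hypothetical, chosen only to illustrate why a model without a registered plan is fine at tp_size=1 but errors at tp_size>1.

```python
from typing import Callable, Dict, Optional

# Hypothetical type: a function that returns a mapping of
# module-name pattern -> sharding style for tensor parallelism.
TPPlanFn = Callable[[], Dict[str, str]]

# Hypothetical registry; the key and the plan entry are illustrative only.
REGISTRY: Dict[str, TPPlanFn] = {
    "nemotron-nas": lambda: {"layers.*.self_attn.q_proj": "colwise"},
}


def resolve_tp_plan(
    model_type: str, tp_size: int, registry: Dict[str, TPPlanFn]
) -> Optional[Dict[str, str]]:
    """Return a TP plan, or None when no tensor parallelism is requested.

    Raises ValueError instead of silently falling back to a default plan
    when tp_size > 1 and the model type has no registered plan.
    """
    if tp_size <= 1:
        # No sharding requested, so a missing plan is harmless.
        return None
    plan_fn = registry.get(model_type)
    if plan_fn is None:
        raise ValueError(
            f"No TP plan registered for {model_type!r} at tp_size={tp_size}"
        )
    return plan_fn()
```

Under these assumptions, a hypothetical `"nemotron_flash"` model type resolves to `None` at tp_size=1 and raises at tp_size=2, which mirrors the order of failures in the cascade: the producer aborts at setup, so no checkpoint exists for the deploy test to load.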

Fixes

nemotron-flash-1b (nemotron_flash_1b_squad_peft.yaml): NemotronFlash (hybrid mamba2/deltanet) has no TP plan in any transformers version (verified on 5.5.0 and latest 5.12.1), in the model's Hub code, or in AutoModel, and its hybrid SSM/conv layers can't be expressed with the standard colwise/rowwise TP styles. The robustness cross-TP phase ran at tp_size=2 and aborted at setup, before the checkpoint was saved. As a 1B model it doesn't need TP, so the robustness reload is set to tp_size: 1. (Train, save, AutoModel reload, and HF reload are still validated; only the cross-TP check at tp_size=2, which never had a real plan, is dropped.)
super-49B / DeciLM-nemotron-nas (llama3_3_nemotron_super_49B_squad.yaml): AutoModel already ships get_decilm_nemotron_tp_plan (named llama_nemotron_super_tp_plan, since #1487, "fix: tp plan for nemotron super"), but the recipe never selected it, so the finetune fell through to the broken default plan at tp_size=4. The fix wires distributed.tp_plan: llama_nemotron_super_tp_plan. (All 49 real-attention blocks have 8 KV heads, which divide evenly at tp=4 for the finetune and tp=8 for the robustness phase; the 31 no-op attention blocks stay replicated.)
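The KV-head divisibility claim for super-49B can be checked with simple arithmetic. The sketch below is illustrative only: `kv_heads_per_rank` is a hypothetical helper, not an AutoModel function, and it assumes the common convention that KV heads are split evenly across tensor-parallel ranks.

```python
def kv_heads_per_rank(num_kv_heads: int, tp_size: int) -> int:
    """Return how many KV heads each TP rank holds under an even split.

    Raises ValueError when the heads cannot be divided evenly.
    """
    if tp_size < 1:
        raise ValueError(f"tp_size must be >= 1, got {tp_size}")
    if num_kv_heads % tp_size != 0:
        raise ValueError(
            f"{num_kv_heads} KV heads are not divisible by tp_size={tp_size}"
        )
    return num_kv_heads // tp_size


# Values from the PR description: 8 KV heads per real-attention block,
# tp=4 for the finetune and tp=8 for the robustness phase.
NUM_KV_HEADS = 8
for tp_size in (4, 8):
    print(f"tp_size={tp_size}: {kv_heads_per_rank(NUM_KV_HEADS, tp_size)} KV head(s) per rank")
```

With 8 KV heads this gives 2 heads per rank at tp=4 and 1 head per rank at tp=8, so both phases shard cleanly; a tp size such as 3 would raise.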

Validation

Targeted GitLab pipeline (both recipes + their _vllm_deploy, on this commit): https://gitlab-master.nvidia.com/dl/JoC/nemo-ci/-/pipelines/54962684 (running)

Pre-checks

YAML parses; DCO signed off.

🤖 Generated with Claude Code

Both *_vllm_deploy tests (jobs 337980668 nemotron-flash-1b PEFT, 337980592 llama-3.3-nemotron-super-49B SFT) cascade from "No checkpoint found": the upstream finetune/robustness job dies because a custom-code (trust_remote_code) architecture has no registered TP plan and now hard-errors at tp_size>1 (torch DTensor shard_order assert; #2244 fail-fast in parallelizer.py). These used to ride AutoModel's default base plan on older torch.

- nemotron-flash-1b: NemotronFlash (hybrid mamba2/deltanet) has no TP plan in any transformers version (5.5.0/5.12.1), in the model's Hub code, or in AutoModel; its hybrid layers aren't expressible with the standard TP styles. The robustness cross-TP phase ran at tp_size=2 and aborted before the checkpoint was saved. It's a 1B model that doesn't need TP -> run the robustness reload at tp_size=1.
- super-49B (DeciLM/nemotron-nas): AutoModel already ships a TP plan (get_decilm_nemotron_tp_plan, named "llama_nemotron_super_tp_plan", since #1487) but the recipe never selected it, so the finetune fell through to the broken default plan at tp_size=4. Wire distributed.tp_plan: llama_nemotron_super_tp_plan. All 49 real-attention blocks have 8 KV heads, divisible by tp 4 (finetune) and 8 (robustness).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Adil Asif <adasif@nvidia.com>

copy-pr-bot · 2026-06-16T14:15:39Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

adil-a wants to merge 1 commit into main from adasif/fix/flash1b-super49b-tp-plan
