
fix: TP config for nemotron-flash-1b + super-49B vllm_deploy cascade #2593

Draft
adil-a wants to merge 1 commit into main from adasif/fix/flash1b-super49b-tp-plan

adil-a commented Jun 16, 2026

Collaborator

What

Fixes two *_vllm_deploy CI tests that cascade from a missing checkpoint: nemotron-flash-1b PEFT (job 337980668) and llama-3.3-nemotron-super-49B SFT (job 337980592).

Root cause

Both deploys fail with "No checkpoint found under .../robustness_checkpoint" because the upstream finetune/robustness producer dies first. A custom-code (trust_remote_code) architecture with no registered TP plan now hard-errors at tp_size>1 (the shard_order assert in newer torch DTensor; the #2244 fail-fast in parallelizer.py). On older torch, these models previously ran on AutoModel's default base plan.
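The fail-fast behavior described above can be sketched roughly as follows. This is an illustration only, not the code in parallelizer.py: the function resolve_tp_plan, the MissingTPPlanError class, the registry dict, and the stub plan contents are all hypothetical. Only the plan name llama_nemotron_super_tp_plan comes from this PR.

```python
from typing import Callable, Dict, Optional

# Hypothetical registry mapping a plan name to a factory. Only the name
# "llama_nemotron_super_tp_plan" appears in the PR; the plan body is a stub.
TP_PLAN_REGISTRY: Dict[str, Callable[[], dict]] = {
    "llama_nemotron_super_tp_plan": lambda: {"<module pattern>": "colwise"},
}


class MissingTPPlanError(RuntimeError):
    """Raised when tp_size > 1 and no usable TP plan can be resolved."""


def resolve_tp_plan(
    tp_size: int,
    requested_plan: Optional[str],
    is_custom_code: bool,
) -> Optional[dict]:
    """Return a TP plan, or None when tensor parallelism is not in use."""
    if tp_size <= 1:
        # No sharding requested, so no plan is needed.
        return None
    if requested_plan is not None:
        # An explicitly selected plan wins over any default.
        if requested_plan not in TP_PLAN_REGISTRY:
            raise MissingTPPlanError(f"unknown TP plan: {requested_plan!r}")
        return TP_PLAN_REGISTRY[requested_plan]()
    if is_custom_code:
        # Fail fast instead of silently applying a default plan that the
        # custom architecture may not be compatible with.
        raise MissingTPPlanError(
            f"custom-code model has no registered TP plan at tp_size={tp_size}"
        )
    # Stand-in for a default base plan on built-in architectures.
    return {"<module pattern>": "default"}
```

Under these assumptions, a custom-code model at tp_size=2 with no plan raises at setup, while the same model at tp_size=1, or with an explicitly selected plan, proceeds.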

Fixes

  • nemotron-flash-1b (nemotron_flash_1b_squad_peft.yaml): NemotronFlash (hybrid mamba2/deltanet) has no TP plan in transformers (checked 5.5.0 and the latest, 5.12.1), in the model's Hub code, or in AutoModel, and its hybrid SSM/conv layers can't be expressed with the standard colwise/rowwise TP styles. The robustness cross-TP phase ran at tp_size=2 and aborted at setup, before the checkpoint was saved. A 1B model doesn't need TP → set the robustness reload to tp_size: 1. (Train→save→AutoModel-reload→HF-reload are still validated; only the cross-TP check at tp_size=2, which never had a real plan, is dropped.)
  • super-49B / DeciLM-nemotron-nas (llama3_3_nemotron_super_49B_squad.yaml): AutoModel already ships get_decilm_nemotron_tp_plan (named llama_nemotron_super_tp_plan, since #1487, "fix: tp plan for nemotron super"), but the recipe never selected it, so the finetune fell through to the broken default plan at tp_size=4. Fix: set distributed.tp_plan: llama_nemotron_super_tp_plan. (All 49 real-attention blocks have 8 KV heads, which divide evenly across tp=4 for the finetune and tp=8 for robustness; the 31 no-op attention blocks stay replicated.)
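The KV-head divisibility argument for super-49B can be checked with a few lines of Python. This is a sketch under stated assumptions: the helper name blocks_not_divisible and the block ordering are hypothetical, and no-op attention blocks are modeled as None. The counts (49 real-attention blocks, 31 no-op blocks, 8 KV heads) are taken from the description above.

```python
from typing import List, Optional


def blocks_not_divisible(
    kv_heads_per_block: List[Optional[int]], tp_size: int
) -> List[int]:
    """Return indices of attention blocks whose KV heads cannot be split
    evenly across tp_size ranks. None marks a no-op attention block, which
    stays replicated and is therefore skipped."""
    return [
        i
        for i, kv in enumerate(kv_heads_per_block)
        if kv is not None and kv % tp_size != 0
    ]


# Counts from the PR description; the ordering here is for illustration only.
blocks: List[Optional[int]] = [8] * 49 + [None] * 31

for tp in (4, 8):  # finetune and robustness TP sizes
    bad = blocks_not_divisible(blocks, tp)
    print(f"tp_size={tp}: {len(bad)} blocks not divisible")
```

With 8 KV heads per real-attention block, both tp_size=4 and tp_size=8 leave no offending blocks; a larger size such as 16 would flag all 49.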

Validation

Pre-checks

  • YAML parses; DCO signed off.

🤖 Generated with Claude Code

Both *_vllm_deploy tests (jobs 337980668 nemotron-flash-1b PEFT,
337980592 llama-3.3-nemotron-super-49B SFT) cascade from
"No checkpoint found": the upstream finetune/robustness job dies because a
custom-code (trust_remote_code) architecture has no registered TP plan and
now hard-errors at tp_size>1 (torch DTensor shard_order assert; #2244
fail-fast in parallelizer.py). These used to ride AutoModel's default base
plan on older torch.

- nemotron-flash-1b: NemotronFlash (hybrid mamba2/deltanet) has no TP plan
  in any transformers version (5.5.0/5.12.1), in the model's Hub code, or
  in AutoModel; its hybrid layers aren't expressible with the standard TP
  styles. The robustness cross-TP phase ran at tp_size=2 and aborted before
  the checkpoint was saved. It's a 1B model that doesn't need TP -> run the
  robustness reload at tp_size=1.

- super-49B (DeciLM/nemotron-nas): AutoModel already ships a TP plan
  (get_decilm_nemotron_tp_plan, named "llama_nemotron_super_tp_plan", since
  #1487) but the recipe never selected it, so the finetune fell through to
  the broken default plan at tp_size=4. Wire
  distributed.tp_plan: llama_nemotron_super_tp_plan. All 49 real-attention
  blocks have 8 KV heads, divisible by tp 4 (finetune) and 8 (robustness).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Adil Asif <adasif@nvidia.com>

copy-pr-bot Bot commented Jun 16, 2026


This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.
