specdec(recipe): add MiniMax-M2.7-DFlash streaming multi-node pipeline by yeyu-nvidia · Pull Request #1835 · NVIDIA/Model-Optimizer

yeyu-nvidia · 2026-06-26T17:56:26Z

Summary

- Add hf_streaming_dflash_multi_node.yaml for MiniMax-M2.7 (229B MoE) streaming DFlash training
- 2 serve replicas (TP=4, whole node) + 2 trainer nodes (4 GPUs each) over NIXL RDMA hidden-state transport
- Capture IDs [2,17,32,47,62,64] from build_target_layer_ids(64, 5) + final layer output
- MiniMax-specific: trust_remote_code, FSDP2 via accelerate config, mask_token=200054, YaRN rope_scaling factor=48
- Topology matches the Kimi-K2.5 large-MoE streaming recipe

Resolves OMNIML-5221

Test plan

- Dry-run validation (uv run launch.py --yaml ... --dry-run)
- Server-only smoke on CW-DFW (task_1 with training.max_steps=1)
- Full streaming training run

Signed-off-by: Ye Yu <yeyu@nvidia.com>

Summary by CodeRabbit

New Features
- Added a new multi-node launcher configuration for MiniMax-M2.7 streaming training with speculative decoding.
- Includes an end-to-end workflow: dataset preparation, distributed streaming training, and a vLLM speculative-decoding smoke test.
- Supports distributed serving/training runtime settings for checkpoints, timeouts, and accelerator behavior.
Enhancements
- Added support for an optional environment variable to override the Transformers version used during training for improved compatibility.

…e (OMNIML-5221) Streaming DFlash training for MiniMax-M2.7 (229B MoE): 2 serve replicas (TP=4) + 2 trainer nodes over NIXL RDMA hidden-state transport, matching the Kimi-K2.5 large-MoE topology. Signed-off-by: Ye Yu <yeyu@nvidia.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai · 2026-06-26T18:01:20Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8b4ad9a3-c4c3-4d2b-ae6d-546706bf539d

📥 Commits

Reviewing files that changed from the base of the PR and between b71297f and 53bfa52.

📒 Files selected for processing (1)

tools/launcher/common/eagle3/train_eagle_streaming.sh

📝 Walkthrough

Adds a new launcher YAML for MiniMax-M2.7 DFlash streaming multi-node training and updates the streaming launcher script to allow an optional Transformers version override.

Changes

MiniMax DFlash launcher job

Layer / File(s)	Summary
Transformers override `tools/launcher/common/eagle3/train_eagle_streaming.sh`	Conditionally installs a pinned `transformers` version when `OVERRIDE_TRANSFORMERS` is set.
Launcher pipeline `tools/launcher/examples/MiniMax/MiniMax-M2.7-DFlash/hf_streaming_dflash_multi_node.yaml`	Defines a new three-step job spec for dataset generation, multi-node streaming DFlash training, and a vLLM DFlash speculative-decoding smoke test.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error)

Check name	Status	Explanation	Resolution
Security Anti-Patterns	❌ Error	New MiniMax launcher YAML hardcodes trust_remote_code=true and remote-code flags; SECURITY.md forbids hardcoding trust_remote_code=True without an approved exception.	Parameterize trust_remote_code defaulting to false, or remove it and get explicit modelopt-setup-codeowners approval with a written security justification.
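
One way to read the suggested resolution is as an explicit opt-in switch that defaults to off. The sketch below is hypothetical: the `TRUST_REMOTE_CODE` variable name and the `resolve_trust_remote_code` helper are assumptions for illustration and do not come from this PR or from SECURITY.md.

```python
import os
from typing import Mapping, Optional

# Values treated as an explicit opt-in; anything else means "do not trust".
_TRUTHY = {"1", "true", "yes"}


def resolve_trust_remote_code(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True only when the operator explicitly opts in (default: False)."""
    env = os.environ if env is None else env
    raw = env.get("TRUST_REMOTE_CODE", "")  # hypothetical variable name
    return raw.strip().lower() in _TRUTHY


if __name__ == "__main__":
    # With no variable set, remote code stays disabled.
    print(resolve_trust_remote_code({}))
```

A launcher could then pass the resolved boolean to the model loader instead of a literal `true`, so the recipe itself never hardcodes the flag.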

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly describes the new MiniMax-M2.7 DFlash streaming multi-node pipeline added by the PR.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

codecov · 2026-06-26T18:05:14Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.61%. Comparing base (6cc5226) to head (53bfa52).
⚠️ Report is 10 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1835      +/-   ##
==========================================
- Coverage   77.37%   76.61%   -0.76%     
==========================================
  Files         513      515       +2     
  Lines       56894    58331    +1437     
==========================================
+ Hits        44019    44690     +671     
- Misses      12875    13641     +766
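
The percentages in the diff can be checked against the Hits and Lines rows, assuming coverage is simply hits divided by lines. A small sketch of that arithmetic (the `coverage_pct` helper is illustrative only):

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Coverage as a percentage, rounded to two decimals like the report."""
    return round(100.0 * hits / lines, 2)


# Numbers copied from the coverage diff above.
base = coverage_pct(44019, 56894)   # main
head = coverage_pct(44690, 58331)   # PR #1835
delta = round(100.0 * 44690 / 58331 - 100.0 * 44019 / 56894, 2)

# Misses are lines minus hits on each side.
base_misses = 56894 - 44019
head_misses = 58331 - 44690

if __name__ == "__main__":
    print(base, head, delta, base_misses, head_misses)
```

The drop comes from the head adding more lines (+1437) than hits (+671), not from previously covered lines losing coverage.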

Flag	Coverage Δ
regression	`14.83% <ø> (+0.06%)`	⬆️
unit	`54.92% <ø> (+0.29%)`	⬆️


coderabbitai

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@tools/launcher/examples/MiniMax/MiniMax-M2.7-DFlash/hf_streaming_dflash_multi_node.yaml`:
- Line 11: The recipe comment appears to reference the wrong chat format and may
indicate copied settings that need verification. Update the wording in the
MiniMax launch recipe to match MiniMax-M2.7 terminology, then review the copied
configuration values associated with the MiniMax example, especially the capture
ids, mask token, and rope factor, to ensure they are MiniMax-specific and not
inherited from the Ministral setup. Use the surrounding MiniMax recipe fields
and any related template or config symbols as the source of truth while
correcting the note and confirming the values.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 87479670-62c5-41e0-9389-e1ccbb56e1b0

📥 Commits

Reviewing files that changed from the base of the PR and between 6cc5226 and 3721cc1.

📒 Files selected for processing (1)

tools/launcher/examples/MiniMax/MiniMax-M2.7-DFlash/hf_streaming_dflash_multi_node.yaml

h-guo18 · 2026-06-27T00:44:20Z

+      - dflash.dflash_export_rope_scaling.mscale_all_dim=1.0
+    environment:
+      - HF_MODEL_CKPT: <<global_vars.hf_model>>
+      - EAGLE_CAPTURE_IDS: "[2,17,32,47,62,64]"


QQ: MiniMax seems to have 62 layers. What does 64 mean here?

Good catch — MiniMax-M2.7 has 62 hidden layers, not 64. I miscounted from the prior DFlash M3 work (which is 64 layers). Fixed in b71297f: recalculated EAGLE_CAPTURE_IDS from build_target_layer_ids(62, 5) → [2,17,31,45,60,62].
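
For readers following the arithmetic, the sketch below reproduces both ID lists quoted in this thread. It is a reconstruction, not the repository's `build_target_layer_ids`: the evenly spaced selection between layer 1 and layer `num_layers - 3`, the +1 shift, and the helper name `sketch_capture_ids` are all assumptions, chosen only because they match the two quoted outputs.

```python
from typing import List


def sketch_capture_ids(num_layers: int, num_draft_layers: int) -> List[int]:
    """Hypothetical reconstruction of the capture-ID list (needs num_draft_layers > 1)."""
    start, end = 1, num_layers - 3          # assumed selection window
    span = end - start
    picked = [
        # Python's round() rounds halves to the nearest even integer,
        # e.g. 15.5 -> 16 and 44.5 -> 44, which matters for 62 layers.
        int(round(start + i * span / (num_draft_layers - 1)))
        for i in range(num_draft_layers)
    ]
    # Assumed +1 shift, then append the final layer output.
    return [p + 1 for p in picked] + [num_layers]


if __name__ == "__main__":
    print(sketch_capture_ids(64, 5))
    print(sketch_capture_ids(62, 5))
```

Under these assumptions, only the three middle IDs and the two trailing ones move when the layer count changes from 64 to 62; the first two IDs stay at 2 and 17.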

…ng YAML MiniMax-M2.7 has 62 hidden layers per its HF config, not 64 as mistakenly used. Recalculates EAGLE_CAPTURE_IDS from build_target_layer_ids(62, 5) and fixes a copy-paste "Ministral" → "MiniMax" comment typo. Signed-off-by: Ye Yu <yeyu@nvidia.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

train_eagle_streaming.sh did not honour the OVERRIDE_TRANSFORMERS env var, unlike dflash_online_training.sh. The modelopt requirements.txt pulled transformers 5.3.0, which broke vLLM nightly's import of ALLOWED_LAYER_TYPES (renamed in 5.x). Apply the override AFTER requirements install so the pinned version wins. Signed-off-by: Ye Yu <yeyu@nvidia.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

yeyu-nvidia · 2026-06-29T22:15:31Z

@h-guo18 Thanks for catching the layer count — fixed in b71297f (62 not 64, capture IDs updated to [2,17,31,45,60,62]).

Two additional commits since your review:

8ea2ced: OVERRIDE_TRANSFORMERS support in train_eagle_streaming.sh (was missing, only dflash_online_training.sh had it)
915afca: Role-aware dependency install — serve nodes now skip requirements.txt and transformers override entirely (vLLM v0.24+ hard-rejects transformers v4 at import)
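
The install ordering described by these two commits can be modelled as follows. This is a sketch under stated assumptions, not the contents of train_eagle_streaming.sh: the role names, the exact pip commands, and the idea that OVERRIDE_TRANSFORMERS holds a bare version string are hypothetical.

```python
from typing import List, Optional


def plan_install(role: str, override_transformers: Optional[str]) -> List[str]:
    """Return the dependency-install commands a node would run, in order."""
    if role == "serve":
        # Serve nodes skip requirements.txt and the transformers override entirely.
        return []
    steps = ["pip install -r requirements.txt"]
    if override_transformers:
        # Applied after the requirements install so the pinned version wins.
        steps.append(f"pip install transformers=={override_transformers}")
    return steps


if __name__ == "__main__":
    print(plan_install("train", "X.Y.Z"))
    print(plan_install("serve", "X.Y.Z"))
```

Running the override last is what makes it effective: an earlier pin would be replaced by whatever transformers version requirements.txt resolves to.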

Smoke-tested on CW-DFW (cicd_1782759203): training completed 1 step successfully with loss=12.76. All CI checks pass. Ready for re-review when you get a chance.

yeyu-nvidia requested a review from a team as a code owner June 26, 2026 17:56

coderabbitai Bot reviewed Jun 26, 2026

Comment thread tools/launcher/examples/MiniMax/MiniMax-M2.7-DFlash/hf_streaming_dflash_multi_node.yaml Outdated

h-guo18 reviewed Jun 27, 2026

yeyu-nvidia and others added 2 commits June 29, 2026 10:11

yeyu-nvidia requested a review from h-guo18 June 29, 2026 22:15

yeyu-nvidia wants to merge 3 commits into NVIDIA:main from yeyu-nvidia:yeyu/minimax-m2.7-streaming-dflash
