
specdec(recipe): add MiniMax-M2.7-DFlash streaming multi-node pipeline #1835

Open
yeyu-nvidia wants to merge 3 commits into NVIDIA:main from yeyu-nvidia:yeyu/minimax-m2.7-streaming-dflash


@yeyu-nvidia yeyu-nvidia commented Jun 26, 2026

Contributor

Summary

  • Add hf_streaming_dflash_multi_node.yaml for MiniMax-M2.7 (229B MoE) streaming DFlash training
  • 2 serve replicas (TP=4, whole node) + 2 trainer nodes (4 GPU each) over NIXL RDMA hidden-state transport
  • Capture IDs [2,17,32,47,62,64] from build_target_layer_ids(64, 5) + final layer output
  • MiniMax-specific: trust_remote_code, FSDP2 via accelerate config, mask_token=200054, YaRN rope_scaling factor=48
  • Topology matches Kimi-K2.5 large-MoE streaming recipe

Resolves OMNIML-5221

Test plan

  • Dry-run validation (uv run launch.py --yaml ... --dry-run)
  • Server-only smoke on CW-DFW (task_1 with training.max_steps=1)
  • Full streaming training run

Signed-off-by: Ye Yu <yeyu@nvidia.com>

Summary by CodeRabbit

  • New Features
    • Added a new multi-node launcher configuration for MiniMax-M2.7 streaming training with speculative decoding.
    • Includes an end-to-end workflow: dataset preparation, distributed streaming training, and a vLLM speculative-decoding smoke test.
    • Supports distributed serving/training runtime settings for checkpoints, timeouts, and accelerator behavior.
  • Enhancements
    • Added support for an optional environment variable to override the Transformers version used during training for improved compatibility.

…e (OMNIML-5221)

Streaming DFlash training for MiniMax-M2.7 (229B MoE): 2 serve
replicas (TP=4) + 2 trainer nodes over NIXL RDMA hidden-state
transport, matching the Kimi-K2.5 large-MoE topology.

Signed-off-by: Ye Yu <yeyu@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@yeyu-nvidia yeyu-nvidia requested a review from a team as a code owner June 26, 2026 17:56

coderabbitai Bot commented Jun 26, 2026

Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8b4ad9a3-c4c3-4d2b-ae6d-546706bf539d

📥 Commits

Reviewing files that changed from the base of the PR and between b71297f and 53bfa52.

📒 Files selected for processing (1)
  • tools/launcher/common/eagle3/train_eagle_streaming.sh

📝 Walkthrough


Adds a new launcher YAML for MiniMax-M2.7 DFlash streaming multi-node training and updates the streaming launcher script to allow an optional Transformers version override.

Changes

MiniMax DFlash launcher job

| Layer / File(s) | Summary |
| --- | --- |
| Transformers override: tools/launcher/common/eagle3/train_eagle_streaming.sh | Conditionally installs a pinned transformers version when OVERRIDE_TRANSFORMERS is set. |
| Launcher pipeline: tools/launcher/examples/MiniMax/MiniMax-M2.7-DFlash/hf_streaming_dflash_multi_node.yaml | Defines a new three-step job spec for dataset generation, multi-node streaming DFlash training, and a vLLM DFlash speculative-decoding smoke test. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes


Caution

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.


❌ Failed checks (1 error)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Security Anti-Patterns | ❌ Error | New MiniMax launcher YAML hardcodes trust_remote_code=true and remote-code flags; SECURITY.md forbids hardcoding trust_remote_code=True without an approved exception. | Parameterize trust_remote_code defaulting to false, or remove it and get explicit modelopt-setup-codeowners approval with a written security justification. |
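
As an illustration of the "parameterize, defaulting to false" resolution suggested by the Security Anti-Patterns check, here is a minimal Python sketch. The environment variable name `TRUST_REMOTE_CODE` and the helper `resolve_trust_remote_code` are hypothetical and are not taken from the launcher or from SECURITY.md; the actual recipe is a YAML file and may expose the setting differently.

```python
import os
from typing import Mapping


def resolve_trust_remote_code(env: Mapping[str, str] = os.environ) -> bool:
    """Return True only when the caller explicitly opts in.

    TRUST_REMOTE_CODE is a hypothetical variable name used for illustration.
    Anything other than an explicit truthy value keeps the safe default.
    """
    raw = env.get("TRUST_REMOTE_CODE", "false")
    return raw.strip().lower() in {"1", "true", "yes"}


# With nothing set, remote code stays disabled.
print(resolve_trust_remote_code({}))
# An operator who has an approved exception opts in explicitly.
print(resolve_trust_remote_code({"TRUST_REMOTE_CODE": "true"}))
```

The point of the design is that the opt-in lives with whoever launches the job rather than in the checked-in recipe.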
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes the new MiniMax-M2.7 DFlash streaming multi-node pipeline added by the PR. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


codecov Bot commented Jun 26, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.61%. Comparing base (6cc5226) to head (53bfa52).
⚠️ Report is 10 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1835      +/-   ##
==========================================
- Coverage   77.37%   76.61%   -0.76%     
==========================================
  Files         513      515       +2     
  Lines       56894    58331    +1437     
==========================================
+ Hits        44019    44690     +671     
- Misses      12875    13641     +766     
| Flag | Coverage Δ | |
| --- | --- | --- |
| regression | 14.83% <ø> (+0.06%) | ⬆️ |
| unit | 54.92% <ø> (+0.29%) | ⬆️ |
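
The project-coverage percentages in the diff above can be checked from the Hits and Lines rows. The short sketch below assumes coverage is simply hits divided by lines, rounded to two decimals; the helper name `coverage_percent` is illustrative and is not part of Codecov.

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage of covered lines, rounded to two decimals."""
    return round(100 * hits / lines, 2)


# Figures copied from the Hits and Lines rows of the coverage diff.
base = coverage_percent(hits=44019, lines=56894)   # main
head = coverage_percent(hits=44690, lines=58331)   # PR head
delta = round(head - base, 2)

print(base, head, delta)
```

Under that assumption the values agree with the reported 77.37%, 76.61%, and -0.76%. The Misses row is the difference of the other two (56894 - 44019 = 12875 on main, 58331 - 44690 = 13641 on the head).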


@coderabbitai coderabbitai Bot left a comment

Contributor


Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@tools/launcher/examples/MiniMax/MiniMax-M2.7-DFlash/hf_streaming_dflash_multi_node.yaml`:
- Line 11: The recipe comment appears to reference the wrong chat format and may
indicate copied settings that need verification. Update the wording in the
MiniMax launch recipe to match MiniMax-M2.7 terminology, then review the copied
configuration values associated with the MiniMax example, especially the capture
ids, mask token, and rope factor, to ensure they are MiniMax-specific and not
inherited from the Ministral setup. Use the surrounding MiniMax recipe fields
and any related template or config symbols as the source of truth while
correcting the note and confirming the values.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 87479670-62c5-41e0-9389-e1ccbb56e1b0

📥 Commits

Reviewing files that changed from the base of the PR and between 6cc5226 and 3721cc1.

📒 Files selected for processing (1)
  • tools/launcher/examples/MiniMax/MiniMax-M2.7-DFlash/hf_streaming_dflash_multi_node.yaml

- dflash.dflash_export_rope_scaling.mscale_all_dim=1.0
environment:
- HF_MODEL_CKPT: <<global_vars.hf_model>>
- EAGLE_CAPTURE_IDS: "[2,17,32,47,62,64]"

Contributor


QQ: MiniMax seems to have 62 layers. What does 64 mean here?

Contributor Author


Good catch: MiniMax-M2.7 has 62 hidden layers, not 64. I miscounted from the prior DFlash M3 work (which has 64 layers). Fixed in b71297f: recalculated EAGLE_CAPTURE_IDS from build_target_layer_ids(62, 5), giving [2,17,31,45,60,62].
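
For readers following the layer-count fix, the sketch below shows one spacing rule that reproduces both capture-ID lists quoted in this thread. It is an assumption, not the source of `build_target_layer_ids`: it spaces the draft taps evenly between layer 1 and layer N-3, shifts by one (treating index 0 as the embedding output), and appends the final layer output as the PR summary describes. The function name `sketch_capture_ids` is hypothetical.

```python
def sketch_capture_ids(num_hidden_layers: int, num_draft_layers: int) -> list:
    """Evenly spaced capture IDs plus the final layer (assumed rule)."""
    if num_draft_layers < 2:
        raise ValueError("this sketch needs at least two draft layers")
    start = 1                      # assumed first tapped layer
    end = num_hidden_layers - 3    # assumed last tapped layer
    span = end - start
    layer_ids = [
        int(round(start + i * span / (num_draft_layers - 1)))
        for i in range(num_draft_layers)
    ]
    # Assumed +1 shift: index 0 of the hidden-state list is the embedding output.
    capture_ids = [layer_id + 1 for layer_id in layer_ids]
    # The PR summary appends the final layer output as the last capture ID.
    return capture_ids + [num_hidden_layers]


print(sketch_capture_ids(64, 5))  # the original, mistaken layer count
print(sketch_capture_ids(62, 5))  # the corrected layer count
```

With 64 layers this yields [2, 17, 32, 47, 62, 64], and with 62 layers it yields [2, 17, 31, 45, 60, 62], matching the old and new values of EAGLE_CAPTURE_IDS. The 62-layer case depends on Python's round-half-to-even behaviour (15.5 rounds to 16, 44.5 rounds to 44), so a port to another language would need the same rounding to get identical IDs.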

yeyu-nvidia and others added 2 commits June 29, 2026 10:11
…ng YAML

MiniMax-M2.7 has 62 hidden layers per its HF config, not 64 as mistakenly
used. Recalculates EAGLE_CAPTURE_IDS from build_target_layer_ids(62, 5) and
fixes a copy-paste "Ministral" → "MiniMax" comment typo.

Signed-off-by: Ye Yu <yeyu@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

train_eagle_streaming.sh did not honour the OVERRIDE_TRANSFORMERS env var,
unlike dflash_online_training.sh. The modelopt requirements.txt pulled
transformers 5.3.0, which broke vLLM nightly's import of
ALLOWED_LAYER_TYPES (renamed in 5.x). Apply the override AFTER
requirements install so the pinned version wins.

Signed-off-by: Ye Yu <yeyu@nvidia.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@yeyu-nvidia yeyu-nvidia requested a review from h-guo18 June 29, 2026 22:15
@yeyu-nvidia

Contributor Author

@h-guo18 Thanks for catching the layer count — fixed in b71297f (62 not 64, capture IDs updated to [2,17,31,45,60,62]).

Two additional commits since your review:

  • 8ea2ced: OVERRIDE_TRANSFORMERS support in train_eagle_streaming.sh (was missing, only dflash_online_training.sh had it)
  • 915afca: Role-aware dependency install — serve nodes now skip requirements.txt and transformers override entirely (vLLM v0.24+ hard-rejects transformers v4 at import)
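
The install ordering described in these two commits can be summarized in a small Python sketch. The real logic lives in a shell script that is not shown in this thread, so the function name `plan_install_steps`, the role strings, and the assumption that OVERRIDE_TRANSFORMERS holds a bare version string are all illustrative.

```python
from typing import List, Optional


def plan_install_steps(role: str, override_transformers: Optional[str]) -> List[str]:
    """Return the pip commands a node would run, in order (illustrative only)."""
    if role == "serve":
        # Serve nodes skip requirements.txt and the transformers override entirely.
        return []
    steps = ["pip install -r requirements.txt"]
    if override_transformers:
        # The override runs AFTER the requirements install so the pinned version wins.
        steps.append(f"pip install transformers=={override_transformers}")
    return steps


# "X.Y.Z" is a placeholder, not a version recommended by this PR.
print(plan_install_steps("serve", "X.Y.Z"))
print(plan_install_steps("train", None))
print(plan_install_steps("train", "X.Y.Z"))
```

Keeping the override as the last step matters because a later `pip install` of the same package replaces whatever version the requirements file resolved.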

Smoke-tested on CW-DFW (cicd_1782759203): training completed 1 step successfully with loss=12.76. All CI checks pass. Ready for re-review when you get a chance.

