specdec(recipe): add MiniMax-M2.7-DFlash streaming multi-node pipeline #1835
yeyu-nvidia wants to merge 3 commits into
…e (OMNIML-5221) Streaming DFlash training for MiniMax-M2.7 (229B MoE): 2 serve replicas (TP=4) + 2 trainer nodes over NIXL RDMA hidden-state transport, matching the Kimi-K2.5 large-MoE topology. Signed-off-by: Ye Yu <yeyu@nvidia.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
No actionable comments were generated in the recent review. 🎉
📝 Walkthrough
Adds a new launcher YAML for MiniMax-M2.7 DFlash streaming multi-node training and updates the streaming launcher script to allow an optional Transformers version override.
Changes
MiniMax DFlash launcher job
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Caution: Pre-merge checks failed. Please resolve all errors before merging. Addressing warnings is optional.
❌ Failed checks (1 error)
✅ Passed checks (5 passed)
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #1835 +/- ##
==========================================
- Coverage 77.37% 76.61% -0.76%
==========================================
Files 513 515 +2
Lines 56894 58331 +1437
==========================================
+ Hits 44019 44690 +671
- Misses 12875 13641 +766
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tools/launcher/examples/MiniMax/MiniMax-M2.7-DFlash/hf_streaming_dflash_multi_node.yaml`:
- Line 11: The recipe comment appears to reference the wrong chat format and may
indicate copied settings that need verification. Update the wording in the
MiniMax launch recipe to match MiniMax-M2.7 terminology, then review the copied
configuration values associated with the MiniMax example, especially the capture
ids, mask token, and rope factor, to ensure they are MiniMax-specific and not
inherited from the Ministral setup. Use the surrounding MiniMax recipe fields
and any related template or config symbols as the source of truth while
correcting the note and confirming the values.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 87479670-62c5-41e0-9389-e1ccbb56e1b0
📒 Files selected for processing (1)
tools/launcher/examples/MiniMax/MiniMax-M2.7-DFlash/hf_streaming_dflash_multi_node.yaml
- dflash.dflash_export_rope_scaling.mscale_all_dim=1.0
environment:
- HF_MODEL_CKPT: <<global_vars.hf_model>>
- EAGLE_CAPTURE_IDS: "[2,17,32,47,62,64]"
QQ: MiniMax seems to have 62 layers. What does 64 mean here?
Good catch — MiniMax-M2.7 has 62 hidden layers, not 64. I miscounted from the prior DFlash M3 work (which is 64 layers). Fixed in b71297f: recalculated EAGLE_CAPTURE_IDS from build_target_layer_ids(62, 5) → [2,17,31,45,60,62].
…ng YAML MiniMax-M2.7 has 62 hidden layers per its HF config, not 64 as mistakenly used. Recalculates EAGLE_CAPTURE_IDS from build_target_layer_ids(62, 5) and fixes a copy-paste "Ministral" → "MiniMax" comment typo. Signed-off-by: Ye Yu <yeyu@nvidia.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
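The arithmetic behind the two capture-ID lists quoted in this thread can be sketched as follows. This is a guess at the spacing rule, not the repository's `build_target_layer_ids`: the helper names `evenly_spaced_layer_ids` and `capture_ids` are hypothetical, and the rule (evenly spaced ids between layer 1 and layer N-3, shifted by one, plus the final layer output) is an assumption chosen because it reproduces both lists.

```python
def evenly_spaced_layer_ids(num_target_layers: int, num_draft_layers: int) -> list[int]:
    """Hypothetical stand-in for build_target_layer_ids (assumed spacing rule)."""
    if num_draft_layers == 1:
        return [num_target_layers // 2]
    start, end = 1, num_target_layers - 3
    span = end - start
    # Python's round() rounds exact halves to the nearest even integer,
    # which matters for the 62-layer case (15.5 -> 16, 44.5 -> 44).
    return [
        round(start + i * span / (num_draft_layers - 1))
        for i in range(num_draft_layers)
    ]


def capture_ids(num_target_layers: int, num_draft_layers: int) -> list[int]:
    # Assumption: capture ids are the spaced layer ids shifted by one,
    # followed by the final layer output.
    shifted = [i + 1 for i in evenly_spaced_layer_ids(num_target_layers, num_draft_layers)]
    return shifted + [num_target_layers]


if __name__ == "__main__":
    print(capture_ids(64, 5))  # the list originally in the YAML
    print(capture_ids(62, 5))  # the list after the fix in b71297f
```

Under these assumptions a 64-layer count yields the original `EAGLE_CAPTURE_IDS` and a 62-layer count yields the corrected one, so the final entry of the list is simply the layer count that the reviewer questioned.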
train_eagle_streaming.sh did not honour the OVERRIDE_TRANSFORMERS env var, unlike dflash_online_training.sh. The modelopt requirements.txt pulled transformers 5.3.0, which broke vLLM nightly's import of ALLOWED_LAYER_TYPES (renamed in 5.x). Apply the override AFTER requirements install so the pinned version wins. Signed-off-by: Ye Yu <yeyu@nvidia.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
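The ordering described in this commit message (install requirements first, then apply the override so the pinned version wins) can be illustrated with a small Python sketch. It is not the content of `train_eagle_streaming.sh`: the function `plan_install_steps` is hypothetical, and it assumes `OVERRIDE_TRANSFORMERS` holds a bare version string, which the thread does not confirm.

```python
import os


def plan_install_steps(env=None):
    """Return pip commands in execution order (illustrative only)."""
    env = os.environ if env is None else env
    # Step 1: the requirements file may pull in its own transformers version.
    steps = [["pip", "install", "-r", "requirements.txt"]]
    # Step 2 (assumed semantics): a non-empty OVERRIDE_TRANSFORMERS is treated
    # as a version and installed last, so it replaces whatever step 1 installed.
    override = env.get("OVERRIDE_TRANSFORMERS", "").strip()
    if override:
        steps.append(["pip", "install", f"transformers=={override}"])
    return steps


if __name__ == "__main__":
    # "4.57.1" is a placeholder version used only for this example.
    for cmd in plan_install_steps({"OVERRIDE_TRANSFORMERS": "4.57.1"}):
        print(" ".join(cmd))
```

Running the override before the requirements install would let the requirements file overwrite the pin again, which is the failure mode the commit message describes.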
@h-guo18 Thanks for catching the layer count. Fixed in b71297f (62 not 64, capture IDs updated to [2,17,31,45,60,62]). Two additional commits since your review:
Smoke-tested on CW-DFW (cicd_1782759203): training completed 1 step successfully with loss=12.76. All CI checks pass. Ready for re-review when you get a chance.
Summary
- `hf_streaming_dflash_multi_node.yaml` for MiniMax-M2.7 (229B MoE) streaming DFlash training
- `[2,17,32,47,62,64]` from `build_target_layer_ids(64, 5)` + final layer output
Resolves OMNIML-5221
Test plan
- `uv run launch.py --yaml ... --dry-run`
- `training.max_steps=1`
Signed-off-by: Ye Yu <yeyu@nvidia.com>
Summary by CodeRabbit