
scripts: complete slime-exact port of all scripts + gpt-oss 20B support #260

Open
aoshen02 wants to merge 1 commit into vllm-project:main from aoshen02:scripts/gb300-complete-port

aoshen02 (Collaborator) commented Jun 16, 2026

Summary

This PR consolidates four work streams:

1. slime-exact translation of run scripts (original scope)

  • 23 new scripts + 6 existing updated to match slime@cutoff
  • sglang→vllm prefix swap, _slime→_vime checkpoint paths, EP boolean conversion, speculative config merged into JSON
  • Translation rules per SGLANG_TO_VLLM_TRANSLATION.md

2. GPT-OSS 20B support

Three fixes required to run GPT-OSS 20B RLHF on vLLM backend:

  • hf_weight_iterator_bridge.py: match the Megatron-Bridge 0.5.0 API. maybe_modify_converted_hf_weight gained a fourth parameter, hf_state_dict, but the monkey-patch accepted only three, causing a TypeError during weight sync. The same fix was submitted upstream as THUDM/slime#2113 (fix(gpt-oss): update _patch_bridge_expert_cache_to_cpu to match Megatron-Bridge API).
  • --hf-checkpoint fused BF16: vLLM _load_weights_other expects a fused gate_up_proj of shape [E, hidden, 2×ffn]. The old per-expert split format causes a KeyError on bias loading. tools/convert_gpt_oss_to_fused.py converts the checkpoint without re-running the slow MXFP4 dequantization.
  • --qkv-format bshd: combining the GPT-OSS learnable softmax with qkv_format=thd disables all Transformer Engine (TE) attention backends; bshd avoids this. --use-dynamic-batch-size is replaced with --seq-length 10240.

3. Restore deleted examples and scripts (from PR #220)

  • examples/coding_agent_rl/, examples/geo3k_vlm/, examples/multi_agent/, examples/train_infer_mismatch_helper/
  • scripts/run-glm4.7-30B-A3B.sh, run-glm4.7-355B-A32B.sh, run-minimax-m2.sh, run-qwen3-30B-A3B.sh

4. Precise pkill pattern (all scripts)

Replace pkill -9 vllm with pkill -9 -f '[v]llm serve|VLL[M]::', which targets only vllm serve processes and Ray VLLM:: actors and avoids accidental kills in colocated mode.
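
To see what the new pattern selects, the sketch below runs the same extended regular expression through grep -E against a few sample command lines. The command lines are illustrative assumptions, not captured from a real node. The bracket forms ([v], [M]) keep the pattern from matching a command line that merely quotes the pattern text, such as a wrapper shell.

```shell
# Same ERE that `pkill -f` would match against each full command line.
pattern='[v]llm serve|VLL[M]::'

# Returns 0 when the given command line matches the pattern.
matches() { printf '%s\n' "$1" | grep -Eq -e "$pattern"; }

# Sample command lines (assumed for illustration only).
for cmdline in \
  'vllm serve /models/demo --port 8000' \
  'VLLM::EngineCore' \
  'python3 train.py --vllm-enable-expert-parallel' \
  "bash -c pkill -9 -f '[v]llm serve|VLL[M]::'"
do
  if matches "$cmdline"; then verdict='would be killed'; else verdict='left alone'; fi
  printf '%-16s %s\n' "$verdict" "$cmdline"
done
```

Only the first two samples match. The training process and the wrapper shell that quotes the pattern are left alone.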

Test plan

  • run-gpt-oss-20B.sh: validate rollout starts and weight sync completes (step 1)
  • Other scripts: bash -n syntax check

🤖 Generated with Claude Code

gemini-code-assist (Bot) left a comment


Code Review

This pull request adds and updates several training and rollout shell scripts for various models, including Qwen, Kimi-K2, DeepSeek-R1, and GLM, to support low-precision training (INT4 and FP8) and integrate vLLM. The review feedback highlights several critical issues, including a missing trailing backslash in run-kimi-k2-Instruct.sh that breaks the Ray job submission, incorrect relative source paths for model configurations across multiple scripts, leftover paths and package names from the 'slime' repository, a typo in the Python buffering environment variable, and a leading blank line before the shebang in run-mimo-7B-rl-eagle.sh.

--actor-num-nodes 32 \
--actor-num-gpus-per-node 8 \
--colocate \
--update-weight-buffer-size $(( 4 * 512 * 1024 * 1024))


critical

The line is missing a trailing backslash \. This will cause the shell to treat the subsequent lines as a separate command, breaking the ray job submit execution.

Suggested change
- --update-weight-buffer-size $(( 4 * 512 * 1024 * 1024))
+ --update-weight-buffer-size $(( 4 * 512 * 1024 * 1024)) \
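
A missing continuation backslash is still valid shell syntax, so the bash -n check listed in the test plan would not be expected to catch it. The sketch below writes a small stand-in script (submit is a hypothetical function replacing ray job submit, and --some-later-flag is a placeholder) and compares the syntax check with an actual run.

```shell
# Build a stand-in script whose second argument line lacks the trailing backslash.
demo_script="$(mktemp)"
cat > "$demo_script" <<'EOF'
submit() { echo "submit received $# args: $*"; }
submit --colocate \
  --update-weight-buffer-size $(( 4 * 512 * 1024 * 1024))
  --some-later-flag 1
EOF

# Syntax check only: parses the script without executing it.
bash -n "$demo_script"; syntax_status=$?

# Real run: the orphaned flag line is executed as its own command.
run_output="$(bash "$demo_script" 2>&1)"; run_status=$?

echo "bash -n exit status: $syntax_status"
echo "run exit status:     $run_status"
echo "$run_output"
rm -f "$demo_script"
```

The syntax check exits 0, while the run ends with status 127 (command not found) after submit has already been invoked without the later flags.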

# --global-batch-size 256

--over-sampling-batch-size 256
--dynamic-sampling-filter-path slime.rollout.filter_hub.dynamic_sampling_filters.check_reward_nonzero_std


high

The package has been renamed/translated from slime to vime (as seen in the codebase structure, e.g., vime/rollout/vllm_rollout.py). Using slime.rollout... will result in a ModuleNotFoundError. Please update this path to use vime instead of slime.

Suggested change
- --dynamic-sampling-filter-path slime.rollout.filter_hub.dynamic_sampling_filters.check_reward_nonzero_std
+ --dynamic-sampling-filter-path vime.rollout.filter_hub.dynamic_sampling_filters.check_reward_nonzero_std


ray job submit --address="http://127.0.0.1:8265" \
--runtime-env-json="${RUNTIME_ENV_JSON}" \
-- python3 /personal/slime/slime/train.py \


high

The script is executing /personal/slime/slime/train.py which is a leftover path from the slime repository. It should be updated to train.py to run the vime training script in the current workspace, consistent with the other run scripts.

Suggested change
- -- python3 /personal/slime/slime/train.py \
+ -- python3 train.py \

echo "HAS_NVLINK: $HAS_NVLINK (detected $NVLINK_COUNT NVLink references)"

SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
source "${SCRIPT_DIR}/../scripts/models/qwen3-30B-A3B.sh"


high

The source path ../scripts/models/qwen3-30B-A3B.sh is incorrect. Since this script is located in scripts/low_precision/, .. resolves to scripts/, making the path scripts/scripts/models/... which does not exist. It should be ../models/qwen3-30B-A3B.sh.

Suggested change
- source "${SCRIPT_DIR}/../scripts/models/qwen3-30B-A3B.sh"
+ source "${SCRIPT_DIR}/../models/qwen3-30B-A3B.sh"
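
The path arithmetic in this comment can be checked with a throwaway directory tree. The sketch below assumes the layout the review describes (scripts/low_precision/ next to scripts/models/) and sets SCRIPT_DIR by hand to the value the BASH_SOURCE idiom would produce for a script in scripts/low_precision/.

```shell
# Recreate the assumed layout in a temporary directory.
tmp_root="$(mktemp -d)"
mkdir -p "$tmp_root/scripts/low_precision" "$tmp_root/scripts/models"
: > "$tmp_root/scripts/models/qwen3-30B-A3B.sh"

# Value SCRIPT_DIR would take for a script living in scripts/low_precision/.
SCRIPT_DIR="$tmp_root/scripts/low_precision"

wrong="${SCRIPT_DIR}/../scripts/models/qwen3-30B-A3B.sh"   # resolves to scripts/scripts/models/...
right="${SCRIPT_DIR}/../models/qwen3-30B-A3B.sh"           # resolves to scripts/models/...

if [ -f "$wrong" ]; then wrong_exists=yes; else wrong_exists=no; fi
if [ -f "$right" ]; then right_exists=yes; else right_exists=no; fi

echo "../scripts/models path exists: $wrong_exists"
echo "../models path exists:         $right_exists"
rm -rf "$tmp_root"
```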

echo "HAS_NVLINK: $HAS_NVLINK (detected $NVLINK_COUNT NVLink references)"

SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
source "${SCRIPT_DIR}/../scripts/models/qwen3-4B.sh"


high

The source path ../scripts/models/qwen3-4B.sh is incorrect. Since this script is located in scripts/low_precision/, .. resolves to scripts/, making the path scripts/scripts/models/... which does not exist. It should be ../models/qwen3-4B.sh.

Suggested change
- source "${SCRIPT_DIR}/../scripts/models/qwen3-4B.sh"
+ source "${SCRIPT_DIR}/../models/qwen3-4B.sh"


SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &>/dev/null && pwd)"
- source "${SCRIPT_DIR}/models/qwen2.5-0.5B.sh"
+ source "${SCRIPT_DIR}/scripts/models/qwen2.5-0.5B.sh"


high

The source path was incorrectly changed to ${SCRIPT_DIR}/scripts/models/.... Since this script is located in scripts/, ${SCRIPT_DIR} is already scripts/, making the path scripts/scripts/models/... which does not exist. It should be reverted to ${SCRIPT_DIR}/models/qwen2.5-0.5B.sh.

Suggested change
- source "${SCRIPT_DIR}/scripts/models/qwen2.5-0.5B.sh"
+ source "${SCRIPT_DIR}/models/qwen2.5-0.5B.sh"
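
Because the same class of path mistake shows up in several scripts, one option is a small guard that reports the missing file by name before sourcing it. This is a sketch only: source_model_config is a hypothetical helper, not something in the PR, and the demo config file is invented for the example.

```shell
# Hypothetical helper: source a model config only if the file exists.
source_model_config() {
  cfg_path="$1"
  if [ ! -f "$cfg_path" ]; then
    echo "model config not found: $cfg_path" >&2
    return 1
  fi
  . "$cfg_path"   # POSIX spelling of `source`
}

# Demo layout with a single invented config file.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/scripts/models"
echo 'MODEL_NAME=demo-model' > "$demo_dir/scripts/models/demo.sh"
SCRIPT_DIR="$demo_dir/scripts"

# Correct path loads; the doubled scripts/ path is reported and rejected.
source_model_config "${SCRIPT_DIR}/models/demo.sh" && echo "loaded: $MODEL_NAME"
source_model_config "${SCRIPT_DIR}/scripts/models/demo.sh" || echo "bad path rejected"
rm -rf "$demo_dir"
```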

Comment thread scripts/run-minimax-m2.sh Outdated
set -ex

- export PYTHONUNBUFFERED=1
+ export PYTHONBUFFERED=16


high

The environment variable PYTHONBUFFERED=16 is a typo. The standard Python environment variable to control buffering is PYTHONUNBUFFERED (typically set to 1 to disable buffering). Python does not recognize PYTHONBUFFERED.

Suggested change
- export PYTHONBUFFERED=16
+ export PYTHONUNBUFFERED=1
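
Since the variable name differs from the recognized one by only two letters, a plain grep over the run scripts is one way to find other occurrences. The sketch below uses two invented demo files; against the repository the same grep would be pointed at the scripts directory instead.

```shell
# Two invented scripts: one with the recognized variable, one with the typo.
demo_dir="$(mktemp -d)"
printf '%s\n' '#!/bin/bash' 'export PYTHONUNBUFFERED=1' > "$demo_dir/good.sh"
printf '%s\n' '#!/bin/bash' 'export PYTHONBUFFERED=16'  > "$demo_dir/typo.sh"

# PYTHONUNBUFFERED does not contain the substring PYTHONBUFFERED,
# so a plain fixed-string search flags only the typo.
flagged="$(grep -lF 'PYTHONBUFFERED' "$demo_dir"/*.sh)"
flagged_name="$(basename "$flagged")"

echo "scripts setting the unrecognized variable: $flagged_name"
rm -rf "$demo_dir"
```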

Comment on lines +1 to +2

#!/bin/bash


medium

There is a leading empty line before the shebang #!/bin/bash. For a shebang to be recognized by the OS kernel when executing the script directly, it must be the absolute first line of the file.

Suggested change
- (empty line)
- #!/bin/bash
+ #!/bin/bash
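
A quick way to check this across many scripts is to compare the first two bytes of each file with #!. The sketch below is an assumption about how such a check could look, using a hypothetical helper and two invented files.

```shell
# Hypothetical helper: succeeds only if the file's first two bytes are "#!".
has_leading_shebang() {
  [ "$(head -c 2 "$1")" = '#!' ]
}

# Two invented scripts: shebang on line 1, and shebang after an empty line.
demo_dir="$(mktemp -d)"
printf '#!/bin/bash\necho ok\n'   > "$demo_dir/good.sh"
printf '\n#!/bin/bash\necho ok\n' > "$demo_dir/blank_first.sh"

for f in "$demo_dir"/*.sh; do
  if has_leading_shebang "$f"; then
    echo "ok: $(basename "$f")"
  else
    echo "shebang not on line 1: $(basename "$f")"
  fi
done
rm -rf "$demo_dir"
```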

Comment thread scripts/run-minimax-m2.sh Outdated
# 229B MoE, 256 experts -> requires many GPUs
# Typical config: TP=2, PP=2, EP=4, training side 16 GPUs (2 nodes x 8 GPUs)
- # Inference side: vLLM on separate GPUs, EP=16+
+ # Inference side: SGLang on separate GPUs, EP=16+


medium

The comment was updated to refer to SGLang instead of vLLM. Since this PR is migrating the codebase from SGLang to vLLM, this comment is backwards and misleading. It should refer to vLLM.

Suggested change
- # Inference side: SGLang on separate GPUs, EP=16+
+ # Inference side: vLLM on separate GPUs, EP=16+

aoshen02 force-pushed the scripts/gb300-complete-port branch 2 times, most recently from d5c572e to e5d6f3a, June 16, 2026 14:50
aoshen02 closed this Jun 16, 2026
aoshen02 force-pushed the scripts/gb300-complete-port branch from e5d6f3a to 2864b34, June 16, 2026 14:53
Translate all slime scripts to vime following SGLANG_TO_VLLM_TRANSLATION.md:
- sglang→vllm prefix swap for CLI flags and variables
- _slime→_vime for checkpoint paths
- EP: --sglang-ep-size N → --vllm-enable-expert-parallel (boolean)
- Speculative: multi-param → --vllm-speculative-config JSON (§5.2)
- Delete genuinely sglang-coupled params (DP-attention, DeepEP, NSA, etc.)
- flashinfer → FLASHINFER case fix (§2.4)

23 new scripts + 6 existing updated to match slime@cutoff.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
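
The first three rules in the commit message above can be sketched as a sed filter. This is an illustration of the rule ordering only, not the procedure in SGLANG_TO_VLLM_TRANSLATION.md: translate_flags is a hypothetical helper, --sglang-placeholder-flag is an invented flag, and the speculative-config and deletion rules are left out because their details are not given here. The EP rule runs first so that the generic prefix swap does not turn --sglang-ep-size into a vllm-prefixed flag with a numeric value.

```shell
# Hypothetical filter covering three of the listed rules.
translate_flags() {
  sed -E \
    -e 's/--sglang-ep-size[ =][0-9]+/--vllm-enable-expert-parallel/g' \
    -e 's/--sglang-/--vllm-/g' \
    -e 's/_slime/_vime/g'
}

# Invented input line; only --sglang-ep-size appears in the commit message.
translated="$(printf '%s\n' \
  '--sglang-ep-size 8 --sglang-placeholder-flag 1 --save /ckpt/demo_slime' \
  | translate_flags)"
echo "$translated"
```

The sketch converts any numeric EP size to the boolean flag. Whether a size of 1 should instead drop the flag is not stated in the commit message.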
aoshen02 reopened this Jun 16, 2026
aoshen02 mentioned this pull request Jun 21, 2026
15 tasks
aoshen02 changed the title from "scripts: complete slime-exact translation of all 29 run scripts" to "scripts: complete slime-exact port of all scripts + gpt-oss 20B support" Jun 21, 2026