
[DO NOT MERGE][llm][ci] Test vllm's CUDA_VISIBLE_DEVICES fix #64189

Draft: jeffreywang88 wants to merge 1 commit into master from vllm-cvd
Conversation

@jeffreywang88

Contributor

TEST ONLY; DO NOT MERGE

Description

Briefly describe what this PR accomplishes and why it's needed.

Related issues

Link related issues: "Fixes #1234", "Closes #1234", or "Related to #1234".

Additional information

Optional: Add implementation details, API changes, usage examples, screenshots, etc.

… release tests

Clone the cvd-fix branch (jeffreywang88/vllm) -- vllm-project/vllm#45026's
net diff cherry-picked onto releases/v0.23.0 -- and overlay its 21 runtime
vllm/*.py files onto the installed vllm 0.23.0 wheel, in both
docker/ray-llm/Dockerfile and ci/docker/llm.build.Dockerfile.

PR #45026 ("stop setting CUDA_VISIBLE_DEVICES internally, add --device-ids")
supersedes the vllm-cuda-visible-devices-patch (vLLM #44466), which took the
opposite approach; the two cannot coexist, so the patch step, its *.wanda.yaml
srcs entries, and the patch file are removed. The PR's
sm100_cutlass_mla_kernel.cu change is omitted (needs a wheel recompile and only
affects SM100/MLA).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Jeffrey Wang <jeffreywang@anyscale.com>
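
The overlay step described in the commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the PR's actual Dockerfile contents: the helper name `overlay_files` is hypothetical, and it assumes the cvd-fix branch has already been cloned to a source directory and that every target subdirectory already exists in the installed package.

```shell
# Hypothetical helper (not from the PR): copy a list of relative paths from
# a source tree (e.g. a checkout of the cvd-fix branch) over a destination
# tree (e.g. the installed vllm package directory).
overlay_files() {
    local src_root dst_root f
    src_root="$1"
    dst_root="$2"
    shift 2
    for f in "$@"; do
        # Fail loudly if the checkout does not contain an expected file.
        if [ ! -f "${src_root}/${f}" ]; then
            echo "missing overlay source: ${src_root}/${f}" >&2
            return 1
        fi
        # Overwrite the installed file with the patched one.
        cp "${src_root}/${f}" "${dst_root}/${f}" || return 1
        echo "overlaid ${f}"
    done
}
```

With the paths that appear in the review snippets, a call might look like `overlay_files /tmp/vllm-cvd-overlay/vllm "${VLLM_SITE}" v1/executor/ray_utils.py v1/worker/gpu_worker.py v1/worker/worker_base.py`, assuming `VLLM_SITE` points at the installed vllm package directory.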
@jeffreywang88 added the "go" label (add ONLY when ready to merge, run all tests) on Jun 17, 2026

@gemini-code-assist (bot, Contributor) left a comment

Code Review

This pull request replaces the old vllm-cuda-visible-devices-patch by cloning a specific branch of vllm (cvd-fix) and overlaying the modified Python files directly onto the installed vllm package in both ci/docker/llm.build.Dockerfile and docker/ray-llm/Dockerfile. The review feedback recommends adding mkdir -p before copying these files to ensure that any missing target subdirectories are created, thereby preventing potential build failures.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

    v1/executor/ray_utils.py \
    v1/worker/gpu_worker.py \
    v1/worker/worker_base.py; do
    cp "/tmp/vllm-cvd-overlay/vllm/${f}" "${VLLM_SITE}/${f}"


high

If any of the target subdirectories (such as v1/engine or distributed/kv_transfer/...) do not exist in the installed vllm package, the cp command will fail with a No such file or directory error. Creating the destination directory structure using mkdir -p before copying avoids this potential build failure.

    mkdir -p "$(dirname "${VLLM_SITE}/${f}")"
    cp "/tmp/vllm-cvd-overlay/vllm/${f}" "${VLLM_SITE}/${f}"
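
The failure mode described in this comment, and the effect of the suggested `mkdir -p`, can be reproduced in a scratch directory. The following is a minimal sketch: the helper name `copy_with_parents` and the file names `source.py` and `example.py` are made up for illustration and do not come from the PR.

```shell
# Hypothetical helper mirroring the suggested two-line loop body:
# create the destination's parent directory, then copy.
copy_with_parents() {
    local src dst
    src="$1"
    dst="$2"
    mkdir -p "$(dirname "$dst")"
    cp "$src" "$dst"
}

scratch="$(mktemp -d)"
echo "patched" > "${scratch}/source.py"
target="${scratch}/site/v1/engine/example.py"

# Plain cp fails here because the parent directory does not exist yet.
if cp "${scratch}/source.py" "$target" 2>/dev/null; then
    plain_cp="succeeded"
else
    plain_cp="failed"
fi

# With mkdir -p first, the same copy goes through.
copy_with_parents "${scratch}/source.py" "$target"
echo "plain cp: ${plain_cp}"
```

`mkdir -p` does nothing when the directory already exists, so adding it does not change the behavior for files whose target directories are already present in the installed package.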

Comment thread docker/ray-llm/Dockerfile
    v1/executor/ray_utils.py \
    v1/worker/gpu_worker.py \
    v1/worker/worker_base.py; do
    cp "/tmp/vllm-cvd-overlay/vllm/${f}" "${VLLM_SITE}/${f}"


high

If any of the target subdirectories (such as v1/engine or distributed/kv_transfer/...) do not exist in the installed vllm package, the cp command will fail with a No such file or directory error. Creating the destination directory structure using mkdir -p before copying avoids this potential build failure.

    mkdir -p "$(dirname "${VLLM_SITE}/${f}")"
    cp "/tmp/vllm-cvd-overlay/vllm/${f}" "${VLLM_SITE}/${f}"
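
One possible complement to the suggestion, sketched here as an assumption rather than something the PR or the reviewer proposes, is to report which target directories are absent before copying. Since `mkdir -p` creates missing directories silently, a listing like this could make it visible when the branch's layout differs from the installed wheel. The helper name `list_missing_parents` is hypothetical.

```shell
# Hypothetical helper (not from the PR): print each relative path whose
# parent directory is absent under the destination root.
# Prints nothing if every parent directory exists.
list_missing_parents() {
    local dst_root f parent
    dst_root="$1"
    shift
    for f in "$@"; do
        parent="$(dirname "${dst_root}/${f}")"
        if [ ! -d "$parent" ]; then
            echo "$f"
        fi
    done
}
```

Whether an absent directory should abort the build or simply be created is a judgment call that depends on how closely the overlay branch is expected to track the installed version.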

@jeffreywang88

Contributor Author

Release test with latest fix (efdcc25) from this vllm PR: https://buildkite.com/ray-project/release/builds/97608/canvas


Labels

go add ONLY when ready to merge, run all tests


Development

Successfully merging this pull request may close these issues.

Ray fails to serialize self-reference objects

1 participant