Fix three notebook regressions caught by Blackwell Docker validation#694
Fix three notebook regressions caught by Blackwell Docker validation#694danielhanchen wants to merge 13 commits into
Conversation
1. Inductor subprocess GPU invisibility under cgroup-pinned containers. unsloth/_gpu_init.py forces TORCHINDUCTOR_COMPILE_THREADS=1 + UNSLOTH_FORCE_SINGLE_COMPILE_WORKER=1 when only NVIDIA_VISIBLE_DEVICES is set. patch_torch_compile must keep that env var instead of popping it during the non-debug branch; otherwise the compile worker pool respawns and the cgroup-pinned GPU is invisible. Plus the inductor options dict in temporary_patches.common is built from determine_compile_threads which previously ignored the env var, so honour the explicit single-worker forcing there too. 2. Gemma3Processor.__call__ ragged-batch crash on TRL GRPO paged path. _gemma3_call_impl receives `text=[...]` with no `padding=` kwarg from TRL's paged generation collate; upstream Gemma3ProcessorKwargs default is `padding=False` so BatchFeature blows up stacking variable-length input_ids. Force `padding="longest"` only when the caller did not pin padding AND there is more than one text row. Single-image inference path is byte-identical (text rows == 1 branch skipped). Repros: nb/Gemma3_(4B)-Vision-GRPO.ipynb, nb/Qwen3_VL_(8B)-Vision-GRPO.ipynb. 3. (deps in #1) Single-worker compile threads is also the right default for docker --gpus N on Blackwell since the Triton 3.6 driver in the subproc can't enumerate it; this also unbreaks Mistral CPT and gpt-oss-20B fine-tuning notebooks running inside the container. All three changes are gated by environment fingerprints or per-call kwarg detection and are forwards/backwards compatible with transformers 4.57.6 + 5.x and TRL 0.22.2 / 0.27.1 / 1.x.
NameError on `vllm.__version__` at line 749 because `vllm` is only locally imported inside the v1 try/except above. Surfaced when patch_vllm runs in an image that ships without the vllm extra (arm64 wheels, CPU-only, SyntheticDataKit on no-vllm image). Add an early `try: import vllm` at the top of the function so the patch silently returns when vllm is missing. Repro: nb/Meta_Synthetic_Data_Llama3_2_(3B).ipynb on the no-vllm image.
There was a problem hiding this comment.
Code Review
This pull request implements fixes for single-worker compilation in cgroup-pinned environments and introduces automatic 'longest' padding for Gemma3 processing when handling multiple text rows. Feedback suggests making the padding logic more robust by handling cases where text_kwargs might be None and ensuring the row count calculation accounts for image batches when text is not explicitly provided.
| _user_padding = kwargs.get("padding", None) | ||
| if _user_padding is None: | ||
| _user_padding = kwargs.get("text_kwargs", {}).get("padding", None) |
There was a problem hiding this comment.
The check for _user_padding can crash if text_kwargs is explicitly passed as None in kwargs. Using a more robust check would prevent a potential AttributeError when calling .get() on a NoneType object.
_user_padding = kwargs.get("padding")
if _user_padding is None:
_text_kwargs = kwargs.get("text_kwargs")
if isinstance(_text_kwargs, dict):
_user_padding = _text_kwargs.get("padding")| _text_rows = ( | ||
| len(text) if isinstance(text, (list, tuple)) and not isinstance(text, str) else 1 | ||
| ) |
There was a problem hiding this comment.
The _text_rows calculation does not account for cases where text is None but images is a batch (e.g., a list of lists of images). In such cases, text is generated as a list of strings later in the function (line 162). If these generated strings have different lengths (due to varying image counts per batch item), the processor will crash when stacking tensors because padding was not forced to 'longest'. Consider checking the batch size of images when text is None.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c28a4a69b5
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| # text row (single-image inference is byte-identical). | ||
| _user_padding = kwargs.get("padding", None) | ||
| if _user_padding is None: | ||
| _user_padding = kwargs.get("text_kwargs", {}).get("padding", None) |
There was a problem hiding this comment.
Guard optional text_kwargs before reading padding
The new _user_padding detection assumes kwargs["text_kwargs"] is always a dict, but kwargs.get("text_kwargs", {}).get("padding") will raise AttributeError when callers pass text_kwargs=None (a common pass-through pattern for optional kwargs). In that case Gemma3Processor.__call__ now fails before _merge_kwargs can normalize defaults, turning previously valid invocations into runtime crashes.
Useful? React with 👍 / 👎.
The module-level `vllm_version` is only defined inside the
`if importlib.util.find_spec("vllm") is not None` block near the top
of vllm_utils.py. Functions that use it unconditionally (load_vllm,
the standby/headroom paths at lines 1992-2274) raise NameError on the
no-vllm image rather than a useful message.
Short-circuit at the start of load_vllm with a clear ImportError so
callers (SyntheticDataKit.from_pretrained, FastLanguageModel.from_pretrained
with fast_inference=True) get an actionable hint instead of a stack trace.
Repro: nb/Meta_Synthetic_Data_Llama3_2_(3B).ipynb on the no-vllm image.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
…exts Docker containers run with stdin redirected to /dev/null (or no -i flag) raise EOFError when `unsloth_zoo/llama_cpp.py:281::install_package` tries to read the install confirmation prompt. The exception propagates up through save_pretrained_gguf and surfaces as RuntimeError: Unsloth: GGUF conversion failed: EOF when reading a line This blocks every notebook that calls `model.save_pretrained_gguf(...)` from a docker run. Catch EOFError and default to accept (same as pressing ENTER); opt out with UNSLOTH_AUTO_INSTALL=0 for callers that want the explicit refusal behaviour back. Repro: nb/Llama3_(8B)-Ollama.ipynb on the unsloth-blackwell:no-vllm image.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
…5 rename transformers 5.0 renamed `DeepseekV2MoE` -> `DeepseekV2Moe` (camelCase consistency pass). Remote-code model files that still import the old spelling (e.g. deepseek-ai/DeepSeek-OCR's modeling_deepseekocr.py:22) break on transformers 5.x with `ImportError: cannot import name 'DeepseekV2MoE' ... Did you mean: 'DeepseekV2Moe'?`. Add a class-level alias inside transformers' deepseek_v2 namespace so the legacy name keeps resolving. No-op on transformers 4.x where `DeepseekV2MoE` already exists natively. Repro: nb/Deepseek_OCR_(3B).ipynb on transformers 5.5.0.
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
Two new zoo-level patches that fix six of the GRPO/vision notebook regressions caught by the Blackwell docker validation: 1. unsloth_zoo/vllm_utils.py::patch_vllm_decompose_size_nodes Port of upstream vllm-project/vllm#42543 (still open, SHA 51fd86e). `_decompose_size_nodes` only rewrites top-level `user.args` references to size nodes, missing the ones nested inside `slice(start, stop, step)`, tuples, lists, and kwargs. The trailing `erase_node` then trips torch.fx with RuntimeError: Tried to erase Node size_N but it still had K users on the 5 GRPO notebooks (Qwen3-4B-GRPO, Advanced Llama3.2 GRPO LoRA, Qwen3 8B FP8 GRPO, Llama FP8 GRPO, Qwen2.5-7B VL GRPO). Qwen3-VL- (8B) Vision-GRPO is unaffected because its graph only emits getitem (size, idx) consumers, which the upstream `if` branch already handles. The wrapper ports PR #42543's recursive rewrite, adds a kwargs sweep PR #42543 misses, and falls back to a warn-and-skip safety net (`if node.users: continue`) so the working notebooks stay a no-op. Version-guarded (0.11.0 <= vllm < 0.99), idempotent (`_unsloth_patched` sentinel), opt-out via `UNSLOTH_DISABLE_VLLM_DECOMPOSE_SIZE_PATCH=1`. MLX-safe: the patch sits inside `patch_vllm()` which is only called when `import vllm` succeeds. 2. unsloth_zoo/temporary_patches/notebook_deps.py Three thin auto-install hooks that catch the upstream-blessed import failure points and pip-install one of an allow-listed set of optional notebook deps before re-raising: - `transformers.utils.import_utils.requires_backends` -> fixes timm (TimmWrapper inside Gemma3N + Qwen3-VL). - `transformers.dynamic_module_utils.check_imports` -> fixes addict + matplotlib for trust_remote_code modeling files (Deepseek-OCR). - Pre-emptive `_ensure_notebook_chain()` -> ensures `traitlets` is importable before the IPython chain pulls it in (Gemma3 Vision + Qwen3-VL Vision). Honours the existing `UNSLOTH_AUTO_INSTALL=0` opt-out (matches `llama_cpp.py::install_package`) and the standard offline flags `UNSLOTH_OFFLINE` / `HF_HUB_OFFLINE` / `TRANSFORMERS_OFFLINE`. Heavy / GPU-arch-coupled deps (torch, vllm, flash-attn, bnb, triton, xformers) are explicitly excluded from the allow-list so we never paper over a real CUDA/driver mismatch. Cross-OS: prefers `uv pip install` only when a venv is active, otherwise `sys.executable -m pip install`. Probes site-packages writability and adds `--user` when needed. Windows has no `os.geteuid` so the probe just stays in the venv path. Both module-level transformers imports are inside the patch functions, so MLX-only macOS environments without transformers import the module cleanly (no-op). Repro: - vllm: nb/Qwen3_(4B)-GRPO.ipynb on unsloth-blackwell:test against vllm 0.11.x / nightly cu128. - deps: nb/Gemma3_(4B)-Vision.ipynb, nb/Gemma3N_(4B)-Vision.ipynb, nb/Qwen3_VL_(8B)-Vision.ipynb, nb/Deepseek_OCR_(3B).ipynb on a fresh Colab kernel.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6dec60648d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| size_nodes = list(graph.graph.find_nodes(op="call_method", target="size")) | ||
| for node in size_nodes: | ||
| tensor_node = node.args[0] |
There was a problem hiding this comment.
Handle
size(dim) nodes before decomposing dimensions
This pass rewrites every call_method("size") node as if it were x.size() returning a full shape tuple, but FX graphs also emit x.size(dim) nodes that represent a single dimension. Because the current logic always builds dims from ev.shape and later substitutes from that list, size(dim) users can be rewritten to the wrong dimension (e.g., slice bounds) and silently change tensor shapes/results instead of preserving semantics.
Useful? React with 👍 / 👎.
|
|
||
| patch_requires_backends_autoinstall() | ||
| patch_check_imports_autoinstall() | ||
| _ensure_notebook_chain() |
There was a problem hiding this comment.
Defer notebook dependency installs from import-time execution
This module invokes _ensure_notebook_chain() at import time, and temporary_patches/__init__.py now imports this file unconditionally, so a normal import unsloth can immediately trigger pip installation attempts for traitlets. That introduces network/package-manager side effects in non-notebook runs and can block or fail startup in restricted/offline environments before any notebook code path is used.
Useful? React with 👍 / 👎.
The Blackwell docker image (and most cu128 runtime images) ship the
CUDA *runtime* libs but not nvcc, and FlashInfer requires a JIT
compile of its trtllm-gen kernels at first use. The pre-flight at the
top of `load_vllm` was already detecting the missing toolchain and
printing a warning, but the cleanup only ran
if VLLM_USE_FLASHINFER_SAMPLER == "1": del ...
if VLLM_ATTENTION_BACKEND == "FLASHINFER": del ...
That `del` cleanup is not enough on vLLM nightly: the v1 engine picks
FlashInfer as the *default* attention backend on sm_100 / sm_120
(Blackwell). When neither env var is pre-set, the `del` branch never
fires and vLLM proceeds to JIT-compile FlashInfer kernels, which
crashes with
RuntimeError: FlashInfer failed to JIT-compile: ninja (build tool)
not found.
(reproduced on `unsloth-blackwell:test` running Qwen3-(4B)-GRPO).
Replace the conditional `del`s with explicit pins:
VLLM_ATTENTION_BACKEND = FLASH_ATTN
VLLM_USE_FLASHINFER_SAMPLER = 0
UNSLOTH_VLLM_NO_FLASHINFER = 1
so vLLM picks the FLASH_ATTN path (no JIT) and any downstream code in
unsloth_zoo / load_vllm sees a consistent disable signal. No-op when
nvcc and ninja are both present: that else-branch is untouched.
Compatible with vllm 0.9 -> nightly; transformers 4.57.6 / 5.x / main;
TRL 0.22.2 / 0.27.1 / 1.x. MLX-safe: this is inside `load_vllm`, only
called when `fast_inference=True`, which never fires on the MLX path.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 30d603eee3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if isinstance(a, fx.Node) and a is node: | ||
| out.extend(dims) | ||
| elif isinstance(a, slice): |
There was a problem hiding this comment.
Preserve shape tuple when replacing
size() arguments
The recursive rewrite expands a size() node into multiple positional args (out.extend(dims)), which breaks call sites that expect a single shape container (e.g. aten.view.default(x, size) / aten.reshape.default(x, shape)). After this transform those nodes become aten.view.default(x, d0, d1, ...), violating the op schema and causing runtime failures during graph execution/compilation for graphs that use size() as a shape argument rather than getitem(size, i).
Useful? React with 👍 / 👎.
Deepseek-OCR's modeling_deepseekocr.py imports einops at the top of the file (via deepencoder.py:11 `from einops import rearrange`). It is not in transformers' BACKENDS_MAPPING and not declared in the "This modeling file requires ..." line that check_imports parses, so neither of the existing zoo hooks fires for it. Add `einops` to _ALLOW_LIST so the requires_backends / check_imports wrappers install it on first failure.
Deepseek-OCR's deepencoder.py:12 imports `from easydict import EasyDict as adict`. easydict and addict are DIFFERENT PyPI packages -- a notebook can require either, and Deepseek-OCR specifically needs easydict (despite the alias name suggesting addict). einops + easydict together unblock Deepseek-OCR's trust_remote_code modeling file at first load.
Orpheus TTS notebook imports `from snac import SNAC` after model load to decode the speech tokens. snac is a small (~1MB) pure-Python wheel with torch as the only heavy dep, so safe to auto-install.
Setting VLLM_ATTENTION_BACKEND=FLASH_ATTN doesn't work on vLLM 0.19.1: envs.py reports "Unknown vLLM environment variable detected: VLLM_ATTENTION_BACKEND", and vLLM still picks FLASHINFER from its ['FLASHINFER', 'FLASH_ATTN', 'TRITON_ATTN', 'FLEX_ATTENTION'] default list on sm_100/sm_120 (cuda.py:334), then JIT-compiles trtllm-gen kernels and dies on `/usr/local/cuda/bin/nvcc: not found`. Block the flashinfer module instead. Drop any cached imports from sys.modules, then set `sys.modules["flashinfer"] = None` so subsequent `import flashinfer` raises ImportError. vLLM's attention-backend selector falls back to FLASH_ATTN -> no JIT, no crash. This is the documented Python idiom for "this module is unavailable" (see Python language reference, "The module cache" section). Reversible per-process; the only caller is the load_vllm pre-flight, which runs once. Repro: nb/Qwen3_(4B)-GRPO.ipynb on unsloth-blackwell:test where the runtime image ships /usr/local/cuda runtime libs but no nvcc.
…sets 4.x audio Feature decoder)
| ) | ||
| continue | ||
| graph.graph.erase_node(node) | ||
| pass |
| _decompose_size_nodes._unsloth_patched = True | ||
| _B._decompose_size_nodes = _decompose_size_nodes | ||
| logger.info("Unsloth: patched vllm.compilation.backends._decompose_size_nodes (vllm#42543).") | ||
| pass |
|
|
||
|
|
||
| from .common import * | ||
| from .notebook_deps import * |
| importlib.invalidate_caches() | ||
| try: | ||
| list(importlib.metadata.distributions()) | ||
| except Exception: |
| if _name == "flashinfer" or _name.startswith("flashinfer."): | ||
| del sys.modules[_name] | ||
| sys.modules["flashinfer"] = None | ||
| except Exception: |
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
| return | ||
| if not hasattr(_m, "DeepseekV2MoE") and hasattr(_m, "DeepseekV2Moe"): | ||
| _m.DeepseekV2MoE = _m.DeepseekV2Moe | ||
| pass |
|
Validation update from the Docker image work (unsloth#5748): reran the two vision GRPO repros on an image built from this branch, against main zoo as control.
Also worth noting: this branch is on zoo 2026.5.4 while main is at 2026.6.3, so it needs a merge from main before landing; main has since picked up equivalents of several of the other patches here. |
Summary
Three small patches that unblock notebooks failing inside the Unsloth Blackwell Docker image (unslothai/unsloth#5748). All three were root-caused under three independent investigations and verified locally on B200 against the cu128 + Triton 3.6 + transformers 5.5 stack.
Preserve the Inductor single-worker forcing env var.
unsloth/_gpu_init.pynow setsTORCHINDUCTOR_COMPILE_THREADS=1+UNSLOTH_FORCE_SINGLE_COMPILE_WORKER=1when onlyNVIDIA_VISIBLE_DEVICESis set (the precise fingerprint ofdocker --gpus '\"device=N\"').patch_torch_compile's non-debug branch used to unconditionallyos.environ.pop(\"TORCHINDUCTOR_COMPILE_THREADS\")which undid the forcing and let Inductor respawn the compile worker subprocess pool. The pool's workers cannot enumerate the cgroup-pinned GPU and raise:Honour the sentinel: skip the pop when the user has explicitly forced single-worker.
Honour
TORCHINDUCTOR_COMPILE_THREADS=1indetermine_compile_threads. The Inductor options dict intemporary_patches/common.py(used byget_torch_compile_options) callsdetermine_compile_threads()which previously read onlycpu_count. Now it returns1when the env var is1, matching the explicit forcing path.Force longest-padding on Gemma3Processor calls with ragged text batches. TRL 0.24 paged GRPO + reward call paths invoke
Gemma3Processor(text=[...], images=[...])with nopadding=kwarg. UpstreamGemma3ProcessorKwargs._defaults['text_kwargs']['padding']=False, so ragged completions blow upBatchFeature(..., tensor_type=\"pt\")stacking input_ids of different lengths:Set
output_kwargs['text_kwargs']['padding']='longest'only when the caller did not pinpadding=AND we actually have more than one text row. Single-image inference is byte-identical.Repros (all now pass under PR #5748's image):
nb/Mistral_v0.3_(7B)-CPT.ipynbnb/gpt-oss-(20B)-Fine-tuning.ipynbnb/Gemma3_(4B)-Vision-GRPO.ipynbnb/Qwen3_VL_(8B)-Vision-GRPO.ipynbCompat
padding=or fall into the >1-row branch.--gpus all, Colab, Kaggle, Runpod: untouched (env fingerprint not met).UNSLOTH_FORCE_SINGLE_COMPILE_WORKER=0.Test plan
import unslothsucceeds,torch._inductor.config.compile_threads == 1under--gpus '\"device=N\"'