Studio: pin GPU at 95% headroom and warn on silent CPU fallback #76
danielhanchen wants to merge 4 commits into
Two related runtime-side fixes for unslothai#5106 ("model loaded fully on RAM instead of VRAM"):

1. GPU pin threshold bump 0.90 -> 0.95
-------------------------------------

``_select_gpus`` and the auto-ctx pin loop in ``start_llama_server`` used a ``pool * 0.90`` threshold to decide whether the model fits on GPU. Models that needed 91-94% of free VRAM were classified as "does not fit", so Studio set ``gpu_indices = None`` and shipped ``--fit on`` to llama-server without ``-ngl``.

The unsloth llama.cpp fork's ``--fit on`` then ran with its default ``--fit-target 1024`` (1 GiB margin per device, an upstream default inherited from ggml-org#18679). On a tight fit where compute buffers + CUDA context push the projected free below the 1 GiB target, the fork's fit logic shaves layer weights off the GPU -- slow inference for users whose models would have loaded comfortably with ``-ngl -1``.

The classic reproducer from unslothai#5106 (noahterbest's log):

    GGUF size: 20.8 GB, est. KV cache: 0.1 GB, context: 4096, GPUs free: [(0, 22805)], selected: None, fit: True

20.8 GiB on a 22.27 GiB free RTX 4090 is 94% utilization. The model fits (1.4 GiB headroom), but the 0.90 threshold kicks it to fit mode. Bumping to 0.95 keeps these in the fits-on-GPU branch and emits ``-ngl -1`` directly. The fork's ``--fit on`` still serves as the safety net for the genuinely-too-large case.

The auto-ctx fallback also re-checks fit at 4096 before handing off to ``--fit on``: a 20.8 GiB model with a 131072 native context fails the auto loop at native ctx, falls back to ``min(4096, ctx)``, but its weights + 4096 KV pin to the GPU comfortably. Without the re-check we still emitted ``--fit on``.

``_fit_context_to_vram``'s 0.90 budget for context binary search is intentionally left tighter than the pin fraction. That routine chooses the slider value, where over-promising would OOM at runtime. ``_select_gpus`` decides whether to pin at all, where being conservative pushes layers to CPU.

2. Belt-and-suspenders: warn on silent CPU fallback
---------------------------------------------------

After ``_wait_for_health`` succeeds, scan llama-server's stdout for ``model buffer size`` lines. If Studio detected GPUs and intended GPU use but only CPU buffers were allocated, log a structured warning citing unslothai#5106. Markers cover CUDA / ROCm / Metal / Vulkan / OpenCL / SYCL backends. New ``_gpu_offload_active: Optional[bool]`` field surfaces the result for any future API consumer.

This catches runtime-load failures the install-time fix cannot cover (cudart bundle pairing PR unslothai#5322 is the install-side companion): user overriding ``--fit-target``, uncommon driver + toolkit configurations, future regressions in the install path.

Tests: 10 new cases in studio/backend/tests/test_llama_cpp_context_fit.py:

* TestTightFitPinsToGPU x3: noahterbest's exact reproducer (auto and explicit ctx pins to GPU at 94%); guard against threshold over-broadening (genuine overflow still falls back to ``--fit on``).
* TestClassifyGpuOffload x7: CUDA / ROCm / Metal buffer markers return True; CPU-only buffer lines return False; absent buffer lines or no GPUs detected return None (no warning).

25 context-fit tests pass (15 baseline + 10 new). 511 tests total across the affected test files. No regressions.

Refs unslothai#5106
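The threshold arithmetic for the reproducer can be checked by hand. The sketch below is only an illustration: ``fits_on_gpu`` is a hypothetical helper (the real check lives inside ``_select_gpus``), and it assumes the log's "GB" figures are GiB, as the surrounding text does.

```python
# Hypothetical helper for illustration; not the function used in llama_cpp.py.
MIB_PER_GIB = 1024


def fits_on_gpu(total_mib: float, pool_mib: float, pin_fraction: float) -> bool:
    """True when weights + KV cache fit inside the pinned share of free VRAM."""
    return total_mib <= pool_mib * pin_fraction


# Numbers from the reproducer log: 20.8 weights + 0.1 KV cache, 22805 MiB free.
# Assumption: the log's "GB" values are treated as GiB.
total_mib = (20.8 + 0.1) * MIB_PER_GIB
pool_mib = 22805

utilization = total_mib / pool_mib
print(f"utilization: {utilization:.1%}")                             # about 94%
print("fits at 0.90:", fits_on_gpu(total_mib, pool_mib, 0.90))       # False
print("fits at 0.95:", fits_on_gpu(total_mib, pool_mib, 0.95))       # True
```

At 0.90 the budget is roughly 20524 MiB, which the model exceeds; at 0.95 it is roughly 21665 MiB, which the model fits under.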
/gemini review
Code Review
This pull request increases the VRAM utilization threshold for pinning models to the GPU from 90% to 95% and introduces a fallback mechanism that attempts to fit models with a reduced context length (4096) before deferring to CPU offloading. It also adds a diagnostic feature to detect and warn about silent CPU fallbacks by parsing server logs for GPU buffer markers. The review feedback identifies opportunities to improve efficiency by moving constant calculations out of loops in both the core logic and the test suite.
    for n_gpus in range(1, len(ranked) + 1):
        subset = ranked[:n_gpus]
        pool_mib = sum(free for _, free in subset)
        kv = self._estimate_kv_cache_bytes(
            effective_ctx,
            cache_type_kv,
            n_parallel = n_parallel,
        )
        total_mib = (model_size + kv) / (1024 * 1024)
        if total_mib <= pool_mib * pin_fraction:
In this re-check loop, effective_ctx is fixed (either 4096 or its original value if smaller). Consequently, the KV cache estimation and the resulting total_mib are constant across all iterations of the GPU subset loop. Moving these calculations outside the loop improves efficiency slightly.
-for n_gpus in range(1, len(ranked) + 1):
-    subset = ranked[:n_gpus]
-    pool_mib = sum(free for _, free in subset)
-    kv = self._estimate_kv_cache_bytes(
-        effective_ctx,
-        cache_type_kv,
-        n_parallel = n_parallel,
-    )
-    total_mib = (model_size + kv) / (1024 * 1024)
-    if total_mib <= pool_mib * pin_fraction:
+kv = self._estimate_kv_cache_bytes(
+    effective_ctx,
+    cache_type_kv,
+    n_parallel = n_parallel,
+)
+total_mib = (model_size + kv) / (1024 * 1024)
+for n_gpus in range(1, len(ranked) + 1):
+    subset = ranked[:n_gpus]
+    pool_mib = sum(free for _, free in subset)
+    if total_mib <= pool_mib * pin_fraction:
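For reference, the hoisted form of the re-check loop can be written as a standalone function. This is a minimal sketch under stated assumptions: ``pick_gpus`` is a hypothetical name, the KV estimate is passed in as ``kv_bytes`` rather than computed by ``_estimate_kv_cache_bytes``, and returning the selected GPU indices is a guess at what the elided ``if`` body does.

```python
from typing import List, Optional, Tuple

MIB = 1024 * 1024


def pick_gpus(
    ranked: List[Tuple[int, int]],  # (gpu_index, free_mib), assumed pre-sorted
    model_size: int,                # model weights in bytes
    kv_bytes: int,                  # KV cache estimate in bytes for the fixed context
    pin_fraction: float = 0.95,
) -> Optional[List[int]]:
    """Return the smallest prefix of ranked GPUs that holds the model, else None."""
    # The context is fixed during the re-check, so the footprint is
    # loop-invariant and is computed once, outside the loop.
    total_mib = (model_size + kv_bytes) / MIB
    for n_gpus in range(1, len(ranked) + 1):
        subset = ranked[:n_gpus]
        pool_mib = sum(free for _, free in subset)
        if total_mib <= pool_mib * pin_fraction:
            # Assumed behaviour: pin to this subset.
            return [idx for idx, _ in subset]
    return None  # caller would fall back to --fit on


print(pick_gpus([(0, 22805)], 21299 * MIB, 102 * MIB))                # [0]
print(pick_gpus([(0, 12000), (1, 12000)], 21000 * MIB, 400 * MIB))    # [0, 1]
print(pick_gpus([(0, 8000)], 21000 * MIB, 400 * MIB))                 # None
```

Only ``pool_mib`` changes between iterations, which is why the KV estimate and ``total_mib`` can move out of the loop without changing the result.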
    for n_gpus in range(1, len(ranked) + 1):
        subset = ranked[:n_gpus]
        pool_mib = sum(free for _, free in subset)
        kv = inst._estimate_kv_cache_bytes(effective_ctx, cache_type_kv)
        total_mib = (model_size + kv) / (1024 * 1024)
        if total_mib <= pool_mib * pin_fraction:
Similar to the implementation in llama_cpp.py, the KV cache estimation and total_mib calculation can be moved outside the loop since effective_ctx is constant here.
-for n_gpus in range(1, len(ranked) + 1):
-    subset = ranked[:n_gpus]
-    pool_mib = sum(free for _, free in subset)
-    kv = inst._estimate_kv_cache_bytes(effective_ctx, cache_type_kv)
-    total_mib = (model_size + kv) / (1024 * 1024)
-    if total_mib <= pool_mib * pin_fraction:
+if effective_ctx > 0:
+    kv = inst._estimate_kv_cache_bytes(effective_ctx, cache_type_kv)
+    total_mib = (model_size + kv) / (1024 * 1024)
+    for n_gpus in range(1, len(ranked) + 1):
+        subset = ranked[:n_gpus]
+        pool_mib = sum(free for _, free in subset)
+        if total_mib <= pool_mib * pin_fraction:
e128c6f to 1555c15
Staging mirror of unslothai#5323
Original PR: unslothai#5323
Author: danielhanchen
This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.
Original description
Summary
Two related runtime-side fixes for unslothai#5106 ("model loaded fully on RAM instead of VRAM"). Companion to unslothai#5322, which fixes the Windows install-time half (paired cudart bundle).
1. Bump GPU pin threshold from 0.90 to 0.95
``_select_gpus`` and the auto-ctx pin loop in ``start_llama_server`` used a ``pool * 0.90`` threshold to decide whether the model fits on GPU. Models that need 91-94% of free VRAM were classified as "does not fit", so Studio set ``gpu_indices = None`` and emitted ``--fit on`` to llama-server without ``-ngl``.

The unsloth llama.cpp fork's ``--fit on`` then ran with its default ``--fit-target 1024`` (1 GiB margin per device, an upstream default inherited from ggml-org/llama.cpp#18679). On a tight fit where compute buffers + CUDA context push the projected free VRAM below the 1 GiB target, the fork's fit logic shaves layer weights off the GPU. The result is slow inference for users whose models would have loaded comfortably with ``-ngl -1``.

The reproducer from noahterbest in unslothai#5106:

    GGUF size: 20.8 GB, est. KV cache: 0.1 GB, context: 4096, GPUs free: [(0, 22805)], selected: None, fit: True

20.8 GiB on a 22.27 GiB free RTX 4090 is 94% utilization. The model fits (1.4 GiB headroom), but the 0.90 threshold kicks it to fit mode. With this change the same case stays in the fits-on-GPU branch and Studio emits ``-ngl -1`` directly.

The auto-ctx fallback also re-checks fit at 4096 before handing off to ``--fit on``: a 20.8 GiB model with a 131072 native context fails the auto loop at native ctx, falls back to ``min(4096, ctx)``, but its weights + 4096 KV pin to the GPU comfortably. Without the re-check we still emitted ``--fit on``.

``_fit_context_to_vram``'s 0.90 budget for context binary search is intentionally left tighter than the pin fraction. That routine chooses the slider value, where over-promising would OOM at runtime. ``_select_gpus`` decides whether to pin at all, where being conservative pushes layers to CPU.

2. Warn on silent CPU fallback after load

After ``_wait_for_health`` succeeds, scan llama-server's stdout for ``model buffer size`` lines. If Studio detected GPUs and intended GPU use but only CPU buffers were allocated, log a structured warning that cites unslothai#5106. Markers cover CUDA / ROCm / Metal / Vulkan / OpenCL / SYCL backends. New ``_gpu_offload_active: Optional[bool]`` field surfaces the result for any future API consumer.

This catches runtime-load failures the install-time fix in unslothai#5322 cannot cover: user overriding ``--fit-target``, uncommon driver + toolkit configurations, future regressions in the install path.

Test plan

* ``python -m pytest studio/backend/tests/test_llama_cpp_context_fit.py`` (25 passed: 15 baseline + 10 new)
* ``python -m pytest studio/backend/tests/test_llama_cpp_max_context_threshold.py studio/backend/tests/test_llama_server_args.py studio/backend/tests/test_kv_cache_estimation.py`` (all green)

This PR tracks the moving review branch (pr-5323-head). Iteration fix commits land here directly. Review-added tests are in a separate PR.
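The tri-state result described in section 2 (True / False / None) could be computed roughly as follows. This is a sketch only, not the PR's implementation: ``classify_gpu_offload`` and ``GPU_MARKERS`` are hypothetical names, the marker strings are assumed from the backend list above, and the sample log lines are illustrative rather than captured output.

```python
from typing import Iterable, Optional

# Assumed marker substrings, taken from the backend names listed in the
# description; the actual marker list in llama_cpp.py is not shown here.
GPU_MARKERS = ("cuda", "rocm", "metal", "vulkan", "opencl", "sycl")


def classify_gpu_offload(
    log_lines: Iterable[str], gpus_detected: bool
) -> Optional[bool]:
    """True: a GPU buffer was allocated. False: only CPU buffers. None: unknown."""
    if not gpus_detected:
        return None  # no GPU expected, so nothing to warn about
    saw_buffer_line = False
    for line in log_lines:
        if "model buffer size" not in line:
            continue
        saw_buffer_line = True
        # The text before the phrase names the buffer's backend.
        backend = line.split("model buffer size", 1)[0].lower()
        if any(marker in backend for marker in GPU_MARKERS):
            return True
    # Buffer lines were present but none named a GPU backend.
    return False if saw_buffer_line else None


# Illustrative log lines (format assumed).
gpu_log = ["load_tensors:        CUDA0 model buffer size = 20000.00 MiB"]
cpu_log = ["load_tensors:   CPU_Mapped model buffer size = 20000.00 MiB"]

print(classify_gpu_offload(gpu_log, gpus_detected=True))
print(classify_gpu_offload(cpu_log, gpus_detected=True))
print(classify_gpu_offload([], gpus_detected=True))
```

Returning None when no buffer lines are seen keeps the warning from firing on log formats the scan does not recognise.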
Changed files:
.github/workflows/studio-backend-ci.yml
.github/workflows/studio-frontend-ci.yml
.github/workflows/studio-inference-smoke.yml
.github/workflows/studio-tauri-smoke.yml
.github/workflows/wheel-smoke.yml
studio/backend/core/inference/llama_cpp.py
studio/backend/tests/test_llama_cpp_context_fit.py