Studio: pin GPU at 95% headroom and warn on silent CPU fallback by danielhanchen · Pull Request #76 · unslothai/unsloth-staging-1

danielhanchen · 2026-05-10T12:43:07Z

Staging mirror of unslothai#5323

Original PR: unslothai#5323
Author: danielhanchen

This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.

Original description

Summary

Two related runtime-side fixes for unslothai#5106 ("model loaded fully on RAM instead of VRAM"). Companion to unslothai#5322, which fixes the Windows install-time half (paired cudart bundle).

1. Bump GPU pin threshold from 0.90 to 0.95

_select_gpus and the auto-ctx pin loop in start_llama_server used a pool * 0.90 threshold to decide whether the model fits on GPU. Models that need 91-94% of free VRAM were classified as "does not fit", so Studio set gpu_indices = None and emitted --fit on to llama-server without -ngl.

The unsloth llama.cpp fork's --fit on then ran with its default --fit-target 1024 (1 GiB margin per device, an upstream default inherited from ggml-org/llama.cpp#18679). On a tight fit where compute buffers + CUDA context push the projected free VRAM below the 1 GiB target, the fork's fit logic shaves layer weights off the GPU. The result is slow inference for users whose models would have loaded comfortably with -ngl -1.

The reproducer from noahterbest in unslothai#5106:

GGUF size: 20.8 GB, est. KV cache: 0.1 GB, context: 4096,
GPUs free: [(0, 22805)], selected: None, fit: True

20.8 GiB on a 22.27 GiB free RTX 4090 is 94% utilization. The model fits (1.4 GiB headroom), but the 0.90 threshold kicks it to fit mode. With this change the same case stays in the fits-on-GPU branch and Studio emits -ngl -1 directly.
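The arithmetic behind the reproducer can be checked with a short sketch. The numbers come from the log above; the helper `fits_on_gpu` is a hypothetical stand-in for the comparison described in this section, not the actual Studio code.

```python
# Values from the reproducer log; the helper name and shape are assumptions.
MODEL_MIB = 20.8 * 1024   # "GGUF size: 20.8 GB", read as GiB as in the text above
KV_MIB = 0.1 * 1024       # "est. KV cache: 0.1 GB"
FREE_MIB = 22805          # "GPUs free: [(0, 22805)]"


def fits_on_gpu(total_mib: float, pool_mib: float, pin_fraction: float) -> bool:
    """True when model + KV cache fits inside the pinned share of free VRAM."""
    return total_mib <= pool_mib * pin_fraction


total_mib = MODEL_MIB + KV_MIB
utilization = total_mib / FREE_MIB
old_fits = fits_on_gpu(total_mib, FREE_MIB, 0.90)  # old threshold
new_fits = fits_on_gpu(total_mib, FREE_MIB, 0.95)  # threshold after this PR

print(f"utilization={utilization:.3f} old={old_fits} new={new_fits}")
```

With the KV cache included, utilization is roughly 0.94, which sits between the two thresholds: the 0.90 check rejects the model and the 0.95 check accepts it.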

The auto-ctx fallback also re-checks fit at 4096 before handing off to --fit on: a 20.8 GiB model with a 131072 native context fails the auto loop at native ctx, falls back to min(4096, ctx), but its weights + 4096 KV pin to the GPU comfortably. Without the re-check we still emitted --fit on.
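A simplified sketch of that fallback flow follows. The names `pick_gpu_subset` and `plan_load`, the linear KV estimator, and the 25 KiB-per-token figure are all illustrative assumptions; the real logic lives in start_llama_server and uses _estimate_kv_cache_bytes.

```python
from typing import Callable, List, Optional, Tuple

MIB = 1024 * 1024


def pick_gpu_subset(
    model_bytes: int,
    kv_bytes: int,
    ranked: List[Tuple[int, int]],  # (gpu_index, free_mib), most free first
    pin_fraction: float = 0.95,
) -> Optional[List[int]]:
    """Smallest prefix of ranked GPUs whose pooled free VRAM holds model + KV."""
    total_mib = (model_bytes + kv_bytes) / MIB  # does not depend on the subset
    for n_gpus in range(1, len(ranked) + 1):
        subset = ranked[:n_gpus]
        pool_mib = sum(free for _, free in subset)
        if total_mib <= pool_mib * pin_fraction:
            return [idx for idx, _ in subset]
    return None


def plan_load(
    model_bytes: int,
    native_ctx: int,
    ranked: List[Tuple[int, int]],
    estimate_kv: Callable[[int], int],
) -> Tuple[Optional[List[int]], int]:
    """Try the native context first, then re-check at min(4096, ctx)."""
    gpus = pick_gpu_subset(model_bytes, estimate_kv(native_ctx), ranked)
    if gpus is not None:
        return gpus, native_ctx
    effective_ctx = min(4096, native_ctx)
    gpus = pick_gpu_subset(model_bytes, estimate_kv(effective_ctx), ranked)
    # gpus is None here only when even the small context does not fit;
    # that is the case that would be handed off to --fit on.
    return gpus, effective_ctx


# Hypothetical estimator: 25 KiB of KV cache per token (about 0.1 GiB at 4096).
kv_estimate = lambda ctx: ctx * 25 * 1024
ranked_gpus = [(0, 22805)]
model_bytes = int(20.8 * 1024 * MIB)

plan = plan_load(model_bytes, 131072, ranked_gpus, kv_estimate)
print(plan)
```

Under these assumptions the native 131072 context overflows the 0.95 budget, while the re-check at 4096 pins the model to GPU 0 instead of deferring to --fit on.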

_fit_context_to_vram's 0.90 budget for context binary search is intentionally left tighter than the pin fraction. That routine chooses the slider value, where over-promising would OOM at runtime. _select_gpus decides whether to pin at all, where being conservative pushes layers to CPU.
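To illustrate why the context search keeps the tighter budget, here is a minimal binary-search sketch. It is not the body of _fit_context_to_vram; the function name `fit_context`, its parameters, and the linear KV estimator are assumptions, and the search relies on KV size growing monotonically with context.

```python
from typing import Callable


def fit_context(
    model_mib: float,
    free_mib: float,
    max_ctx: int,
    kv_mib_for: Callable[[int], float],  # hypothetical KV estimator, in MiB
    budget_fraction: float = 0.90,
) -> int:
    """Largest context in [0, max_ctx] whose model + KV stays within budget."""
    budget = free_mib * budget_fraction
    lo, hi = 0, max_ctx
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if model_mib + kv_mib_for(mid) <= budget:
            lo = mid       # mid fits, search higher
        else:
            hi = mid - 1   # mid overflows, search lower
    return lo  # 0 means no context fits within the budget


# Illustrative numbers: a 16000 MiB model, 22805 MiB free, 25 KiB KV per token.
kv_linear = lambda ctx: ctx * 25 / 1024
best_ctx = fit_context(16000, 22805, 262144, kv_linear)
print(best_ctx)
```

The returned value is what a slider could safely advertise, so an overly generous budget here would surface as a runtime OOM rather than as a slower load.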

2. Warn on silent CPU fallback after load

After _wait_for_health succeeds, scan llama-server's stdout for model buffer size lines. If Studio detected GPUs and intended GPU use but only CPU buffers were allocated, log a structured warning that cites unslothai#5106. Markers cover CUDA / ROCm / Metal / Vulkan / OpenCL / SYCL backends. New _gpu_offload_active: Optional[bool] field surfaces the result for any future API consumer.

This catches runtime-load failures the install-time fix in unslothai#5322 cannot cover: user overriding --fit-target, uncommon driver + toolkit configurations, future regressions in the install path.
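The three-way result (True / False / None) can be sketched as below. This is an assumption-laden illustration: the marker strings, the regular expression, and the function name `classify_gpu_offload` are hypothetical, and the sample log lines only imitate the general shape of llama-server's model buffer size output.

```python
import re
from typing import Iterable, Optional

# Assumed buffer-name prefixes; the PR names the backends, not the exact strings.
GPU_MARKERS = ("CUDA", "ROCm", "Metal", "MTL", "Vulkan", "OpenCL", "SYCL")
BUFFER_RE = re.compile(r"(\S+)\s+model buffer size")


def classify_gpu_offload(log_lines: Iterable[str], gpu_expected: bool) -> Optional[bool]:
    """True: a GPU buffer was seen. False: only CPU buffers. None: cannot tell."""
    if not gpu_expected:
        return None  # no GPU detected or intended, so nothing to warn about
    saw_buffer = False
    for line in log_lines:
        match = BUFFER_RE.search(line)
        if match is None:
            continue
        saw_buffer = True
        if match.group(1).startswith(GPU_MARKERS):
            return True
    return False if saw_buffer else None


# Illustrative log lines, not captured from a real run.
gpu_log = [
    "load_tensors:   CPU_Mapped model buffer size =   500.00 MiB",
    "load_tensors:        CUDA0 model buffer size = 20000.00 MiB",
]
cpu_log = ["load_tensors:   CPU_Mapped model buffer size = 20500.00 MiB"]

print(classify_gpu_offload(gpu_log, True))
print(classify_gpu_offload(cpu_log, True))
```

Only the False case would trigger the warning; None keeps the check silent when the log carries no buffer lines or when GPU use was never expected.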

Test plan

python -m pytest studio/backend/tests/test_llama_cpp_context_fit.py (25 passed: 15 baseline + 10 new)
python -m pytest studio/backend/tests/test_llama_cpp_max_context_threshold.py studio/backend/tests/test_llama_server_args.py studio/backend/tests/test_kv_cache_estimation.py (all green)
Validate on an RTX 4090 (24 GB) with a ~20 GB model: GP

This PR tracks the moving review branch (pr-5323-head). Iteration fix commits land here directly. Review-added tests are in a separate PR.

Changed files:

.github/workflows/studio-backend-ci.yml
.github/workflows/studio-frontend-ci.yml
.github/workflows/studio-inference-smoke.yml
.github/workflows/studio-tauri-smoke.yml
.github/workflows/wheel-smoke.yml
studio/backend/core/inference/llama_cpp.py
studio/backend/tests/test_llama_cpp_context_fit.py

Two related runtime-side fixes for unslothai#5106 ("model loaded fully on RAM instead of VRAM"):

1. GPU pin threshold bump 0.90 -> 0.95
--------------------------------------

``_select_gpus`` and the auto-ctx pin loop in ``start_llama_server`` used a ``pool * 0.90`` threshold to decide whether the model fits on GPU. Models that needed 91-94% of free VRAM were classified as "does not fit", so Studio set ``gpu_indices = None`` and shipped ``--fit on`` to llama-server without ``-ngl``.

The unsloth llama.cpp fork's ``--fit on`` then ran with its default ``--fit-target 1024`` (1 GiB margin per device, an upstream default inherited from ggml-org#18679). On a tight fit where compute buffers + CUDA context push the projected free below the 1 GiB target, the fork's fit logic shaves layer weights off the GPU -- slow inference for users whose models would have loaded comfortably with ``-ngl -1``.

The classic reproducer from unslothai#5106 (noahterbest's log):

    GGUF size: 20.8 GB, est. KV cache: 0.1 GB, context: 4096,
    GPUs free: [(0, 22805)], selected: None, fit: True

20.8 GiB on a 22.27 GiB free RTX 4090 is 94% utilization. The model fits (1.4 GiB headroom), but the 0.90 threshold kicks it to fit mode. Bumping to 0.95 keeps these in the fits-on-GPU branch and emits ``-ngl -1`` directly. The fork's ``--fit on`` still serves as the safety net for the genuinely-too-large case.

The auto-ctx fallback also re-checks fit at 4096 before handing off to ``--fit on``: a 20.8 GiB model with a 131072 native context fails the auto loop at native ctx, falls back to ``min(4096, ctx)``, but its weights + 4096 KV pin to the GPU comfortably. Without the re-check we still emitted ``--fit on``.

``_fit_context_to_vram``'s 0.90 budget for context binary search is intentionally left tighter than the pin fraction. That routine chooses the slider value, where over-promising would OOM at runtime. ``_select_gpus`` decides whether to pin at all, where being conservative pushes layers to CPU.

2. Belt-and-suspenders: warn on silent CPU fallback
---------------------------------------------------

After ``_wait_for_health`` succeeds, scan llama-server's stdout for ``model buffer size`` lines. If Studio detected GPUs and intended GPU use but only CPU buffers were allocated, log a structured warning citing unslothai#5106. Markers cover CUDA / ROCm / Metal / Vulkan / OpenCL / SYCL backends. New ``_gpu_offload_active: Optional[bool]`` field surfaces the result for any future API consumer.

This catches runtime-load failures the install-time fix cannot cover (cudart bundle pairing PR unslothai#5322 is the install-side companion): user overriding ``--fit-target``, uncommon driver + toolkit configurations, future regressions in the install path.

Tests: 10 new cases in studio/backend/tests/test_llama_cpp_context_fit.py:

* TestTightFitPinsToGPU x3: noahterbest's exact reproducer (auto and explicit ctx pins to GPU at 94%); guard against threshold over-broadening (genuine overflow still falls back to ``--fit on``).
* TestClassifyGpuOffload x7: CUDA / ROCm / Metal buffer markers return True; CPU-only buffer lines return False; absent buffer lines or no GPUs detected return None (no warning).

25 context-fit tests pass (15 baseline + 10 new). 511 tests total across the affected test files. No regressions.

Refs unslothai#5106


danielhanchen · 2026-05-10T12:45:32Z

/gemini review

gemini-code-assist

Code Review

This pull request increases the VRAM utilization threshold for pinning models to the GPU from 90% to 95% and introduces a fallback mechanism that attempts to fit models with a reduced context length (4096) before deferring to CPU offloading. It also adds a diagnostic feature to detect and warn about silent CPU fallbacks by parsing server logs for GPU buffer markers. The review feedback identifies opportunities to improve efficiency by moving constant calculations out of loops in both the core logic and the test suite.

gemini-code-assist · 2026-05-10T12:47:44Z

+                                for n_gpus in range(1, len(ranked) + 1):
+                                    subset = ranked[:n_gpus]
+                                    pool_mib = sum(free for _, free in subset)
+                                    kv = self._estimate_kv_cache_bytes(
+                                        effective_ctx,
+                                        cache_type_kv,
+                                        n_parallel = n_parallel,
+                                    )
+                                    total_mib = (model_size + kv) / (1024 * 1024)
+                                    if total_mib <= pool_mib * pin_fraction:


In this re-check loop, effective_ctx is fixed (either 4096 or its original value if smaller). Consequently, the KV cache estimation and the resulting total_mib are constant across all iterations of the GPU subset loop. Moving these calculations outside the loop improves efficiency slightly.

Suggested change

-                                for n_gpus in range(1, len(ranked) + 1):
-                                    subset = ranked[:n_gpus]
-                                    pool_mib = sum(free for _, free in subset)
-                                    kv = self._estimate_kv_cache_bytes(
-                                        effective_ctx,
-                                        cache_type_kv,
-                                        n_parallel = n_parallel,
-                                    )
-                                    total_mib = (model_size + kv) / (1024 * 1024)
-                                    if total_mib <= pool_mib * pin_fraction:
+                                kv = self._estimate_kv_cache_bytes(
+                                    effective_ctx,
+                                    cache_type_kv,
+                                    n_parallel = n_parallel,
+                                )
+                                total_mib = (model_size + kv) / (1024 * 1024)
+                                for n_gpus in range(1, len(ranked) + 1):
+                                    subset = ranked[:n_gpus]
+                                    pool_mib = sum(free for _, free in subset)
+                                    if total_mib <= pool_mib * pin_fraction:

gemini-code-assist · 2026-05-10T12:47:45Z

+                    for n_gpus in range(1, len(ranked) + 1):
+                        subset = ranked[:n_gpus]
+                        pool_mib = sum(free for _, free in subset)
+                        kv = inst._estimate_kv_cache_bytes(effective_ctx, cache_type_kv)
+                        total_mib = (model_size + kv) / (1024 * 1024)
+                        if total_mib <= pool_mib * pin_fraction:


Similar to the implementation in llama_cpp.py, the KV cache estimation and total_mib calculation can be moved outside the loop since effective_ctx is constant here.

Suggested change

-                    for n_gpus in range(1, len(ranked) + 1):
-                        subset = ranked[:n_gpus]
-                        pool_mib = sum(free for _, free in subset)
-                        kv = inst._estimate_kv_cache_bytes(effective_ctx, cache_type_kv)
-                        total_mib = (model_size + kv) / (1024 * 1024)
-                        if total_mib <= pool_mib * pin_fraction:
+                    if effective_ctx > 0:
+                        kv = inst._estimate_kv_cache_bytes(effective_ctx, cache_type_kv)
+                        total_mib = (model_size + kv) / (1024 * 1024)
+                        for n_gpus in range(1, len(ranked) + 1):
+                            subset = ranked[:n_gpus]
+                            pool_mib = sum(free for _, free in subset)
+                            if total_mib <= pool_mib * pin_fraction:

danielhanchen and others added 4 commits May 7, 2026 10:39

[pre-commit.ci] auto fixes from pre-commit.com hooks

2e9bb84

for more information, see https://pre-commit.ci

Trim comments to be more succinct

c9d1ccc

Scrub .github/workflows for staging push (matches staging base)

e70c1cd

gemini-code-assist Bot reviewed May 10, 2026


danielhanchen force-pushed the main branch 3 times, most recently from e128c6f to 1555c15 Compare May 18, 2026 03:46

danielhanchen force-pushed the main branch from 9f47625 to b9dd7cf Compare June 7, 2026 10:11

danielhanchen wants to merge 4 commits into main from pr-5323-head