Studio: pin GPU at 95% headroom and warn on silent CPU fallback #76

Open
danielhanchen wants to merge 4 commits into main from pr-5323-head

Conversation

@danielhanchen

Member

Staging mirror of unslothai#5323

Original PR: unslothai#5323
Author: danielhanchen

This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.


Original description

Summary

Two related runtime-side fixes for unslothai#5106 ("model loaded fully on RAM instead of VRAM"). Companion to unslothai#5322, which fixes the Windows install-time half (paired cudart bundle).

1. Bump GPU pin threshold from 0.90 to 0.95

_select_gpus and the auto-ctx pin loop in start_llama_server used a pool * 0.90 threshold to decide whether the model fits on GPU. Models that need 91-94% of free VRAM were classified as "does not fit", so Studio set gpu_indices = None and emitted --fit on to llama-server without -ngl.

The unsloth llama.cpp fork's --fit on then ran with its default --fit-target 1024 (1 GiB margin per device, an upstream default inherited from ggml-org/llama.cpp#18679). On a tight fit where compute buffers + CUDA context push the projected free VRAM below the 1 GiB target, the fork's fit logic shaves layer weights off the GPU. The result is slow inference for users whose models would have loaded comfortably with -ngl -1.

The reproducer from noahterbest in unslothai#5106:

GGUF size: 20.8 GB, est. KV cache: 0.1 GB, context: 4096,
GPUs free: [(0, 22805)], selected: None, fit: True

20.8 GiB on an RTX 4090 with 22.27 GiB free is 94% utilization. The model fits (1.4 GiB headroom), but the 0.90 threshold kicks it to fit mode. With this change the same case stays in the fits-on-GPU branch and Studio emits -ngl -1 directly.
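
The arithmetic behind that reproducer can be checked with a minimal sketch. It assumes the log's "GB" figures are GiB (as the prose here treats them) and that the fit test compares weights plus KV cache against free VRAM times the pin fraction; `fits_on_gpu` is a hypothetical helper, not a function from Studio.

```python
MIB_PER_GIB = 1024


def fits_on_gpu(model_gib, kv_gib, free_mib, pin_fraction):
    """Return True when weights + KV cache fit within pin_fraction of free VRAM."""
    total_mib = (model_gib + kv_gib) * MIB_PER_GIB
    return total_mib <= free_mib * pin_fraction


# Values from the reproducer log, reading "GB" as GiB (an assumption).
model_gib, kv_gib, free_mib = 20.8, 0.1, 22805

utilization = (model_gib + kv_gib) * MIB_PER_GIB / free_mib
print(f"utilization: {utilization:.1%}")
print("fits at 0.90:", fits_on_gpu(model_gib, kv_gib, free_mib, 0.90))
print("fits at 0.95:", fits_on_gpu(model_gib, kv_gib, free_mib, 0.95))
```

Under these assumptions the old 0.90 budget is about 20524 MiB and rejects the roughly 21402 MiB request, while the 0.95 budget of about 21665 MiB accepts it.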

The auto-ctx fallback now also re-checks fit at 4096 before handing off to --fit on. For example, a 20.8 GiB model with a 131072 native context fails the auto loop at native ctx and falls back to min(4096, ctx), yet its weights plus the KV cache at 4096 fit on the GPU comfortably. Without the re-check, Studio still emitted --fit on.
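
A rough sketch of what such a re-check could look like is below. The function name `pin_at_fallback_ctx`, the `estimate_kv_bytes` callable, the return convention, and the 25 KiB-per-token KV cost are all hypothetical stand-ins for illustration; the actual logic lives inside `start_llama_server` and uses Studio's own KV estimator.

```python
from typing import Callable, List, Optional, Tuple


def pin_at_fallback_ctx(
    ranked: List[Tuple[int, int]],            # (gpu_index, free_mib), most free first
    model_size: int,                          # model weights in bytes
    ctx: int,                                 # context that failed the auto loop
    estimate_kv_bytes: Callable[[int], int],  # hypothetical KV cache estimator
    pin_fraction: float = 0.95,
    fallback_ctx: int = 4096,
) -> Optional[List[int]]:
    """Return GPU indices to pin at the fallback context, or None to defer to --fit on."""
    effective_ctx = min(fallback_ctx, ctx)
    # The context is fixed here, so the memory estimate is computed once.
    total_mib = (model_size + estimate_kv_bytes(effective_ctx)) / (1024 * 1024)
    for n_gpus in range(1, len(ranked) + 1):
        subset = ranked[:n_gpus]
        pool_mib = sum(free for _, free in subset)
        if total_mib <= pool_mib * pin_fraction:
            return [idx for idx, _ in subset]
    return None


def toy_kv(ctx: int) -> int:
    # Assumed cost of 25 KiB per token, roughly 0.1 GiB at 4096 tokens.
    return ctx * 25 * 1024


model_bytes = int(20.8 * 1024**3)
print(pin_at_fallback_ctx([(0, 22805)], model_bytes, 131072, toy_kv))
```

With these toy numbers the model does not fit at the 131072 native context but does fit at 4096, so the sketch returns GPU 0 instead of deferring to --fit on.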

_fit_context_to_vram's 0.90 budget for context binary search is intentionally left tighter than the pin fraction. That routine chooses the slider value, where over-promising would OOM at runtime. _select_gpus decides whether to pin at all, where being too conservative needlessly pushes layers to CPU.

2. Warn on silent CPU fallback after load

After _wait_for_health succeeds, scan llama-server's stdout for model buffer size lines. If Studio detected GPUs and intended GPU use but only CPU buffers were allocated, log a structured warning that cites unslothai#5106. Markers cover CUDA / ROCm / Metal / Vulkan / OpenCL / SYCL backends. A new _gpu_offload_active: Optional[bool] field surfaces the result for any future API consumer.
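
The three-state classification can be sketched roughly as follows. The function name `classify_gpu_offload`, the regex, the marker matching, and the sample log lines are assumptions made for illustration; they follow the backends listed above and the behavior described in the test plan, not the exact implementation in llama_cpp.py.

```python
import re
from typing import Iterable, Optional

# Assumed marker list, taken from the backends named in the description.
GPU_MARKERS = ("CUDA", "ROCm", "Metal", "Vulkan", "OpenCL", "SYCL")
# Captures the buffer name that precedes "model buffer size" on a log line.
BUFFER_LINE = re.compile(r"(\S+)\s+model buffer size")


def classify_gpu_offload(lines: Iterable[str], gpus_detected: bool) -> Optional[bool]:
    """True: a GPU buffer was allocated. False: only CPU buffers. None: unknown."""
    if not gpus_detected:
        return None  # GPU use was never intended, so there is nothing to warn about
    seen_buffer = False
    for line in lines:
        match = BUFFER_LINE.search(line)
        if match is None:
            continue
        seen_buffer = True
        name = match.group(1).lower()
        if any(marker.lower() in name for marker in GPU_MARKERS):
            return True
    return False if seen_buffer else None


# Illustrative log lines, not captured from a real run.
gpu_log = ["load_tensors:        CUDA0 model buffer size = 20000.00 MiB"]
cpu_log = ["load_tensors:   CPU_Mapped model buffer size = 20000.00 MiB"]
print(classify_gpu_offload(gpu_log, gpus_detected=True))
print(classify_gpu_offload(cpu_log, gpus_detected=True))
```

Only the False result would trigger the warning; None keeps the check silent when the log has no buffer lines or no GPU was expected.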

This catches runtime-load failures the install-time fix in unslothai#5322 cannot cover: a user overriding --fit-target, uncommon driver + toolkit configurations, and future regressions in the install path.

Test plan

  • python -m pytest studio/backend/tests/test_llama_cpp_context_fit.py (25 passed: 15 baseline + 10 new)
  • python -m pytest studio/backend/tests/test_llama_cpp_max_context_threshold.py studio/backend/tests/test_llama_server_args.py studio/backend/tests/test_kv_cache_estimation.py (all green)
  • Validate on an RTX 4090 (24 GB) with a ~20 GB model: GP

This PR tracks the moving review branch (pr-5323-head). Iteration fix commits land here directly. Review-added tests are in a separate PR.

Changed files:

  • .github/workflows/studio-backend-ci.yml
  • .github/workflows/studio-frontend-ci.yml
  • .github/workflows/studio-inference-smoke.yml
  • .github/workflows/studio-tauri-smoke.yml
  • .github/workflows/wheel-smoke.yml
  • studio/backend/core/inference/llama_cpp.py
  • studio/backend/tests/test_llama_cpp_context_fit.py

danielhanchen and others added 4 commits May 7, 2026 10:39
Two related runtime-side fixes for unslothai#5106 ("model
loaded fully on RAM instead of VRAM"):

1. GPU pin threshold bump 0.90 -> 0.95
-------------------------------------

``_select_gpus`` and the auto-ctx pin loop in ``start_llama_server``
used a ``pool * 0.90`` threshold to decide whether the model fits on
GPU. Models that needed 91-94% of free VRAM were classified as "does
not fit", so Studio set ``gpu_indices = None`` and shipped
``--fit on`` to llama-server without ``-ngl``. The unsloth
llama.cpp fork's ``--fit on`` then ran with its default
``--fit-target 1024`` (1 GiB margin per device, an upstream default
inherited from ggml-org#18679). On a tight fit where compute
buffers + CUDA context push the projected free below the 1 GiB
target, the fork's fit logic shaves layer weights off the GPU --
slow inference for users whose models would have loaded comfortably
with ``-ngl -1``.

The classic reproducer from unslothai#5106 (noahterbest's log):

    GGUF size: 20.8 GB, est. KV cache: 0.1 GB, context: 4096,
    GPUs free: [(0, 22805)], selected: None, fit: True

20.8 GiB on a 22.27 GiB free RTX 4090 is 94% utilization. The model
fits (1.4 GiB headroom), but the 0.90 threshold kicks it to fit
mode. Bumping to 0.95 keeps these in the fits-on-GPU branch and
emits ``-ngl -1`` directly. The fork's ``--fit on`` still serves as
the safety net for the genuinely-too-large case.

The auto-ctx fallback also re-checks fit at 4096 before handing off
to ``--fit on``: a 20.8 GiB model with a 131072 native context fails
the auto loop at native ctx, falls back to ``min(4096, ctx)``, but
its weights + 4096 KV pin to the GPU comfortably. Without the
re-check we still emitted ``--fit on``.

``_fit_context_to_vram``'s 0.90 budget for context binary search is
intentionally left tighter than the pin fraction. That routine
chooses the slider value, where over-promising would OOM at runtime.
``_select_gpus`` decides whether to pin at all, where being
conservative pushes layers to CPU.

2. Belt-and-suspenders: warn on silent CPU fallback
---------------------------------------------------

After ``_wait_for_health`` succeeds, scan llama-server's stdout for
``model buffer size`` lines. If Studio detected GPUs and intended
GPU use but only CPU buffers were allocated, log a structured
warning citing unslothai#5106. Markers cover CUDA / ROCm / Metal / Vulkan /
OpenCL / SYCL backends. New ``_gpu_offload_active: Optional[bool]``
field surfaces the result for any future API consumer.

This catches runtime-load failures the install-time fix cannot
cover (cudart bundle pairing PR unslothai#5322 is the install-side
companion): user overriding ``--fit-target``, uncommon driver +
toolkit configurations, future regressions in the install path.

Tests: 10 new cases in studio/backend/tests/test_llama_cpp_context_fit.py:
* TestTightFitPinsToGPU x3: noahterbest's exact reproducer (auto and
  explicit ctx pins to GPU at 94%); guard against threshold over-
  broadening (genuine overflow still falls back to ``--fit on``).
* TestClassifyGpuOffload x7: CUDA / ROCm / Metal buffer markers
  return True; CPU-only buffer lines return False; absent buffer
  lines or no GPUs detected return None (no warning).

25 context-fit tests pass (15 baseline + 10 new). 511 tests total
across the affected test files. No regressions.

Refs unslothai#5106
@danielhanchen

Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request increases the VRAM utilization threshold for pinning models to the GPU from 90% to 95% and introduces a fallback mechanism that attempts to fit models with a reduced context length (4096) before deferring to CPU offloading. It also adds a diagnostic feature to detect and warn about silent CPU fallbacks by parsing server logs for GPU buffer markers. The review feedback identifies opportunities to improve efficiency by moving constant calculations out of loops in both the core logic and the test suite.

Comment on lines +2095 to +2104
    for n_gpus in range(1, len(ranked) + 1):
        subset = ranked[:n_gpus]
        pool_mib = sum(free for _, free in subset)
        kv = self._estimate_kv_cache_bytes(
            effective_ctx,
            cache_type_kv,
            n_parallel = n_parallel,
        )
        total_mib = (model_size + kv) / (1024 * 1024)
        if total_mib <= pool_mib * pin_fraction:


low

In this re-check loop, effective_ctx is fixed (either 4096 or its original value if smaller). Consequently, the KV cache estimation and the resulting total_mib are constant across all iterations of the GPU subset loop. Moving these calculations outside the loop improves efficiency slightly.

Suggested change
    - for n_gpus in range(1, len(ranked) + 1):
    -     subset = ranked[:n_gpus]
    -     pool_mib = sum(free for _, free in subset)
    -     kv = self._estimate_kv_cache_bytes(
    -         effective_ctx,
    -         cache_type_kv,
    -         n_parallel = n_parallel,
    -     )
    -     total_mib = (model_size + kv) / (1024 * 1024)
    -     if total_mib <= pool_mib * pin_fraction:
    + kv = self._estimate_kv_cache_bytes(
    +     effective_ctx,
    +     cache_type_kv,
    +     n_parallel = n_parallel,
    + )
    + total_mib = (model_size + kv) / (1024 * 1024)
    + for n_gpus in range(1, len(ranked) + 1):
    +     subset = ranked[:n_gpus]
    +     pool_mib = sum(free for _, free in subset)
    +     if total_mib <= pool_mib * pin_fraction:

Comment on lines +217 to +222
    for n_gpus in range(1, len(ranked) + 1):
        subset = ranked[:n_gpus]
        pool_mib = sum(free for _, free in subset)
        kv = inst._estimate_kv_cache_bytes(effective_ctx, cache_type_kv)
        total_mib = (model_size + kv) / (1024 * 1024)
        if total_mib <= pool_mib * pin_fraction:


low

Similar to the implementation in llama_cpp.py, the KV cache estimation and total_mib calculation can be moved outside the loop since effective_ctx is constant here.

Suggested change
    - for n_gpus in range(1, len(ranked) + 1):
    -     subset = ranked[:n_gpus]
    -     pool_mib = sum(free for _, free in subset)
    -     kv = inst._estimate_kv_cache_bytes(effective_ctx, cache_type_kv)
    -     total_mib = (model_size + kv) / (1024 * 1024)
    -     if total_mib <= pool_mib * pin_fraction:
    + if effective_ctx > 0:
    +     kv = inst._estimate_kv_cache_bytes(effective_ctx, cache_type_kv)
    +     total_mib = (model_size + kv) / (1024 * 1024)
    +     for n_gpus in range(1, len(ranked) + 1):
    +         subset = ranked[:n_gpus]
    +         pool_mib = sum(free for _, free in subset)
    +         if total_mib <= pool_mib * pin_fraction:
