Studio: tools, thinking blocks, code execution and web search for safetensors by danielhanchen · Pull Request #84 · unslothai/unsloth-staging-1

danielhanchen · 2026-05-18T03:46:27Z

Staging mirror of unslothai#5520

Original PR: unslothai#5520
Author: danielhanchen

This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.

Original description

Summary

Today only the GGUF (llama-server) backend in Studio streams tool calls, thinking blocks, sandboxed Python/Bash execution and web search through the agentic loop. The transformers/safetensors backend is inference-only: no tools, no template-level reasoning controls, no agentic loop. This PR brings safetensors to feature parity for non-vision text chat while leaving the GGUF path untouched.

What changes

Backend:

core/inference/tool_call_parser.py (new) -- backend-neutral parse_tool_calls_from_text, strip_tool_markup, has_tool_signal and the shared regex set. LlamaCppBackend._parse_tool_calls_from_text delegates here so both backends fix-forward together.
core/inference/safetensors_agentic.py (new) -- cumulative-text agentic loop with a 3-state buffer (BUFFERING / STREAMING / DRAINING). Emits the same status / content / tool_start / tool_end / metadata events as the GGUF path so the frontend renders both backends identically. Handles duplicate-call short-circuit, __IMAGES__ sentinel stripping before model feedback, error nudge, cancel_event, max_tool_iterations cap and a final-answer attempt.
core/inference/inference.py -- generate_chat_response now accepts tools / enable_thinking / reasoning_effort / preserve_thinking. New _apply_chat_template_for_generation peels unsupported kwargs off the template call in safe order (richest first) so older chat templates still render. New generate_chat_completion_with_tools wraps the agentic loop.
core/inference/orchestrator.py -- forwards the new kwargs through both IPC paths (gen and dispatched); new generate_chat_completion_with_tools drives the loop from the parent process so tools run alongside the existing route-layer plumbing.
core/inference/worker.py -- pulls tools / enable_thinking / reasoning_effort / preserve_thinking from the cmd dict when present and forwards them to backend.generate_chat_response.
routes/inference.py -- shared _detect_safetensors_features helper that calls the existing detect_reasoning_flags on the loaded tokenizer template so /load, the already_loaded branch and /status all advertise the same flags GGUF does. New safetensors tool-calling SSE branch in POST /chat/completions mirrors the GGUF flow (system prompt nudge, tool subset filtering, stale-XML scrubbing of prior assistant turns). gpt-oss is intentionally gated out of the safetensors tool path because Harmony uses a dedicated channel for tool calls rather than <tool_call> XML; GGUF still serves that case.

Tests:

tests/test_safetensors_tool_loop.py -- 22 tests covering parser shapes (closed/unclosed JSON, <function=...> XML, embedded </parameter> in code, multiple calls, bad JSON), agentic-loop flow (plain answers, single tool then answer, truncated unclosed call, JSON-string arguments healed to {\"query\": ...}), behaviour (duplicate-call short-circuit, image-sentinel survival, t

This PR tracks the moving review branch (pr-5520-head). Iteration fix commits land here directly. Review-added tests are in a separate PR.

Changed files:

.github/workflows/consolidated-tests-ci.yml
.github/workflows/lint-ci.yml
.github/workflows/mlx-ci.yml
.github/workflows/notebooks-ci.yml
.github/workflows/release-desktop.yml
.github/workflows/security-audit.yml
.github/workflows/stale.yml
.github/workflows/studio-api-smoke.yml
.github/workflows/studio-backend-ci.yml
.github/workflows/studio-frontend-ci.yml
.github/workflows/studio-inference-smoke.yml
.github/workflows/studio-mac-api-smoke.yml
.github/workflows/studio-mac-inference-smoke.yml
.github/workflows/studio-mac-ui-smoke.yml
.github/workflows/studio-mac-update-smoke.yml
.github/workflows/studio-tauri-smoke.yml
.github/workflows/studio-ui-smoke.yml
.github/workflows/studio-update-smoke.yml
.github/workflows/studio-windows-api-smoke.yml
.github/workflows/studio-windows-inference-smoke.yml
.github/workflows/studio-windows-ui-smoke.yml
.github/workflows/studio-windows-update-smoke.yml
.github/workflows/version-compat-ci.yml
.github/workflows/wheel-smoke.yml
studio/backend/core/inference/chat_template_helpers.py
studio/backend/core/inference/inference.py
studio/backend/core/inference/llama_cpp.py
studio/backend/core/inference/orchestrator.py
studio/backend/core/inference/safetensors_agentic.py
studio/backend/core/inference/tool_call_parser.py

The GGUF/llama-server backend already streams tool_start/tool_end events, strips <tool_call> XML, parses <think> blocks, and runs an agentic loop through web_search / python / terminal. The transformers/safetensors backend was inference-only: no tools, no template-level reasoning controls, no agentic loop. This change brings safetensors to parity for non-vision text chat while leaving the GGUF path untouched. Backend changes: - core/inference/tool_call_parser.py (new): backend-neutral parse_tool_calls_from_text, strip_tool_markup, has_tool_signal, and shared regex/strip patterns. LlamaCppBackend._parse_tool_calls_from_text delegates here, so both paths fix-forward together. - core/inference/safetensors_agentic.py (new): cumulative-text agentic loop with a 3-state buffer (BUFFERING, STREAMING, DRAINING). Yields the same status / content / tool_start / tool_end / metadata events the GGUF path already emits. Handles duplicate-call short-circuit, __IMAGES__ sentinel stripping before model feedback, error-prefix tagging, cancel_event, and max_tool_iterations capping. - core/inference/inference.py: generate_chat_response now accepts tools / enable_thinking / reasoning_effort / preserve_thinking; _apply_chat_template_for_generation peels unsupported kwargs off the template call in safe order (richest first). New generate_chat_completion_with_tools method wraps the agentic loop. - core/inference/orchestrator.py: forwards the new kwargs through IPC (gen + dispatched paths); adds generate_chat_completion_with_tools that drives the loop from the parent process. - core/inference/worker.py: pulls tools/enable_thinking/reasoning_effort/ preserve_thinking from the cmd dict when present and forwards to backend.generate_chat_response. - routes/inference.py: shared _detect_safetensors_features helper that calls detect_reasoning_flags on the loaded tokenizer template so the load/already_loaded/status endpoints all advertise the same flags GGUF does. New safetensors tool-calling SSE branch in POST /chat/completions that mirrors the GGUF flow (system prompt nudge, tool subset filtering, stale-XML scrubbing of prior assistant turns). gpt-oss is gated out of the safetensors tool path because Harmony uses a dedicated channel for tool calls rather than <tool_call> XML; GGUF still serves that case. Tests: - tests/test_safetensors_tool_loop.py: 22 tests covering parser shapes (closed/unclosed JSON, function/parameter XML, embedded </parameter> in code, multiple calls, bad JSON), agentic-loop control flow (plain answers, single tool then answer, truncated unclosed call, JSON-string arguments healed to {"query": ...}), behaviour (duplicate-call short-circuit, image-sentinel survival, tool error nudge, raised exceptions caught), and control (cancel_event break, max_tool_iterations cap). Backwards compatibility: - LlamaCppBackend._parse_tool_calls_from_text keeps the same signature and behaviour. - All new IPC kwargs are optional and only added to the cmd dict when set, so older worker payloads are unaffected. - The SSE event protocol matches the existing GGUF tool stream so the frontend tool UI works unchanged.

for more information, see https://pre-commit.ci

…e helper Two follow-ups from the comprehensive simulation pass: 1. Bug: assistant prose containing the literal string "<tool_call>" was silently truncated. The STREAMING end-of-stream branch re-yielded the cumulative content with ``strip_tool_markup(..., final=True)`` whenever the parser found no real tool calls. ``final=True`` removes any trailing unclosed ``<tool_call>.*$`` run, which dropped legitimate prose mentioning the literal text (e.g. "the docs say <tool_call> means an LLM tool"). The streaming pass already emitted the cleaned cumulative content via partial strips, so the final re-yield was redundant and only ever hid real text. Drop it; the DRAINING-no-parse fallback now surfaces the raw content_accum instead of the final-stripped version. Adds regression tests covering both the prose case and the case where the tool RESULT text contains the literal "<tool_call>" (the loop must only parse model output, not tool results). 2. Extract _apply_chat_template_for_generation into core/inference/chat_template_helpers.apply_chat_template_for_generation so its kwarg-fallback chain (richest call first, peel off groups on TypeError, propagate real Jinja errors) can be unit-tested without pulling unsloth / torch / transformers into the sandbox. InferenceBackend's method becomes a thin delegate. Tests: - TestProseMentioningToolCall: two new tests for the truncation regression and the tool-result-text safety case. - TestChatTemplateHelper: five new tests for the helper's fallback chain across template-kwarg permutations and the Jinja-error propagate behaviour. All 29 tests in test_safetensors_tool_loop.py pass; the full related suite (202 tests across test_safetensors_tool_loop, test_openai_tool_ passthrough, test_responses_tool_passthrough, test_inference_model_ validation, test_anthropic_thinking_translation, test_anthropic_code_ execution, test_anthropic_messages) is green.

for more information, see https://pre-commit.ci

+                _sf_tracker.__exit__(None, None, None)
+
+        return StreamingResponse(
+            sf_tool_stream(),


+            ],
+            exec_results = ["result"],
+        )
+        events = _collect_events(loop)


+            ],
+            exec_results = ["..."],
+        )
+        events = _collect_events(loop)


+            ],
+            exec_results = ["Page text: <tool_call> appears here in the docs"],
+        )
+        events = _collect_events(loop)


+            flags["supports_reasoning"] = True
+            flags["reasoning_style"] = "reasoning_effort"
+            flags["supports_tools"] = False
+    except Exception:


+            parsed = json.loads(raw_args)
+            if isinstance(parsed, dict):
+                return parsed
+        except (json.JSONDecodeError, ValueError):


+                        tc["function"]["arguments"]
+                    )
+                tool_calls.append(tc)
+            except (json.JSONDecodeError, ValueError):


danielhanchen · 2026-05-18T03:56:57Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces agentic tool-calling and reasoning controls for the safetensors/transformers backend, providing feature parity with the GGUF implementation. The changes include a new backend-neutral tool-call parser, an agentic loop for cumulative text generators, and updates to the inference orchestrator and API routes to support these features. Additionally, a version flag was added to the CLI. Review feedback points out unreachable code in the chat template helper and suggests a more robust approach for duplicate tool call detection using sorted JSON serialization.

gemini-code-assist · 2026-05-18T03:59:07Z

+                "arguments": arguments,
+            }
+
+            tc_key = tool_name + str(arguments)


Using str(arguments) to generate a key for duplicate tool call detection is unreliable. In Python 3.7+, dictionaries preserve insertion order. This means two dictionaries with identical keys and values but different insertion orders (which can happen if the model outputs JSON keys in a different sequence) will produce different strings from str(), causing the duplicate detection to fail. Using json.dumps(arguments, sort_keys=True) ensures a stable, canonical key regardless of insertion order.

Suggested change

tc_key = tool_name + str(arguments)

tc_key = tool_name + json.dumps(arguments, sort_keys = True)

gemini-code-assist · 2026-05-18T03:59:07Z

+    return tokenizer.apply_chat_template(
+        messages, tokenize = False, add_generation_prompt = True
+    )


This block is unreachable. The attempts list (line 46) always includes an empty dictionary {} as its final element. If tokenizer.apply_chat_template succeeds with this empty dictionary, it returns from within the loop (line 51). If it fails with a TypeError, the error is caught and stored in last_exc, the loop terminates, and last_exc is raised at line 64. If it fails with any other Exception, the loop breaks and raises at line 64. Consequently, execution can never reach lines 65-67.

danielhanchen and others added 5 commits May 17, 2026 13:56

[pre-commit.ci] auto fixes from pre-commit.com hooks

6cc83da

for more information, see https://pre-commit.ci

[pre-commit.ci] auto fixes from pre-commit.com hooks

c0dee98

for more information, see https://pre-commit.ci

Scrub .github/workflows for staging push (matches staging base)

3d9b045

github-advanced-security AI found potential problems May 18, 2026

View reviewed changes

Comment thread studio/backend/routes/inference.py

_sf_tracker.__exit__(None, None, None)

return StreamingResponse(

sf_tool_stream(),

github-code-quality Bot found potential problems May 18, 2026

View reviewed changes

Merge origin/main into head

3aa8b63

gemini-code-assist Bot reviewed May 18, 2026

View reviewed changes

danielhanchen force-pushed the main branch from 9f47625 to b9dd7cf Compare June 7, 2026 10:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Studio: tools, thinking blocks, code execution and web search for safetensors#84

Studio: tools, thinking blocks, code execution and web search for safetensors#84
danielhanchen wants to merge 6 commits into
mainfrom
pr-5520-head

danielhanchen commented May 18, 2026

Uh oh!

danielhanchen commented May 18, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 18, 2026

Uh oh!

gemini-code-assist Bot May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

	tc_key = tool_name + str(arguments)
	tc_key = tool_name + json.dumps(arguments, sort_keys = True)

Conversation

danielhanchen commented May 18, 2026

Staging mirror of unslothai#5520

Original description

Summary

What changes

Uh oh!

danielhanchen commented May 18, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants