Studio: tools, thinking blocks, code execution and web search for safetensors#84
Studio: tools, thinking blocks, code execution and web search for safetensors#84danielhanchen wants to merge 6 commits into
Conversation
The GGUF/llama-server backend already streams tool_start/tool_end events,
strips <tool_call> XML, parses <think> blocks, and runs an agentic loop
through web_search / python / terminal. The transformers/safetensors
backend was inference-only: no tools, no template-level reasoning
controls, no agentic loop. This change brings safetensors to parity for
non-vision text chat while leaving the GGUF path untouched.
Backend changes:
- core/inference/tool_call_parser.py (new): backend-neutral
parse_tool_calls_from_text, strip_tool_markup, has_tool_signal, and
shared regex/strip patterns. LlamaCppBackend._parse_tool_calls_from_text
delegates here, so both paths fix-forward together.
- core/inference/safetensors_agentic.py (new): cumulative-text agentic
loop with a 3-state buffer (BUFFERING, STREAMING, DRAINING). Yields
the same status / content / tool_start / tool_end / metadata events
the GGUF path already emits. Handles duplicate-call short-circuit,
__IMAGES__ sentinel stripping before model feedback, error-prefix
tagging, cancel_event, and max_tool_iterations capping.
- core/inference/inference.py: generate_chat_response now accepts
tools / enable_thinking / reasoning_effort / preserve_thinking;
_apply_chat_template_for_generation peels unsupported kwargs off the
template call in safe order (richest first). New
generate_chat_completion_with_tools method wraps the agentic loop.
- core/inference/orchestrator.py: forwards the new kwargs through IPC
(gen + dispatched paths); adds generate_chat_completion_with_tools
that drives the loop from the parent process.
- core/inference/worker.py: pulls tools/enable_thinking/reasoning_effort/
preserve_thinking from the cmd dict when present and forwards to
backend.generate_chat_response.
- routes/inference.py: shared _detect_safetensors_features helper that
calls detect_reasoning_flags on the loaded tokenizer template so the
load/already_loaded/status endpoints all advertise the same flags
GGUF does. New safetensors tool-calling SSE branch in
POST /chat/completions that mirrors the GGUF flow (system prompt
nudge, tool subset filtering, stale-XML scrubbing of prior
assistant turns). gpt-oss is gated out of the safetensors tool path
because Harmony uses a dedicated channel for tool calls rather than
<tool_call> XML; GGUF still serves that case.
Tests:
- tests/test_safetensors_tool_loop.py: 22 tests covering parser
shapes (closed/unclosed JSON, function/parameter XML, embedded
</parameter> in code, multiple calls, bad JSON), agentic-loop
control flow (plain answers, single tool then answer, truncated
unclosed call, JSON-string arguments healed to {"query": ...}),
behaviour (duplicate-call short-circuit, image-sentinel survival,
tool error nudge, raised exceptions caught), and control
(cancel_event break, max_tool_iterations cap).
Backwards compatibility:
- LlamaCppBackend._parse_tool_calls_from_text keeps the same signature
and behaviour.
- All new IPC kwargs are optional and only added to the cmd dict when
set, so older worker payloads are unaffected.
- The SSE event protocol matches the existing GGUF tool stream so the
frontend tool UI works unchanged.
for more information, see https://pre-commit.ci
…e helper Two follow-ups from the comprehensive simulation pass: 1. Bug: assistant prose containing the literal string "<tool_call>" was silently truncated. The STREAMING end-of-stream branch re-yielded the cumulative content with ``strip_tool_markup(..., final=True)`` whenever the parser found no real tool calls. ``final=True`` removes any trailing unclosed ``<tool_call>.*$`` run, which dropped legitimate prose mentioning the literal text (e.g. "the docs say <tool_call> means an LLM tool"). The streaming pass already emitted the cleaned cumulative content via partial strips, so the final re-yield was redundant and only ever hid real text. Drop it; the DRAINING-no-parse fallback now surfaces the raw content_accum instead of the final-stripped version. Adds regression tests covering both the prose case and the case where the tool RESULT text contains the literal "<tool_call>" (the loop must only parse model output, not tool results). 2. Extract _apply_chat_template_for_generation into core/inference/chat_template_helpers.apply_chat_template_for_generation so its kwarg-fallback chain (richest call first, peel off groups on TypeError, propagate real Jinja errors) can be unit-tested without pulling unsloth / torch / transformers into the sandbox. InferenceBackend's method becomes a thin delegate. Tests: - TestProseMentioningToolCall: two new tests for the truncation regression and the tool-result-text safety case. - TestChatTemplateHelper: five new tests for the helper's fallback chain across template-kwarg permutations and the Jinja-error propagate behaviour. All 29 tests in test_safetensors_tool_loop.py pass; the full related suite (202 tests across test_safetensors_tool_loop, test_openai_tool_ passthrough, test_responses_tool_passthrough, test_inference_model_ validation, test_anthropic_thinking_translation, test_anthropic_code_ execution, test_anthropic_messages) is green.
for more information, see https://pre-commit.ci
| _sf_tracker.__exit__(None, None, None) | ||
|
|
||
| return StreamingResponse( | ||
| sf_tool_stream(), |
| ], | ||
| exec_results = ["result"], | ||
| ) | ||
| events = _collect_events(loop) |
| ], | ||
| exec_results = ["..."], | ||
| ) | ||
| events = _collect_events(loop) |
| ], | ||
| exec_results = ["Page text: <tool_call> appears here in the docs"], | ||
| ) | ||
| events = _collect_events(loop) |
| flags["supports_reasoning"] = True | ||
| flags["reasoning_style"] = "reasoning_effort" | ||
| flags["supports_tools"] = False | ||
| except Exception: |
| parsed = json.loads(raw_args) | ||
| if isinstance(parsed, dict): | ||
| return parsed | ||
| except (json.JSONDecodeError, ValueError): |
| tc["function"]["arguments"] | ||
| ) | ||
| tool_calls.append(tc) | ||
| except (json.JSONDecodeError, ValueError): |
|
/gemini review |
There was a problem hiding this comment.
Code Review
This pull request introduces agentic tool-calling and reasoning controls for the safetensors/transformers backend, providing feature parity with the GGUF implementation. The changes include a new backend-neutral tool-call parser, an agentic loop for cumulative text generators, and updates to the inference orchestrator and API routes to support these features. Additionally, a version flag was added to the CLI. Review feedback points out unreachable code in the chat template helper and suggests a more robust approach for duplicate tool call detection using sorted JSON serialization.
| "arguments": arguments, | ||
| } | ||
|
|
||
| tc_key = tool_name + str(arguments) |
There was a problem hiding this comment.
Using str(arguments) to generate a key for duplicate tool call detection is unreliable. In Python 3.7+, dictionaries preserve insertion order. This means two dictionaries with identical keys and values but different insertion orders (which can happen if the model outputs JSON keys in a different sequence) will produce different strings from str(), causing the duplicate detection to fail. Using json.dumps(arguments, sort_keys=True) ensures a stable, canonical key regardless of insertion order.
| tc_key = tool_name + str(arguments) | |
| tc_key = tool_name + json.dumps(arguments, sort_keys = True) |
| return tokenizer.apply_chat_template( | ||
| messages, tokenize = False, add_generation_prompt = True | ||
| ) |
There was a problem hiding this comment.
This block is unreachable. The attempts list (line 46) always includes an empty dictionary {} as its final element. If tokenizer.apply_chat_template succeeds with this empty dictionary, it returns from within the loop (line 51). If it fails with a TypeError, the error is caught and stored in last_exc, the loop terminates, and last_exc is raised at line 64. If it fails with any other Exception, the loop breaks and raises at line 64. Consequently, execution can never reach lines 65-67.
Staging mirror of unslothai#5520
Original PR: unslothai#5520
Author: danielhanchen
This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.
Original description
Summary
Today only the GGUF (llama-server) backend in Studio streams tool calls, thinking blocks, sandboxed Python/Bash execution and web search through the agentic loop. The transformers/safetensors backend is inference-only: no
tools, no template-level reasoning controls, no agentic loop. This PR brings safetensors to feature parity for non-vision text chat while leaving the GGUF path untouched.What changes
Backend:
core/inference/tool_call_parser.py(new) -- backend-neutralparse_tool_calls_from_text,strip_tool_markup,has_tool_signaland the shared regex set.LlamaCppBackend._parse_tool_calls_from_textdelegates here so both backends fix-forward together.core/inference/safetensors_agentic.py(new) -- cumulative-text agentic loop with a 3-state buffer (BUFFERING / STREAMING / DRAINING). Emits the samestatus/content/tool_start/tool_end/metadataevents as the GGUF path so the frontend renders both backends identically. Handles duplicate-call short-circuit,__IMAGES__sentinel stripping before model feedback, error nudge,cancel_event,max_tool_iterationscap and a final-answer attempt.core/inference/inference.py--generate_chat_responsenow acceptstools/enable_thinking/reasoning_effort/preserve_thinking. New_apply_chat_template_for_generationpeels unsupported kwargs off the template call in safe order (richest first) so older chat templates still render. Newgenerate_chat_completion_with_toolswraps the agentic loop.core/inference/orchestrator.py-- forwards the new kwargs through both IPC paths (gen and dispatched); newgenerate_chat_completion_with_toolsdrives the loop from the parent process so tools run alongside the existing route-layer plumbing.core/inference/worker.py-- pullstools/enable_thinking/reasoning_effort/preserve_thinkingfrom the cmd dict when present and forwards them tobackend.generate_chat_response.routes/inference.py-- shared_detect_safetensors_featureshelper that calls the existingdetect_reasoning_flagson the loaded tokenizer template so/load, thealready_loadedbranch and/statusall advertise the same flags GGUF does. New safetensors tool-calling SSE branch inPOST /chat/completionsmirrors the GGUF flow (system prompt nudge, tool subset filtering, stale-XML scrubbing of prior assistant turns). gpt-oss is intentionally gated out of the safetensors tool path because Harmony uses a dedicated channel for tool calls rather than<tool_call>XML; GGUF still serves that case.Tests:
tests/test_safetensors_tool_loop.py-- 22 tests covering parser shapes (closed/unclosed JSON,<function=...>XML, embedded</parameter>in code, multiple calls, bad JSON), agentic-loop flow (plain answers, single tool then answer, truncated unclosed call, JSON-string arguments healed to{\"query\": ...}), behaviour (duplicate-call short-circuit, image-sentinel survival, tThis PR tracks the moving review branch (pr-5520-head). Iteration fix commits land here directly. Review-added tests are in a separate PR.
Changed files:
.github/workflows/consolidated-tests-ci.yml.github/workflows/lint-ci.yml.github/workflows/mlx-ci.yml.github/workflows/notebooks-ci.yml.github/workflows/release-desktop.yml.github/workflows/security-audit.yml.github/workflows/stale.yml.github/workflows/studio-api-smoke.yml.github/workflows/studio-backend-ci.yml.github/workflows/studio-frontend-ci.yml.github/workflows/studio-inference-smoke.yml.github/workflows/studio-mac-api-smoke.yml.github/workflows/studio-mac-inference-smoke.yml.github/workflows/studio-mac-ui-smoke.yml.github/workflows/studio-mac-update-smoke.yml.github/workflows/studio-tauri-smoke.yml.github/workflows/studio-ui-smoke.yml.github/workflows/studio-update-smoke.yml.github/workflows/studio-windows-api-smoke.yml.github/workflows/studio-windows-inference-smoke.yml.github/workflows/studio-windows-ui-smoke.yml.github/workflows/studio-windows-update-smoke.yml.github/workflows/version-compat-ci.yml.github/workflows/wheel-smoke.ymlstudio/backend/core/inference/chat_template_helpers.pystudio/backend/core/inference/inference.pystudio/backend/core/inference/llama_cpp.pystudio/backend/core/inference/orchestrator.pystudio/backend/core/inference/safetensors_agentic.pystudio/backend/core/inference/tool_call_parser.py