Studio: tools, thinking blocks, code execution and web search for safetensors#84

Open: danielhanchen wants to merge 6 commits into main from pr-5520-head

@danielhanchen

Member

Staging mirror of unslothai#5520

Original PR: unslothai#5520
Author: danielhanchen

This is a staging copy for review and editing. Once finalized, changes will be pushed back to the original PR.


Original description

Summary

Today only the GGUF (llama-server) backend in Studio streams tool calls, thinking blocks, sandboxed Python/Bash execution and web search through the agentic loop. The transformers/safetensors backend is inference-only: no tools, no template-level reasoning controls, no agentic loop. This PR brings safetensors to feature parity for non-vision text chat while leaving the GGUF path untouched.

What changes

Backend:

  • core/inference/tool_call_parser.py (new) -- backend-neutral parse_tool_calls_from_text, strip_tool_markup, has_tool_signal and the shared regex set. LlamaCppBackend._parse_tool_calls_from_text delegates here so both backends fix-forward together.
  • core/inference/safetensors_agentic.py (new) -- cumulative-text agentic loop with a 3-state buffer (BUFFERING / STREAMING / DRAINING). Emits the same status / content / tool_start / tool_end / metadata events as the GGUF path so the frontend renders both backends identically. Handles duplicate-call short-circuit, __IMAGES__ sentinel stripping before model feedback, error nudge, cancel_event, max_tool_iterations cap and a final-answer attempt.
  • core/inference/inference.py -- generate_chat_response now accepts tools / enable_thinking / reasoning_effort / preserve_thinking. New _apply_chat_template_for_generation peels unsupported kwargs off the template call in safe order (richest first) so older chat templates still render. New generate_chat_completion_with_tools wraps the agentic loop.
  • core/inference/orchestrator.py -- forwards the new kwargs through both IPC paths (gen and dispatched); new generate_chat_completion_with_tools drives the loop from the parent process so tools run alongside the existing route-layer plumbing.
  • core/inference/worker.py -- pulls tools / enable_thinking / reasoning_effort / preserve_thinking from the cmd dict when present and forwards them to backend.generate_chat_response.
  • routes/inference.py -- shared _detect_safetensors_features helper that calls the existing detect_reasoning_flags on the loaded tokenizer template so /load, the already_loaded branch and /status all advertise the same flags GGUF does. New safetensors tool-calling SSE branch in POST /chat/completions mirrors the GGUF flow (system prompt nudge, tool subset filtering, stale-XML scrubbing of prior assistant turns). gpt-oss is intentionally gated out of the safetensors tool path because Harmony uses a dedicated channel for tool calls rather than <tool_call> XML; GGUF still serves that case.
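
As a rough illustration of what a backend-neutral parser for `<tool_call>` markup might look like, here is a minimal sketch. The function name `sketch_parse_tool_calls`, the regex, and the returned dict shape are assumptions made for this example; they are not the contents of `tool_call_parser.py`, and the `<function=...>` XML shape is not handled.

```python
import json
import re

# Hypothetical pattern: an opening tag, a JSON object, then either the closing
# tag or the end of the text (to tolerate a truncated, unclosed call).
_TOOL_CALL_RE = re.compile(
    r"<tool_call>\s*(\{.*?\})\s*(?:</tool_call>|$)", re.DOTALL
)


def sketch_parse_tool_calls(text):
    """Return a list of {"name", "arguments"} dicts found in model output."""
    calls = []
    for match in _TOOL_CALL_RE.finditer(text):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            # Bad JSON is skipped rather than raised.
            continue
        if isinstance(payload, dict) and "name" in payload:
            calls.append(
                {
                    "name": payload["name"],
                    "arguments": payload.get("arguments", {}),
                }
            )
    return calls
```

Keeping the parser free of backend imports is what would let both the GGUF and safetensors paths call the same function.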

Tests:

  • tests/test_safetensors_tool_loop.py -- 22 tests covering parser shapes (closed/unclosed JSON, <function=...> XML, embedded </parameter> in code, multiple calls, bad JSON), agentic-loop flow (plain answers, single tool then answer, truncated unclosed call, JSON-string arguments healed to {"query": ...}), behaviour (duplicate-call short-circuit, image-sentinel survival, t

This PR tracks the moving review branch (pr-5520-head). Iteration fix commits land here directly. Review-added tests are in a separate PR.

Changed files:

  • .github/workflows/consolidated-tests-ci.yml
  • .github/workflows/lint-ci.yml
  • .github/workflows/mlx-ci.yml
  • .github/workflows/notebooks-ci.yml
  • .github/workflows/release-desktop.yml
  • .github/workflows/security-audit.yml
  • .github/workflows/stale.yml
  • .github/workflows/studio-api-smoke.yml
  • .github/workflows/studio-backend-ci.yml
  • .github/workflows/studio-frontend-ci.yml
  • .github/workflows/studio-inference-smoke.yml
  • .github/workflows/studio-mac-api-smoke.yml
  • .github/workflows/studio-mac-inference-smoke.yml
  • .github/workflows/studio-mac-ui-smoke.yml
  • .github/workflows/studio-mac-update-smoke.yml
  • .github/workflows/studio-tauri-smoke.yml
  • .github/workflows/studio-ui-smoke.yml
  • .github/workflows/studio-update-smoke.yml
  • .github/workflows/studio-windows-api-smoke.yml
  • .github/workflows/studio-windows-inference-smoke.yml
  • .github/workflows/studio-windows-ui-smoke.yml
  • .github/workflows/studio-windows-update-smoke.yml
  • .github/workflows/version-compat-ci.yml
  • .github/workflows/wheel-smoke.yml
  • studio/backend/core/inference/chat_template_helpers.py
  • studio/backend/core/inference/inference.py
  • studio/backend/core/inference/llama_cpp.py
  • studio/backend/core/inference/orchestrator.py
  • studio/backend/core/inference/safetensors_agentic.py
  • studio/backend/core/inference/tool_call_parser.py

danielhanchen and others added 5 commits May 17, 2026 13:56
The GGUF/llama-server backend already streams tool_start/tool_end events,
strips <tool_call> XML, parses <think> blocks, and runs an agentic loop
through web_search / python / terminal. The transformers/safetensors
backend was inference-only: no tools, no template-level reasoning
controls, no agentic loop. This change brings safetensors to parity for
non-vision text chat while leaving the GGUF path untouched.

Backend changes:

- core/inference/tool_call_parser.py (new): backend-neutral
  parse_tool_calls_from_text, strip_tool_markup, has_tool_signal, and
  shared regex/strip patterns. LlamaCppBackend._parse_tool_calls_from_text
  delegates here, so both paths fix-forward together.
- core/inference/safetensors_agentic.py (new): cumulative-text agentic
  loop with a 3-state buffer (BUFFERING, STREAMING, DRAINING). Yields
  the same status / content / tool_start / tool_end / metadata events
  the GGUF path already emits. Handles duplicate-call short-circuit,
  __IMAGES__ sentinel stripping before model feedback, error-prefix
  tagging, cancel_event, and max_tool_iterations capping.
- core/inference/inference.py: generate_chat_response now accepts
  tools / enable_thinking / reasoning_effort / preserve_thinking;
  _apply_chat_template_for_generation peels unsupported kwargs off the
  template call in safe order (richest first). New
  generate_chat_completion_with_tools method wraps the agentic loop.
- core/inference/orchestrator.py: forwards the new kwargs through IPC
  (gen + dispatched paths); adds generate_chat_completion_with_tools
  that drives the loop from the parent process.
- core/inference/worker.py: pulls tools/enable_thinking/reasoning_effort/
  preserve_thinking from the cmd dict when present and forwards to
  backend.generate_chat_response.
- routes/inference.py: shared _detect_safetensors_features helper that
  calls detect_reasoning_flags on the loaded tokenizer template so the
  load/already_loaded/status endpoints all advertise the same flags
  GGUF does. New safetensors tool-calling SSE branch in
  POST /chat/completions that mirrors the GGUF flow (system prompt
  nudge, tool subset filtering, stale-XML scrubbing of prior
  assistant turns). gpt-oss is gated out of the safetensors tool path
  because Harmony uses a dedicated channel for tool calls rather than
  <tool_call> XML; GGUF still serves that case.

Tests:

- tests/test_safetensors_tool_loop.py: 22 tests covering parser
  shapes (closed/unclosed JSON, function/parameter XML, embedded
  </parameter> in code, multiple calls, bad JSON), agentic-loop
  control flow (plain answers, single tool then answer, truncated
  unclosed call, JSON-string arguments healed to {"query": ...}),
  behaviour (duplicate-call short-circuit, image-sentinel survival,
  tool error nudge, raised exceptions caught), and control
  (cancel_event break, max_tool_iterations cap).
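
The "JSON-string arguments healed" case can be pictured with a small sketch like the one below. The helper name `heal_arguments` and the empty-dict fallback for anything that does not decode to an object are assumptions for illustration only, not the PR's actual code.

```python
import json


def heal_arguments(raw_args):
    """Coerce tool-call arguments into a dict (hypothetical helper)."""
    if isinstance(raw_args, dict):
        return raw_args
    if isinstance(raw_args, str):
        try:
            parsed = json.loads(raw_args)
            if isinstance(parsed, dict):
                return parsed
        except (json.JSONDecodeError, ValueError):
            pass
    # Assumption: anything that is not a JSON object falls back to no arguments.
    return {}
```

For example, `heal_arguments('{"query": "unsloth"}')` yields the dict `{"query": "unsloth"}`, so the tool executor never sees a raw string.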

Backwards compatibility:

- LlamaCppBackend._parse_tool_calls_from_text keeps the same signature
  and behaviour.
- All new IPC kwargs are optional and only added to the cmd dict when
  set, so older worker payloads are unaffected.
- The SSE event protocol matches the existing GGUF tool stream so the
  frontend tool UI works unchanged.
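
A minimal sketch of the "only added to the cmd dict when set" idea follows. The function names and the `"type": "generate"` field are hypothetical and chosen for this example; only the four kwarg names come from the change description.

```python
_OPTIONAL_KEYS = ("tools", "enable_thinking", "reasoning_effort", "preserve_thinking")


def build_generate_cmd(messages, **optional):
    """Parent side: include an optional kwarg only when the caller set it."""
    cmd = {"type": "generate", "messages": messages}  # hypothetical base payload
    for key in _OPTIONAL_KEYS:
        value = optional.get(key)
        if value is not None:  # False is a real value and must be forwarded
            cmd[key] = value
    return cmd


def read_generate_kwargs(cmd):
    """Worker side: forward only the keys that are actually present."""
    return {key: cmd[key] for key in _OPTIONAL_KEYS if key in cmd}
```

With this shape, a payload built without any of the new kwargs is identical to one an older parent would have sent.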
…e helper

Two follow-ups from the comprehensive simulation pass:

1. Bug: assistant prose containing the literal string "<tool_call>" was
   silently truncated.

   The STREAMING end-of-stream branch re-yielded the cumulative content
   with ``strip_tool_markup(..., final=True)`` whenever the parser found
   no real tool calls. ``final=True`` removes any trailing unclosed
   ``<tool_call>.*$`` run, which dropped legitimate prose mentioning the
   literal text (e.g. "the docs say <tool_call> means an LLM tool"). The
   streaming pass already emitted the cleaned cumulative content via
   partial strips, so the final re-yield was redundant and only ever
   hid real text. Drop it; the DRAINING-no-parse fallback now surfaces
   the raw content_accum instead of the final-stripped version.

   Adds regression tests covering both the prose case and the case
   where the tool RESULT text contains the literal "<tool_call>" (the
   loop must only parse model output, not tool results).

2. Extract _apply_chat_template_for_generation into
   core/inference/chat_template_helpers.apply_chat_template_for_generation
   so its kwarg-fallback chain (richest call first, peel off groups
   on TypeError, propagate real Jinja errors) can be unit-tested
   without pulling unsloth / torch / transformers into the sandbox.
   InferenceBackend's method becomes a thin delegate.
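
The kwarg-fallback chain described in item 2 could be sketched roughly as below. Here `render` is a stand-in for the tokenizer's template call, and the function name and the exact order in which kwarg groups are peeled off are assumptions for illustration; the real helper may group them differently.

```python
def apply_template_with_fallback(
    render, messages, tools=None, enable_thinking=None, reasoning_effort=None
):
    """Try the richest call first, then peel off kwarg groups on TypeError."""
    reasoning = {}
    if enable_thinking is not None:
        reasoning["enable_thinking"] = enable_thinking
    if reasoning_effort is not None:
        reasoning["reasoning_effort"] = reasoning_effort
    tool_kwargs = {"tools": tools} if tools else {}

    attempts = [
        {**tool_kwargs, **reasoning},  # richest: tools plus reasoning controls
        dict(tool_kwargs),             # assumed next step: drop reasoning kwargs
        {},                            # plain call for older templates
    ]
    last_exc = None
    for extra in attempts:
        try:
            return render(messages, **extra)
        except TypeError as exc:
            # Unsupported kwarg: remember the error and try a leaner call.
            last_exc = exc
    # Any non-TypeError (e.g. a template error) has already propagated above.
    raise last_exc
```

Because the helper only needs a callable, it can be exercised in a unit test with a plain function in place of a tokenizer.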

Tests:

- TestProseMentioningToolCall: two new tests for the truncation
  regression and the tool-result-text safety case.
- TestChatTemplateHelper: five new tests for the helper's fallback
  chain across template-kwarg permutations and the Jinja-error
  propagate behaviour.
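
To make the truncation regression concrete, here is a small reproduction using a simplified stand-in for the stripper. The function `strip_markup_sketch` and both regexes are assumptions written for this example; only the ``final=True`` behaviour of dropping a trailing unclosed ``<tool_call>.*$`` run is taken from the description above.

```python
import re

_CLOSED = re.compile(r"<tool_call>.*?</tool_call>", re.DOTALL)
_UNCLOSED_TAIL = re.compile(r"<tool_call>.*$", re.DOTALL)


def strip_markup_sketch(text, final=False):
    """Remove closed tool-call blocks; with final=True also drop an unclosed tail."""
    text = _CLOSED.sub("", text)
    if final:
        text = _UNCLOSED_TAIL.sub("", text)
    return text


prose = "the docs say <tool_call> means an LLM tool"
partial = strip_markup_sketch(prose)              # prose is left intact
final = strip_markup_sketch(prose, final=True)    # everything after the tag is lost
```

The partial strip keeps the sentence whole, while the final strip cuts it at the literal tag, which is the behaviour the regression tests pin down.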

All 29 tests in test_safetensors_tool_loop.py pass; the full related
suite (202 tests across test_safetensors_tool_loop,
test_openai_tool_passthrough, test_responses_tool_passthrough,
test_inference_model_validation, test_anthropic_thinking_translation,
test_anthropic_code_execution, test_anthropic_messages) is green.
_sf_tracker.__exit__(None, None, None)

return StreamingResponse(
sf_tool_stream(),
],
exec_results = ["result"],
)
events = _collect_events(loop)
],
exec_results = ["..."],
)
events = _collect_events(loop)
],
exec_results = ["Page text: <tool_call> appears here in the docs"],
)
events = _collect_events(loop)
flags["supports_reasoning"] = True
flags["reasoning_style"] = "reasoning_effort"
flags["supports_tools"] = False
except Exception:
parsed = json.loads(raw_args)
if isinstance(parsed, dict):
return parsed
except (json.JSONDecodeError, ValueError):
tc["function"]["arguments"]
)
tool_calls.append(tc)
except (json.JSONDecodeError, ValueError):
@danielhanchen

Member Author

/gemini review

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces agentic tool-calling and reasoning controls for the safetensors/transformers backend, providing feature parity with the GGUF implementation. The changes include a new backend-neutral tool-call parser, an agentic loop for cumulative text generators, and updates to the inference orchestrator and API routes to support these features. Additionally, a version flag was added to the CLI. Review feedback points out unreachable code in the chat template helper and suggests a more robust approach for duplicate tool call detection using sorted JSON serialization.

"arguments": arguments,
}

tc_key = tool_name + str(arguments)

medium

Using str(arguments) to generate a key for duplicate tool call detection is unreliable. In Python 3.7+, dictionaries preserve insertion order. This means two dictionaries with identical keys and values but different insertion orders (which can happen if the model outputs JSON keys in a different sequence) will produce different strings from str(), causing the duplicate detection to fail. Using json.dumps(arguments, sort_keys=True) ensures a stable, canonical key regardless of insertion order.

Suggested change
- tc_key = tool_name + str(arguments)
+ tc_key = tool_name + json.dumps(arguments, sort_keys = True)
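
A short sketch of why the two keys behave differently is given below. The helper names `naive_key` and `canonical_key` and the sample arguments are made up for this example; the key expressions mirror the two lines in the suggestion.

```python
import json


def naive_key(tool_name, arguments):
    # str() of a dict follows insertion order, so key order leaks into the key.
    return tool_name + str(arguments)


def canonical_key(tool_name, arguments):
    # sort_keys=True serialises keys in sorted order regardless of insertion.
    return tool_name + json.dumps(arguments, sort_keys=True)


first = {"query": "unsloth", "max_results": 5}
second = {"max_results": 5, "query": "unsloth"}

same_call = first == second                                   # equal as dicts
naive_match = naive_key("web_search", first) == naive_key("web_search", second)
canonical_match = canonical_key("web_search", first) == canonical_key("web_search", second)
```

Here the two dicts compare equal, the naive keys differ, and the canonical keys match, so only the second form would trigger the duplicate-call short-circuit.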

Comment on lines +65 to +67
return tokenizer.apply_chat_template(
messages, tokenize = False, add_generation_prompt = True
)

low

This block is unreachable. The attempts list (line 46) always includes an empty dictionary {} as its final element. If tokenizer.apply_chat_template succeeds with this empty dictionary, it returns from within the loop (line 51). If it fails with a TypeError, the error is caught and stored in last_exc, the loop terminates, and last_exc is raised at line 64. If it fails with any other Exception, the loop breaks and raises at line 64. Consequently, execution can never reach lines 65-67.
