fix: execute tool calls emitted as plain text in interactive mode#673
fix: execute tool calls emitted as plain text in interactive mode#673mvanhorn wants to merge 1 commit into
Conversation
Greptile SummaryThis PR adds interactive recovery for tool calls returned as text. The main changes are:
Confidence Score: 5/5The changed flow looks safe to merge after a small cleanup to the JSON detection heuristic.
strix/core/execution.py Important Files Changed
Prompt To Fix All With AIFix the following 1 code review issue. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 1
strix/core/execution.py:571-576
**Generic JSON Triggers Retry**
When an interactive answer is valid plain JSON such as `{"action":"report","params":{...}}` or `{"name":"example","input":"..."}`, this predicate treats it as an unexecuted tool call because the keys overlap the heuristic. The run then appends corrective tool-use prompts until the retry limit before parking, so a normal answer can consume extra turns and mutate the session just because its schema resembles a tool call.
Reviews (1): Last reviewed commit: "fix: execute tool calls returned by Olla..." | Re-trigger Greptile |
| tool_name = payload.get("action") or payload.get("tool") or payload.get("name") | ||
| return ( | ||
| isinstance(tool_name, str) | ||
| and bool(tool_name.strip()) | ||
| and any(key in payload for key in _TOOL_ARGUMENT_KEYS) | ||
| ) |
There was a problem hiding this comment.
When an interactive answer is valid plain JSON such as {"action":"report","params":{...}} or {"name":"example","input":"..."}, this predicate treats it as an unexecuted tool call because the keys overlap the heuristic. The run then appends corrective tool-use prompts until the retry limit before parking, so a normal answer can consume extra turns and mutate the session just because its schema resembles a tool call.
Prompt To Fix With AI
This is a comment left during a code review.
Path: strix/core/execution.py
Line: 571-576
Comment:
**Generic JSON Triggers Retry**
When an interactive answer is valid plain JSON such as `{"action":"report","params":{...}}` or `{"name":"example","input":"..."}`, this predicate treats it as an unexecuted tool call because the keys overlap the heuristic. The run then appends corrective tool-use prompts until the retry limit before parking, so a normal answer can consume extra turns and mutate the session just because its schema resembles a tool call.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Fair concern. Two bounds keep this from mutating a normal session much: the predicate only runs when the cycle produced no executed tool call and no lifecycle settle, and retries are capped by the same invalid_final_output_limit the non-interactive path uses - after the cap it falls back to the existing park-and-wait, so worst case for a JSON-shaped plain answer is a couple of extra turns. Validating the extracted name against registered tools would be tighter, but the sandbox-bound tools are attached per-run in the agents factory and aren't visible from module scope here, so that check would need a larger refactor than this fix wants to carry. Happy to tighten the key heuristic if maintainers prefer.
There was a problem hiding this comment.
The explanation is fair and the guard conditions do meaningfully reduce the blast radius. That said, the original concern still partially holds and there's a concrete way to tighten it without needing the per-run tool registry.
The "name" key in _is_tool_call_payload is the main culprit. It's very generic — {"name": "Alice", "input": "hello"}, {"name": "report", "parameters": {...}}, {"name": "success", "arguments": []} all match. The stricter keys (tool_calls, function) have structural nesting that makes false positives much less likely, and "action" + "tool" are more domain-specific. Dropping "name" from the tool-name candidates would cut the false-positive rate significantly at essentially no cost to the actual recovery path, since LiteLLM/Ollama plain-text tool blobs use action or tool, not name.
| tool_name = payload.get("action") or payload.get("tool") or payload.get("name") | |
| return ( | |
| isinstance(tool_name, str) | |
| and bool(tool_name.strip()) | |
| and any(key in payload for key in _TOOL_ARGUMENT_KEYS) | |
| ) | |
| tool_name = payload.get("action") or payload.get("tool") | |
| return ( | |
| isinstance(tool_name, str) | |
| and bool(tool_name.strip()) | |
| and any(key in payload for key in _TOOL_ARGUMENT_KEYS) | |
| ) |
If name needs to stay for some backend that emits it, consider requiring type == "function" alongside it (matching the OpenAI tool-call schema), which filters out almost all natural-language JSON:
name_key = payload.get("action") or payload.get("tool") or (
payload.get("name") if payload.get("type") == "function" else None
)The bounding arguments are reasonable — the concern is more about the quality of the in-session experience (spurious corrective prompts visible in the transcript) than catastrophic mutation. Either of those tightenings would address that without touching the architecture.
Tip: You can customize Greptile's behavior for this repo with .greptile/rules.md and .greptile/config.json.
Summary
Interactive runs no longer park on
Send message to resumewhen a model emits tool-call JSON as plain text. The run loop now injects a bounded corrective re-prompt (same recovery the non-interactive path already has), so the tool call gets executed instead of displayed.Why this matters
With LiteLLM-routed backends (reported on
ollama/qwen3:8b, also seen on Gemma and larger models), a cycle sometimes ends with the raw{"action": ..., "params": ...}blob as final output. The interactive path (run_agent_loop->_run_cycle->_settle_run_result) settles the agent towaitingand blocks oncoordinator.wait_for_message, so the scan stalls until a human types something. @AkikoOrenji diagnosed the same gap in the thread: some models intermittently fail to finalise the tool call and Strix has no mechanism to force a retry.The non-interactive path already solves this:
_run_noninteractive_until_lifecycledetects a non-lifecycle final output and injects a corrective message via_append_noninteractive_tool_required_message, bounded byinvalid_final_output_limit. This PR gives the interactive path the same recovery.Changes
Testing
pytest tests/test_execution.py- 8 passed, covering: corrective re-prompt on tool-call-shaped final output, successful execution on retry, bounded fallback towaitingafter the retry limit, plain-text answers still parking normally, and the non-interactive path unchanged.Fixes #520