github ggml-org/llama.cpp b8113


common : fix Step-3.5-Flash format detection and thinking support (#19635)

  • common : fix Step-3.5-Flash format detection and thinking support

Step-3.5-Flash uses the same XML-style tool call format as Qwen3-Coder
(<tool_call><function=...><parameter=...>) but its Jinja template lacks
the bare <function> and plural <parameters> markers that the detection
logic previously required. This caused it to fall through to Hermes 2
Pro, which doesn't call func_args_not_string(), so arguments stayed as
JSON strings and templates using arguments|items crashed.

Additionally, the Qwen3-Coder-XML format handler had no thinking support.
Models like Step-3.5-Flash that unconditionally emit <think> in their
generation prompt need the same thinking_forced_open handling that
Nemotron v3 and Hermes 2 Pro already have, otherwise reasoning_content
is never separated from content in API responses.
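
A minimal sketch of what forced-open thinking means when splitting a
response, assuming the reasoning markers are <think> and </think>. The
struct and function names are hypothetical and this is not the llama.cpp
implementation; only thinking_forced_open and reasoning_content come from
the description above.

```cpp
#include <cassert>
#include <string>

// Hypothetical result type; field names mirror the API fields named above.
struct SplitMessage {
    std::string reasoning_content;
    std::string content;
};

// Splits raw model output into reasoning and visible content.
// thinking_forced_open == true means the generation prompt already ended
// with the opening tag, so the output starts inside the reasoning block.
inline SplitMessage split_reasoning(const std::string & output, bool thinking_forced_open) {
    static const std::string open_tag  = "<think>";   // assumed marker
    static const std::string close_tag = "</think>";  // assumed marker

    SplitMessage msg;
    size_t start = 0;
    if (!thinking_forced_open) {
        // Without a forced-open prompt, reasoning only exists if the
        // output itself begins with the opening tag.
        if (output.compare(0, open_tag.size(), open_tag) != 0) {
            msg.content = output;
            return msg;
        }
        start = open_tag.size();
    }

    const size_t end = output.find(close_tag, start);
    if (end == std::string::npos) {
        // No closing tag yet (e.g. mid-stream): all of it is reasoning.
        msg.reasoning_content = output.substr(start);
        return msg;
    }
    msg.reasoning_content = output.substr(start, end - start);
    msg.content           = output.substr(end + close_tag.size());
    return msg;
}

// Usage: the same output splits differently depending on the flag.
inline void split_reasoning_self_check() {
    const std::string out = "plan the call</think>Hello";
    assert(split_reasoning(out, true).reasoning_content == "plan the call");
    assert(split_reasoning(out, true).content == "Hello");
    // Without the flag, no opening tag is seen, so nothing is separated.
    assert(split_reasoning(out, false).content == out);
}
```

Without the flag, the handler never sees an opening tag in the output, so
the whole string, closing tag included, ends up in content. That is the
symptom described above.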

Changes:

  • Relax Qwen3-Coder XML detection to only require the 3 shared markers
  • Tighten Nemotron v3 branch to also require bare <function> and plural
    <parameters>, preventing Step-3.5-Flash from being misrouted via <think>
  • Add thinking_forced_open support to Qwen3-Coder-XML init function
  • Add <think>/</think> to preserved tokens
  • Fix build_grammar_xml_tool_call to handle thinking_forced_open in the
    grammar root rule, allowing </think> before tool calls
  • Add Step-3.5-Flash chat template and format detection test

Builds on: #19283

  • chat : route Step-3.5-Flash to Nemotron v3 PEG parser, add tests

Step-3.5-Flash uses the same XML tool call format as Qwen3-Coder and
Nemotron 3 Nano (<tool_call>/<function=...>/<parameter=...>) but with
unconditional <think> output. Route it to the Nemotron v3 PEG parser
for streaming and schema-aware parameter parsing.

Detection: templates with <think> + XML tool tags use Nemotron v3 PEG
parser; templates without <think> (Qwen3-Coder) use GBNF grammar.
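
The detection rule above can be sketched as a substring check over the
template source. This is an illustration only: the enum, the function
names, and the choice of <think> as the thinking marker are assumptions,
and the real detection code in llama.cpp may test additional markers.

```cpp
#include <cassert>
#include <string>

// Hypothetical labels for the two parsing paths named above.
enum class ToolCallParser { NEMOTRON_V3_PEG, QWEN3_CODER_GBNF, OTHER };

static bool tmpl_contains(const std::string & src, const std::string & needle) {
    return src.find(needle) != std::string::npos;
}

// tmpl_src is the raw Jinja chat template text.
inline ToolCallParser pick_tool_call_parser(const std::string & tmpl_src) {
    // The three XML tool tags shared by Qwen3-Coder, Nemotron 3 Nano
    // and Step-3.5-Flash.
    const bool has_xml_tool_tags =
        tmpl_contains(tmpl_src, "<tool_call>") &&
        tmpl_contains(tmpl_src, "<function=") &&
        tmpl_contains(tmpl_src, "<parameter=");
    if (!has_xml_tool_tags) {
        return ToolCallParser::OTHER;  // some other format handles it
    }
    // Thinking marker present: PEG parser; absent: GBNF grammar path.
    return tmpl_contains(tmpl_src, "<think>")  // assumed marker
        ? ToolCallParser::NEMOTRON_V3_PEG
        : ToolCallParser::QWEN3_CODER_GBNF;
}

// Usage: two toy template fragments that differ only in the thinking tag.
inline void pick_tool_call_parser_self_check() {
    const std::string tools = "<tool_call><function=f><parameter=p>";
    assert(pick_tool_call_parser(tools) == ToolCallParser::QWEN3_CODER_GBNF);
    assert(pick_tool_call_parser("<think>" + tools) == ToolCallParser::NEMOTRON_V3_PEG);
}
```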

Tests cover: basic messages, tool calls with/without thinking content,
parallel tool calls, code string parameters, optional
closing tags, and JSON schema response format.

  • chat : remove dead thinking code from qwen3_coder_xml

Remove thinking handling code that became unreachable after routing
Step-3.5-Flash to the Nemotron v3 PEG parser. Qwen3-Coder has no <think>
in its template, so the thinking_forced_open logic, preserved
tokens, and grammar prefix were dead paths.
