llama.cpp/models
Jesse Posner 3dadc88b58
common : fix Step-3.5-Flash format detection and thinking support (#19635)
* common : fix Step-3.5-Flash format detection and thinking support

Step-3.5-Flash uses the same XML-style tool call format as Qwen3-Coder
(<tool_call><function=...><parameter=...>) but its Jinja template lacks
the bare <function> and plural <parameters> markers that the detection
logic previously required. This caused it to fall through to Hermes 2
Pro, which doesn't call func_args_not_string(), so arguments stay as
JSON strings and templates using arguments|items crash.
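A minimal sketch of the relaxed detection, with a hypothetical helper name (the real logic lives in llama.cpp's chat-format detection code): only the three markers shared by Qwen3-Coder and Step-3.5-Flash are required, instead of also demanding the bare <function> and plural <parameters> markers that Step-3.5-Flash's template lacks.

```cpp
#include <string>

// Hypothetical sketch: recognize Qwen3-Coder-style XML tool-call
// templates by the three markers both templates share, so that
// Step-3.5-Flash no longer falls through to Hermes 2 Pro.
static bool is_qwen3_coder_xml_like(const std::string & tmpl) {
    return tmpl.find("<tool_call>") != std::string::npos &&
           tmpl.find("<function=")  != std::string::npos &&
           tmpl.find("<parameter=") != std::string::npos;
}
```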

Additionally, the Qwen3-Coder-XML format handler had no thinking support.
Models like Step-3.5-Flash that unconditionally emit <think> in their
generation prompt need the same thinking_forced_open handling that
Nemotron v3 and Hermes 2 Pro already have; otherwise reasoning_content
is never separated from content in API responses.

Changes:
- Relax Qwen3-Coder XML detection to only require the 3 shared markers
- Tighten Nemotron v3 branch to also require bare <function> and plural
  <parameters>, preventing Step-3.5-Flash from being misrouted via <think>
- Add thinking_forced_open support to Qwen3-Coder-XML init function
- Add <think>/</think> to preserved tokens
- Fix build_grammar_xml_tool_call to handle thinking_forced_open in the
  grammar root rule, allowing </think> before tool calls
- Add Step-3.5-Flash chat template and format detection test
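The grammar-root change can be sketched as follows (hypothetical rule names; the actual rules are built in build_grammar_xml_tool_call): when the template has already opened a <think> block, the root rule must first accept reasoning text terminated by </think> before the tool-call alternatives apply.

```cpp
#include <string>

// Hypothetical sketch of the root-rule adjustment: with
// thinking_forced_open, free-form reasoning closed by </think> is
// allowed before tool calls; otherwise tool calls start immediately.
static std::string build_root_rule(bool thinking_forced_open) {
    const std::string tool_calls = "tool-call+";
    if (thinking_forced_open) {
        return "root ::= reasoning-content? \"</think>\" space " + tool_calls;
    }
    return "root ::= " + tool_calls;
}
```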

Builds on: https://github.com/ggml-org/llama.cpp/pull/19283

* chat : route Step-3.5-Flash to Nemotron v3 PEG parser, add tests

Step-3.5-Flash uses the same XML tool call format as Qwen3-Coder and
Nemotron 3 Nano (<tool_call>/<function=...>/<parameter=...>) but with
unconditional <think> output. Route it to the Nemotron v3 PEG parser
for streaming and schema-aware parameter parsing.

Detection: templates with <think> + XML tool tags use Nemotron v3 PEG
parser; templates without <think> (Qwen3-Coder) use GBNF grammar.
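The routing rule above can be sketched as a single branch (hypothetical enum and function names; the real dispatch sits in the chat-format detection code):

```cpp
#include <string>

enum class xml_toolcall_route { NEMOTRON_V3_PEG, QWEN3_CODER_GBNF };

// Hypothetical sketch: XML tool-call templates that unconditionally
// emit <think> (e.g. Step-3.5-Flash) take the streaming PEG parser;
// those without it (Qwen3-Coder) keep the GBNF grammar path.
static xml_toolcall_route route_xml_template(const std::string & tmpl) {
    return tmpl.find("<think>") != std::string::npos
        ? xml_toolcall_route::NEMOTRON_V3_PEG
        : xml_toolcall_route::QWEN3_CODER_GBNF;
}
```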

Tests cover: basic messages, tool calls with/without thinking content,
parallel tool calls, code string parameters, optional </parameter>
closing tags, and JSON schema response format.

* chat : remove dead thinking code from qwen3_coder_xml

Remove thinking handling code that became unreachable after routing
Step-3.5-Flash to the Nemotron v3 PEG parser. Qwen3-Coder has no
<think> in its template, so the thinking_forced_open logic, preserved
tokens, and grammar prefix were dead paths.
2026-02-19 22:40:52 +01:00
templates common : fix Step-3.5-Flash format detection and thinking support (#19635) 2026-02-19 22:40:52 +01:00
.editorconfig gguf : new file format with flexible meta data (beta) (#2398) 2023-08-21 23:07:43 +03:00
ggml-vocab-aquila.gguf Work on the BPE tokenizer (#3252) 2023-10-03 09:16:26 +02:00
ggml-vocab-baichuan.gguf Add more tokenizer tests (#3742) 2023-10-24 09:17:17 +02:00
ggml-vocab-bert-bge.gguf llama : fix BPE pre-tokenization (#6920) 2024-04-29 16:58:41 +03:00
ggml-vocab-bert-bge.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-bert-bge.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-command-r.gguf command-r : add BPE pre-tokenization (#7063) 2024-05-05 08:19:30 +03:00
ggml-vocab-command-r.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-command-r.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-deepseek-coder.gguf llama : fix BPE pre-tokenization (#6920) 2024-04-29 16:58:41 +03:00
ggml-vocab-deepseek-coder.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-deepseek-coder.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-deepseek-llm.gguf llama : fix BPE pre-tokenization (#6920) 2024-04-29 16:58:41 +03:00
ggml-vocab-deepseek-llm.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-deepseek-llm.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-falcon.gguf llama : fix BPE pre-tokenization (#6920) 2024-04-29 16:58:41 +03:00
ggml-vocab-falcon.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-falcon.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-gpt-2.gguf llama : fix BPE pre-tokenization (#6920) 2024-04-29 16:58:41 +03:00
ggml-vocab-gpt-2.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-gpt-2.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-gpt-neox.gguf Add more tokenizer tests (#3742) 2023-10-24 09:17:17 +02:00
ggml-vocab-llama-bpe.gguf llama : fix BPE pre-tokenization (#6920) 2024-04-29 16:58:41 +03:00
ggml-vocab-llama-bpe.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-llama-bpe.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-llama-spm.gguf llama : fix BPE pre-tokenization (#6920) 2024-04-29 16:58:41 +03:00
ggml-vocab-llama-spm.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-llama-spm.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-mpt.gguf llama : fix BPE pre-tokenization (#6920) 2024-04-29 16:58:41 +03:00
ggml-vocab-mpt.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-mpt.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-nomic-bert-moe.gguf tests : improve UGM tokenizer test coverage (#13773) 2025-05-25 16:22:29 +02:00
ggml-vocab-phi-3.gguf Per token attributes (#7685) 2024-06-04 09:17:17 +02:00
ggml-vocab-phi-3.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-phi-3.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-qwen2.gguf llama : add BPE pre-tokenization for Qwen2 (#7114) 2024-05-08 15:06:43 +03:00
ggml-vocab-qwen2.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-qwen2.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-refact.gguf tests : add test-tokenizer-0.sh + fix some tokenizers (#7036) 2024-05-04 08:32:32 +03:00
ggml-vocab-refact.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-refact.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-starcoder.gguf llama : fix BPE pre-tokenization (#6920) 2024-04-29 16:58:41 +03:00
ggml-vocab-starcoder.gguf.inp convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00
ggml-vocab-starcoder.gguf.out convert : allow partial update to the chkhsh pre-tokenizer list (#13847) 2025-05-30 12:24:37 +02:00