grammar : fix lazy trigger crash during generation_prompt prefill

When a lazy grammar trigger pattern matches text in the generation_prompt
(e.g. Functionary v3.2's >>>(?!all) matches >>> at the end of the prompt),
the grammar activates during prefill and crashes with 'Unexpected empty
grammar stack' because the trigger text doesn't match the grammar's
expected start.

Fix: catch the prefill exception, disable the grammar, and log a warning.
The model then generates unconstrained output, but the parser still
extracts tool calls. This is safe because:
- The trigger firing during prefill is a false positive (the trigger text
  is part of the prompt template, not model output)
- Grammar constraints are a generation optimization, not a correctness
  requirement -- the parser handles extraction
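
The catch-and-disable idea can be sketched in isolation. This is a toy
illustration only: toy_grammar, accept, and prefill_or_disable are
hypothetical names invented for this sketch, not the llama.cpp API, and
the prefix check merely stands in for a real grammar stack.

```cpp
#include <cassert>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a lazy grammar (not the real llama.cpp type).
struct toy_grammar {
    std::string expected_start;  // text the grammar root must begin with
    bool        enabled = true;  // false once constraints are switched off

    // Throws when the text cannot begin a valid parse, mimicking the crash.
    void accept(const std::string & text) const {
        if (text.compare(0, expected_start.size(), expected_start) != 0) {
            throw std::runtime_error("Unexpected empty grammar stack");
        }
    }
};

// Replays prompt-side trigger text through the grammar during prefill.
// On failure the grammar is disabled and a warning is printed, so
// generation can continue unconstrained instead of aborting.
bool prefill_or_disable(toy_grammar & g, const std::string & trigger_text) {
    try {
        g.accept(trigger_text);
        return true;
    } catch (const std::exception & e) {
        std::cerr << "warning: grammar disabled during prefill: " << e.what() << "\n";
        g.enabled = false;
        return false;
    }
}
```

With expected_start set to "<tool_call>", passing ">>>" disables the toy
grammar and returns false, while passing "<tool_call>" leaves it enabled.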

An earlier approach changed find_start_pos to not replay trigger text
through the grammar. That broke Nemotron, whose grammar root starts
with the trigger literal (<tool_call>) and needs the replay to advance
past it during generation. The catch approach is correct because it only
affects the prefill path where the trigger fires prematurely, while
leaving the generation-time replay intact.
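
Why the replay matters for a grammar whose root begins with the trigger
literal can be shown with a toy matcher. Everything here is an assumption
made for illustration: literal_prefix_grammar and constrain are
hypothetical and do not mirror find_start_pos or any real llama.cpp code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Toy grammar: the input must start with a fixed literal, then anything goes.
struct literal_prefix_grammar {
    std::string literal;
    std::size_t pos    = 0;      // how much of the literal has been consumed
    bool        failed = false;

    void feed(const std::string & text) {
        for (char c : text) {
            if (failed) {
                return;
            }
            if (pos < literal.size()) {
                if (c != literal[pos]) {
                    failed = true;
                } else {
                    pos++;
                }
            }
            // Once the literal is fully consumed, every character is accepted.
        }
    }
};

// Returns true if the output satisfies the grammar once the trigger fires.
// replay_trigger selects whether the trigger text itself is fed to the grammar.
bool constrain(const std::string & output, const std::string & trigger, bool replay_trigger) {
    const std::size_t at = output.find(trigger);
    if (at == std::string::npos) {
        return true;  // trigger never fired, so the lazy grammar stays inactive
    }
    literal_prefix_grammar g;
    g.literal = trigger;
    const std::size_t start = replay_trigger ? at : at + trigger.size();
    g.feed(output.substr(start));
    return !g.failed;
}
```

With replay the toy grammar advances past <tool_call> and accepts the
rest; without replay it sees the payload first and rejects it, which is
the failure mode described for Nemotron above.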

Verified with Qwen3.5-0.8B + Functionary v3.2 template override:
tools request returns 200 instead of crashing with 400.

Test:

  cmake -B build -DLLAMA_BUILD_TESTS=ON -DLLAMA_BUILD_TOOLS=OFF
  cmake --build build --target test-chat
  ./build/bin/test-chat
James O'Leary 2026-03-19 18:37:22 -07:00 committed by James O'Leary
parent c1b911654a
commit 370cdb9f26
2 changed files with 56 additions and 1 deletions


@@ -1774,7 +1774,7 @@ common_chat_msg common_chat_peg_parse(const common_peg_arena & src_pars
return msg;
}
throw std::runtime_error(std::string("Failed to parse input at pos ") + std::to_string(result.end) + ": " +
-input.substr(result.end));
+effective_input.substr(result.end));
}
common_chat_msg msg;


@@ -1954,6 +1954,61 @@ static void test_template_output_peg_parsers(bool detailed_debug) {
}
}
// Verify the throw path produces a readable error message, not std::out_of_range.
// #20424 introduced effective_input = generation_prompt + input, but the throw
// uses input.substr(result.end) where result.end is in effective_input space.
{
auto tmpls = common_chat_templates_ptr(
common_chat_templates_init(nullptr, read_file("models/templates/GLM-4.7-Flash.jinja")));
static common_chat_tool weather_tool{
"get_weather", "Get weather",
R"({"type":"object","properties":{"city":{"type":"string"}},"required":["city"]})",
};
common_chat_templates_inputs inputs;
inputs.tools = { weather_tool };
inputs.enable_thinking = true;
inputs.reasoning_format = COMMON_REASONING_FORMAT_AUTO;
inputs.add_generation_prompt = true;
inputs.use_jinja = true;
common_chat_msg msg;
msg.role = "user";
msg.content = "get_weather";
inputs.messages = { msg };
auto params = common_chat_templates_apply(tmpls.get(), inputs);
common_peg_arena arena;
arena.load(params.parser);
common_chat_parser_params pp(params);
// generation_prompt is non-empty for thinking models, so result.end
// will be offset by generation_prompt.size() into effective_input space.
assert(!pp.generation_prompt.empty());
std::string bad_input =
"Thinking.\n"
"</think>"
"<tool_call>get_weather"
"<arg_key>city</arg_key><arg_value>Tokyo</arg_value>"
"</tool_call>\n";
bool got_runtime_error = false;
bool got_out_of_range = false;
std::string error_msg;
try {
common_chat_peg_parse(arena, bad_input, /*is_partial=*/false, pp);
} catch (const std::out_of_range & e) {
got_out_of_range = true;
error_msg = e.what();
} catch (const std::runtime_error & e) {
got_runtime_error = true;
error_msg = e.what();
}
GGML_ASSERT(!got_out_of_range && "throw path crashed with out_of_range (input.substr in effective_input space)");
GGML_ASSERT(got_runtime_error && "throw path should produce std::runtime_error with parse position");
}
// Kimi-K2-Thinking tests - custom parser
// Unique feature: tool call ID embeds function name as functions.<name>:<counter>
{