When a lazy grammar trigger pattern matches text in the generation_prompt
(e.g. Functionary v3.2's >>>(?!all) matches the >>> at the end of the prompt),
the grammar activates during prefill and crashes with 'Unexpected empty
grammar stack' because the trigger text doesn't match the grammar's
expected start.

Fix: catch the prefill exception, disable the grammar, and log a warning.
The model then generates unconstrained, but the parser still extracts tool
calls. This is safe because:
- The trigger firing during prefill is a false positive (the trigger text
  is part of the prompt template, not model output).
- Grammar constraints are a generation optimization, not a correctness
  requirement -- the parser handles extraction.
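The false positive can be reproduced outside llama.cpp. The following is a
minimal sketch using std::regex, assuming the trigger is applied as an
unanchored search over the text. The helper trigger_matches and the sample
strings are hypothetical and are not taken from the llama.cpp sources.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: returns true if the trigger pattern matches anywhere
// in `text`. std::regex defaults to ECMAScript syntax, which supports the
// negative lookahead (?!...) used by the Functionary v3.2 trigger.
bool trigger_matches(const std::string & pattern, const std::string & text) {
    const std::regex re(pattern);
    return std::regex_search(text, re);
}
```

With the pattern ">>>(?!all)", trigger_matches returns true for an
illustrative prompt tail such as "assistant\n>>>", because nothing follows
the >>> and the lookahead therefore succeeds. It returns false for
">>>all\nHello", where >>> is followed by "all".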
An earlier approach changed find_start_pos to not replay trigger text
through the grammar. That broke Nemotron, whose grammar root starts
with the trigger literal (<tool_call>) and needs the replay to advance
past it during generation. The catch approach is correct because it only
affects the prefill path where the trigger fires prematurely, while
leaving the generation-time replay intact.
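The two outcomes of a replay can be sketched with a toy grammar. Everything
here (toy_grammar, toy_sampler, replay_with_fallback) is a hypothetical
stand-in and not the llama.cpp sampler API. The toy grammar accepts a single
literal string and does not distinguish prefill from generation; it only
illustrates catching the failure and disabling the grammar.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>

// Toy grammar (assumption): accepts exactly the characters of `expected`.
struct toy_grammar {
    std::string expected;
    std::size_t pos = 0; // number of characters consumed so far

    // Throws when a character cannot be consumed, mimicking the
    // 'Unexpected empty grammar stack' failure.
    void accept(const std::string & text) {
        for (char c : text) {
            if (pos >= expected.size() || expected[pos] != c) {
                throw std::runtime_error("Unexpected empty grammar stack");
            }
            pos++;
        }
    }
};

struct toy_sampler {
    toy_grammar grammar;
    bool        grammar_enabled = true;
};

// Replays the trigger text through the grammar. On failure, the grammar is
// disabled and a warning is printed instead of propagating the exception.
// Returns true if the grammar is still enabled afterwards.
bool replay_with_fallback(toy_sampler & s, const std::string & trigger_text) {
    if (!s.grammar_enabled) {
        return false;
    }
    try {
        s.grammar.accept(trigger_text);
    } catch (const std::exception & e) {
        std::fprintf(stderr,
                     "warning: grammar disabled after replay failure (%s)\n",
                     e.what());
        s.grammar_enabled = false;
    }
    return s.grammar_enabled;
}
```

A toy grammar whose expected text starts with the replayed trigger (for
example "<tool_call>{}" replaying "<tool_call>") consumes it and stays
enabled. One that does not start with the trigger (for example
"get_weather\n" replaying ">>>") is disabled with a warning rather than
aborting the request.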
Verified with Qwen3.5-0.8B + Functionary v3.2 template override:
a request with tools returns 200 instead of crashing with 400.

Test:
  cmake -B build -DLLAMA_BUILD_TESTS=ON -DLLAMA_BUILD_TOOLS=OFF
  cmake --build build --target test-chat
  ./build/bin/test-chat