The change introduces a maximum repetition threshold to avoid
excessive rule expansion during grammar parsing. When parsing
repetition patterns like {m,n}, the parser now calculates the
potential number of rules that would be generated and throws an error
if the product of previous rules and new rules exceeds the threshold.
A test case was added to verify the threshold is properly enforced for
deeply nested repetition patterns that would otherwise cause hangs.
This commit adds a new test case to the grammar integration tests that
specifically targets a hang scenario in the repetition grammar parser
found while adding GBNF support to ripgrep-edit.
llama-server reproducer:
curl \
-X POST \
-d '{
"messages": [{ "role": "user", "content": "write yes" }],
"grammar": "root ::= (([^x]*){0,99}){0,99}"
}' \
-H "Content-Type: application/json" \
http://localhost:8811/v1/chat/completions
This change converts the function to an iterative approach using
explicit stacks, which prevents deep recursion and eliminates the risk
of stack overflow.
rg-edit regexp: llama_grammar_advance_stack
rg-edit extra-args: -A30
rg-edit directive: """Rewrite: fix the following segfault:
[..]
⚫ Testing segfault. Grammar:
root ::= ( [x]* )*
root ::= ( [x]* )*
Segmentation fault build/bin/test-grammar-integration
convert from recursive to interactive"""
gptel-context:
(("~/llama.cpp/src/llama-grammar.cpp")
("~/llama.cpp/tests/test-grammar-integration.cpp")
("~/llama.cpp/grammars/./list.gbnf")
("~/llama.cpp/grammars/./json_arr.gbnf")
("~/llama.cpp/grammars/./json.gbnf")
("~/llama.cpp/grammars/./japanese.gbnf")
("~/llama.cpp/grammars/./english.gbnf")
("~/llama.cpp/grammars/./chess.gbnf")
("~/llama.cpp/grammars/./c.gbnf")
("~/llama.cpp/grammars/./arithmetic.gbnf")
("~/llama.cpp/grammars/./README.md"))
v2: Added a `std::set` to perform tree-based lookups with O(N log N)
complexity. Testing with a parallel run of `test-grammar-integration`
shows a double-digit percentage increase in runtime. An
`unordered_set` with O(1) hashing was also evaluated, but the overhead
of constructing hash keys from pointers made it significantly slower
than the rbtree implementation that only requires an ordering
operator. The performance regression in the test suite appears
justified by the overall reduction in algorithmic complexity.
Co-developed-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
* llama-quant : correct `n_attention_wv` usage
In #19770, I introduced a regression in the way the
`quantize_state_impl` counter values were initialized. I was
incrementing and using `n_attention_wv` in the same loop, when it should
have been fixed by the time we're deciding tensor types in
`llama_tensor_get_type_impl` (for `use_more_bits`).
I never observed a difference in any of [my
tests](https://github.com/ggml-org/llama.cpp/pull/19770#issuecomment-4000424712)
- it was only after @bartowski kindly pointed this out that I realized
it was incorrect. (Thanks!)
* simplify
* K quant speedup (#20)
* Basic JIT compilation for mul_mat, get_rows, and scale (#17)
* scale jit working
* preliminary working jit for getrows and mulmat, needs refining
* simplified mul_mat preprocessing switch statement
* get_rows fixes, mul_mat refinement
* formatted + last edits
* removed some extraneous prints
* fixed get_rows, fixed workgroup dispatch in mul_mat. no gibberish
* small fix
* some changes, working
* get_rows and mul_mat jit fixed and working
* Update formatting
* formatting
* Add header
---------
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Start work on all-encompassing shader library
* refactor argmax, set_rows
* Refactor all but flashattention, mat mul
* no gibberish, all k quants added, merged
* vec memory fix
* q6_k matching metal on my machine, tests passing
* Set tile size for q6_k separately
* Separate out fast shaders
---------
Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>
* Move towards writeBuffer for params
* Move away from multiple buffers for set_rows errors, remove host buffer for parameter buffers, minor cleanups
* Remove extra file
* Formatting
---------
Co-authored-by: neha-ha <137219201+neha-ha@users.noreply.github.com>
* Fix logic for retrieving schema items in `json_schema_to_grammar.py`
If `schema['items']` is `{}` and `prefixItems not in schema', as `{}` is Falsy, the original code here will raise an error.
I think if `schema['items']` is `{}`, them items should just be `{}`
* Apply suggestion from @CISC
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Add tests for arrays with empty items
Add two unit tests to `tests/test-json-schema-to-grammar.cpp` that validate handling of arrays when 'items' is an empty schema and when 'prefixItems' is present alongside an empty 'items'. Both tests expect the same generated grammar, ensuring the JSON Schema->grammar conversion treats an empty 'items' schema (and the presence of 'prefixItems') correctly and covering this edge case.
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Replace GGML_ABORT("fatal error") in ggml_metal_synchronize() with
error flag + return. This aligns synchronize error handling with
graph_compute, which already returns GGML_STATUS_FAILED for the same
condition.
When a command buffer fails (e.g., iOS GPU access revocation during
backgrounding, macOS eGPU disconnect, OOM), the backend enters an
error state instead of killing the host process. Subsequent
graph_compute calls return GGML_STATUS_FAILED immediately. Recovery
requires recreating the backend.
Failed extra command buffers are properly released on the error path
to avoid Metal object leaks.
* quantize : imatrix-fail early + code cleanup
* fix manual override printing
it's in the preliminary loop now, so needs to be on its own line
* revert header changes per ggerganov
* remove old #includes
* clarify naming
rename `tensor_quantization` to `tensor_typo_option` to descirbe its
functionality
* fix per barto
* Parse port numbers from MCP server URLs
* Pass scheme to http proxy for determining whether to use SSL
* Fix download on non-standard port and re-add port to logging
* add test
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Enable mul_mv_ext small-batch kernels (BS 2-8) for BF16, Q2_K,
and Q3_K quantization types. These types previously fell through
to the slower single-row mul_mv path.
BF16 uses the float4 dequantize path (like F16). Q2_K and Q3_K
use the float4x4 K-quant path (like Q4_K/Q5_K/Q6_K).
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>