llama.cpp

Commit Graph

Author	SHA1	Message	Date
Jan Patrick Lehr	a8b192b6ec	tests : enable test-chat out of tree build (#19558 ) The binary relies on model files that it tries to find. However, when configuring the build directory to be parallel to the source tree those heuristics fail. This sets the working directory for the test executable to be the source-tree which resolves this issue.	2026-02-27 05:37:54 +01:00
Maximilian Werk	66287bdaac	model : add Jina Embeddings v5 Nano (partial EuroBERT) support (#19826 ) * WIP: Add EuroBERT support with autoformatting changes This commit includes: - EuroBERT model implementation for GGUF conversion - C++ backend support for EuroBERT architecture - Unintended autoformatting changes to Python files Saving before reverting formatting-only changes. * feat: add back eos assert when not last token pooling * feat: removed duplicated code and cleanup * feat: removed not working architectures and unnecessary check * fix: typo * fix: dynamic pooling config * feat: added an example model for eurobert * feat: proper llama-vocab implementation for jina-v5 * fix: removed unnecessary comments	2026-02-26 12:14:09 +01:00
Georgi Gerganov	418dea39ce	ggml/gguf : prevent integer overflows (#19856 ) * gguf : prevent integer overflow for ggml_context mem size * ggml : fix int overflows in ggml_new_object() * gguf : prevent string exhaustion * gguf : prevent array elements exhaustion * ggml : fix negative tensor type oob * py : assert that alignment is non-zero power of 2 * ggml : check int overflow in ggml_new_tensor_impl and ggml_new_object * gguf-py : error on duplicate keys when reading * py : restore tensor_fields * enforce proper alignment in add_custom_alignment * gguf : better name * gguf : fix ctx size for no_alloc == true * gguf : minor print fix * ggml : print values when overflow * ggml : remove deprecated ggml_type_sizef() * ggml : relax ggml_type asserts to debug-only * gguf : add mem_size overflow test * gguf : add file size check for arrays * ggml : relax asseerts for ggml_get_type_traits() * flake8 fix --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-02-24 20:17:11 +02:00
Daniel Bevenius	d8aeb65cee	tests : fix typos in comments in test-backend-sampler [no ci] (#19824 ) * tests : fix typos in comments in test-backend-sampler [no ci]	2026-02-23 17:12:02 +01:00
Xuan-Son Nguyen	5452d736f8	jinja: correct stats for tojson and string filters (#19785 )	2026-02-22 21:08:23 +01:00
Aldehir Rojas	94b0200a01	common : merge qwen3-coder and nemotron nano 3 parsers (#19765 ) * common : migrate qwen3-coder to PEG parsing variant * cont : add JSON parameter test	2026-02-20 23:22:22 +01:00
Jeff Bolz	77d6ae4ac8	test: mul_mat tests with huge batch size (#19519 )	2026-02-19 20:08:25 -06:00
Jesse Posner	3dadc88b58	common : fix Step-3.5-Flash format detection and thinking support (#19635 ) * common : fix Step-3.5-Flash format detection and thinking support Step-3.5-Flash uses the same XML-style tool call format as Qwen3-Coder (<tool_call><function=...><parameter=...>) but its Jinja template lacks the bare <function> and plural <parameters> markers that the detection logic previously required. This caused it to fall through to Hermes 2 Pro, which doesn't call func_args_not_string(), so arguments stayed as JSON strings and templates using arguments|items crashed. Additionally, the Qwen3-Coder-XML format handler had no thinking support. Models like Step-3.5-Flash that unconditionally emit <think> in their generation prompt need the same thinking_forced_open handling that Nemotron v3 and Hermes 2 Pro already have, otherwise reasoning_content is never separated from content in API responses. Changes: - Relax Qwen3-Coder XML detection to only require the 3 shared markers - Tighten Nemotron v3 branch to also require bare <function> and plural <parameters>, preventing Step-3.5-Flash from being misrouted via <think> - Add thinking_forced_open support to Qwen3-Coder-XML init function - Add <think>/</think> to preserved tokens - Fix build_grammar_xml_tool_call to handle thinking_forced_open in the grammar root rule, allowing </think> before tool calls - Add Step-3.5-Flash chat template and format detection test Builds on: https://github.com/ggml-org/llama.cpp/pull/19283 * chat : route Step-3.5-Flash to Nemotron v3 PEG parser, add tests Step-3.5-Flash uses the same XML tool call format as Qwen3-Coder and Nemotron 3 Nano (<tool_call>/<function=...>/<parameter=...>) but with unconditional <think> output. Route it to the Nemotron v3 PEG parser for streaming and schema-aware parameter parsing. Detection: templates with <think> + XML tool tags use Nemotron v3 PEG parser; templates without <think> (Qwen3-Coder) use GBNF grammar. Tests cover: basic messages, tool calls with/without thinking content, parallel tool calls, code string parameters, optional </parameter> closing tags, and JSON schema response format. * chat : remove dead thinking code from qwen3_coder_xml Remove thinking handling code that became unreachable after routing Step-3.5-Flash to the Nemotron v3 PEG parser. Qwen3-Coder has no <think> in its template, so the thinking_forced_open logic, preserved tokens, and grammar prefix were dead paths.	2026-02-19 22:40:52 +01:00
Piotr Wilkin (ilintar)	8a70973557	Add Jinja support for "indent" string filter (#19529 ) * Add partial Jinja support for "indent" string filter * Fully implement indent * Add tests for all width variants. * Update tests/test-jinja.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Fix getline ignoring trailing newlines * Update common/jinja/value.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix first indent condition --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-02-19 00:25:52 +01:00
Georgi Gerganov	08e6d914b8	ggml : avoid UB in gemm ukernel (#19642 )	2026-02-15 14:56:35 +02:00
Jeff Bolz	dbb023336b	vulkan: support L2_NORM with contiguous rows (#19604 )	2026-02-14 06:42:04 +01:00
ymcki	0e21991472	fix vulkan ggml_acc only works in 3d but not 4d (#19426 ) * fix vulkan ggml_acc only works in 3d but not 4d * removed clamp in test_acc_block * use the correct stride and its test case * cuda : fix "supports op" condition * change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv's suggestion except to keep the boundary check * version without boundary check * revert back to boundary check version --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-02-13 13:31:37 +01:00
Georgi Gerganov	490eb96b88	metal : support GGML_OP_SET (#19548 )	2026-02-13 07:34:52 +02:00
Georgi Gerganov	3b3a948134	metal : update sum_rows kernel to support float4 (#19524 )	2026-02-12 11:35:28 +02:00
Georgi Gerganov	914dde72ba	ggml : unary ops support non-cont src0 + metal F16 unary ops (#19511 ) * ggml : unary ops support non-cont src0 * metal : support F16 unary ops + fix ELU	2026-02-11 18:58:43 +02:00
Georgi Gerganov	89181c0b6d	ggml : extend bin bcast for permuted src1 (#19484 ) * tests : extend bin bcast for permuted src1 * cont : extend bin support * cont : s0 is always 1 * tests : simplify	2026-02-11 07:52:00 +02:00
Georgi Gerganov	ceaa89b786	metal : consolidate unary ops (#19490 )	2026-02-11 07:51:12 +02:00
Xuan-Son Nguyen	9a96352729	test: fix IMROPE perf test case (#19465 )	2026-02-10 14:37:50 +01:00
Georgi Gerganov	a0d585537c	cuda : extend GGML_OP_PAD to work with non-cont src0 (#19429 ) * cuda : extend GGML_OP_PAD to work with non-cont src0 * tests : add permuted pad	2026-02-10 08:07:16 +02:00
Hugo	1e8924fd65	cmake : add variable to skip installing tests (#19370 ) When packaging downstream, there's usually little point in installing test. The default behaviour remains the same.	2026-02-09 07:12:02 +01:00
Jeff Bolz	db6adb3c88	tests: reduce number of FA test permutations (#19381 ) Only test non-F16 for head size 64 and 72 (one a multiple of QK, one not).	2026-02-06 08:50:30 -06:00
Jeff Bolz	449ec2ab07	vulkan: Preprocess FA mask to detect all-neg-inf and all-zero. (#19281 ) Write out a 2-bit code per block and avoid loading the mask when it matches these two common cases. Apply this optimization when the mask is relatively large (i.e. prompt processing).	2026-02-05 09:26:38 -06:00
Georgi Gerganov	eaba92c3dc	tests : add non-cont, inplace rope tests (#19296 ) * tests : add non-cont, inplace rope tests * cont : exercise dim 3 Co-authored-by: Jeff Bolz <jbolz@nvidia.com> * cont : more dim3 exercises --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2026-02-04 12:45:21 +02:00
Sid Mohan	0dfcd3b607	jinja : add missing 'in' test to template engine (#19004 ) (#19239 ) * jinja : add missing 'in' test to template engine (#19004) The jinja template parser was missing the 'in' test from global_builtins(), causing templates using reject("in", ...), select("in", ...), or 'x is in(y)' to fail with "selectattr: unknown test 'in'". This broke tool-calling for Qwen3-Coder and any other model whose chat template uses the 'in' test. Added test_is_in supporting array, string, and object containment checks, mirroring the existing 'in' operator logic in runtime.cpp. Includes test cases for all three containment types plus reject/select filter usage. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * reuse test_is_in in binary op --------- Co-authored-by: Sid Mohan <sidmohan0@users.noreply.github.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2026-02-02 21:00:55 +01:00
Aman Gupta	9f682fb640	ggml-cpu: FA split across kv for faster TG (#19209 ) * ggml-cpu: split across kv for faster TG * simplify sinks application * add ref impl	2026-02-03 01:19:55 +08:00
Christian Kastner	7a4ca3cbd9	docs : Minor cleanups (#19252 ) * Update old URLs to github.com/ggml-org/ * Bump copyrights	2026-02-02 08:38:55 +02:00
Georgi Gerganov	c3b87cebff	tests : add GQA=20 FA test (#19095 )	2026-01-30 13:52:57 +02:00
Aldehir Rojas	7b7ae857f6	chat : add parsing for solar-open-100b (#18540 ) * chat : add parsing for solar-open-100b * add comments to rules * cont : make assistant start optional * cont : remove assistant start prefix altogether --------- Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>	2026-01-29 16:06:15 +01:00
Sigbjørn Skjæret	b45ef2702c	jinja : do not pass empty tools and add some none filters (#19176 )	2026-01-29 14:06:54 +01:00
Sigbjørn Skjæret	60368e1d73	jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (#19147 ) * undefined is treated as iterable (string/array) by filters `tojson` is not a supported `undefined` filter * add tests * add sequence and iterable tests keep it DRY and fix some types	2026-01-28 14:40:29 +01:00
Sigbjørn Skjæret	2b4cbd2834	jinja : implement mixed type object keys (#18955 ) * implement mixed type object keys * add tests * refactor * minor fixes * massive refactor * add more tests * forgotten tuples * fix array/object is_hashable * correct (albeit broken) jinja responses verified with transformers * improved hashing and equality * refactor hash function * more exhausive test case * clean up * cont * cont (2) * missing cstring --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2026-01-27 19:50:42 +01:00
Johannes Gäßler	b0311c16d2	CUDA: fix padding of GQA to power of 2 in FA (#19115 )	2026-01-26 23:24:58 +01:00
Johannes Gäßler	4e5b83b226	GGUF: check that tensor size is representable (#19072 )	2026-01-24 21:57:51 +01:00
Xuan-Son Nguyen	51fa458a92	server : support preserving reasoning_content in assistant message (#18994 ) * support reasoning_content input * report template caps to webui * add docs * rm commented code	2026-01-22 21:30:06 +01:00
Georgi Gerganov	a5eaa1d6a3	mla : make the V tensor a view of K (#18986 ) * mla : pass V as a view of K to the FA op * cuda : adjust mla logic to new layout * kv-cache : fix rope shift * tests : remove comment * cuda : fix reusable_cutoff Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-01-22 22:09:01 +02:00
Piotr Wilkin (ilintar)	c301172f66	jinja: support none|string (#18995 ) * jinja: support none|string * Update common/jinja/value.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-jinja.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Add as_string() --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-01-21 19:24:37 +01:00
Jeff Bolz	33f890e579	vulkan: support flash attention GQA/split_k with small batches (#18938 )	2026-01-21 17:43:43 +01:00
Xuan-Son Nguyen	2c1f199653	cli : fix reasoning responses in CLI (#18961 ) * cli : fix reasoning responses in CLI * fix build * fix build (2)	2026-01-20 18:23:25 +01:00
Sigbjørn Skjæret	959ecf7f23	jinja : fix undefined keys and attributes and int/float as bool (#18924 ) * fix undefined keys and attributes * add falsy tests * as_bool for integers and floats * more falsy/truthy tests * --typo	2026-01-19 20:29:43 +01:00
Sigbjørn Skjæret	4037093c66	ci : run test-jinja -py on high perf [no ci] (#18916 )	2026-01-19 20:29:15 +01:00
Xuan-Son Nguyen	fe44d35574	tests : add test-jinja -py option for cross-checking (#18906 ) * tests : add test-jinja -py option or cross-checking * Update tests/test-jinja.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix + add source * SandboxedEnvironment * fix array.map case --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-01-18 08:14:27 +01:00
Sigbjørn Skjæret	d03c45c9c5	jinja : attribute support for join, map and sort (#18883 ) * support negative array index and default value * attribute support (int and str) for join, map and sort * add tests * update CODEOWNERS * improve fixme sorting comment	2026-01-18 02:53:01 +01:00
Sigbjørn Skjæret	10c98cbdf6	jinja : add missing tojson filter for bool (#18900 ) * add missing tojson for bool * add more literal tests	2026-01-18 01:05:09 +01:00
Sigbjørn Skjæret	420960ab92	jinja : fix lexing of float literals with sign (#18901 ) * fix lexing of float literals with sign * add test * consume_numeric	2026-01-18 00:57:51 +01:00
Xuan-Son Nguyen	f55b033ae6	jinja: correct member access rule (#18905 )	2026-01-18 00:48:55 +01:00
Thore Koritzius	388ce82241	ggml : extend ggml_pool_1d + metal (#16429 ) * chore: resolve conflicts * feat: ggml metal impl * fix: ggml_metal_kargs_pool_1d struct * fix: require contiguous input * chore: test pool_1d * chore: limit pool1d test cases to p0=0 and s0=k0 to conform with asserts * chore: add p0 and s0 to testing * fix: allow padding for cpu and metal * Update ggml/src/ggml-metal/ggml-metal.metal * fix: correct single-threaded loop * ggml : cleanup * tests : add ne[1] != 1 tests * fix: ne[1] handling in np * cont : fixes --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-01-16 16:59:56 +02:00
Xuan-Son Nguyen	c15395f73c	common : implement new jinja template engine (#18462 ) * jinja vm * lexer * add vm types * demo * clean up * parser ok * binary_expression::execute * shadow naming * bin ops works! * fix map object * add string builtins * add more builtins * wip * use mk_val * eval with is_user_input * render gemma tmpl ok * track input string even after transformations * support binded functions * keyword arguments and slicing array * use shared_ptr for values * add mk_stmt * allow print source on exception * fix negate test * testing more templates * mostly works * add filter_statement * allow func to access ctx * add jinja-value.cpp * impl global_from_json * a lot of fixes * more tests * more fix, more tests * more fixes * rm workarounds * demo: type inferrence * add placeholder for tojson * improve function args handling * rm type inference * no more std::regex * trailing spaces * make testing more flexible * make output a bit cleaner * (wip) redirect minja calls * test: add --output * fix crash on macro kwargs * add minimal caps system * add some workarounds * rm caps_apply_workarounds * get rid of preprocessing * more fixes * fix test-chat-template * move test-chat-jinja into test-chat-template * rm test-chat-jinja from cmake * test-chat-template: use common * fix build * fix build (2) * rename vm --> interpreter * improve error reporting * correct lstrip behavior * add tojson * more fixes * disable tests for COMMON_CHAT_FORMAT_GENERIC * make sure tojson output correct order * add object.length * fully functional selectattr / rejectattr * improve error reporting * more builtins added, more fixes * create jinja rendering tests * fix testing.h path * adjust whitespace rules * more fixes * temporary disable test for ibm-granite * r/lstrip behavior matched with hf.js * minimax, glm4.5 ok * add append and pop * kimi-k2 ok * test-chat passed * fix lstrip_block * add more jinja tests * cast to unsigned char * allow dict key to be numeric * nemotron: rm windows newline * tests ok * fix test * rename interpreter --> runtime * fix build * add more checks * bring back generic format support * fix Apertus * [json.exception.out_of_range.403] key 'content' not found * rm generic test * refactor input marking * add docs * fix windows build * clarify error message * improved tests * split/rsplit with maxsplit * non-inverse maxsplit forgot to change after simplifying * implement separators for tojson and fix indent * i like to move it move it * rename null -- > none * token::eof * some nits + comments * add exception classes for lexer and parser * null -> none * rename global -> env * rm minja * update docs * docs: add input marking caveats * imlement missing jinja-tests functions * oops * support trim filter with args, remove bogus to_json reference * numerous argument fixes * updated tests * implement optional strip chars parameter * use new chars parameter * float filter also has default * always leave at least one decimal in float string * jinja : static analysis + header cleanup + minor fixes * add fuzz test * add string.cpp * fix chat_template_kwargs * nits * fix build * revert * unrevert sorry :) * add fuzz func_args, refactor to be safer * fix array.map() * loosen ensure_vals max count condition, add not impl for map(int) * hopefully fix windows * check if empty first * normalize newlines --------- Co-authored-by: Alde Rojas <hello@alde.dev> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-01-16 11:22:06 +01:00
Adrien Gallouët	ec997b4f2b	tests : download models only when running ctest (#18843 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-01-15 09:47:29 +01:00
Oliver Simons	36f0132464	CUDA: Factor out and re-use `block_reduce` function (#18785 ) * CUDA: Refactor and expose two_stage_warp_reduce_* function * Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it Moving smem out of `__device__` function to `__global__` function allows for explicit smem reuse, as either compiler or cuda rt seem to not free it afterwards (`cudaFuncSetAttribute` fails when not accounting for it once for each call to two_stage_warp_reduce) * Update ggml/src/ggml-cuda/common.cuh Co-authored-by: Aman Gupta <amangupta052@gmail.com> * Use two_stage_warp_reduce in group_norm_f32 * Use two_stage_warp_reduce in rms_norm_f32 * Fix smem calculation which expects bytes * Make `two_stage_warp_reduce` accept all values warp_reduce accepts Also integrate it into norm_f32 function * Use two_stage_warp_reduce in l2_norm_f32 * Use type traits for block reduction for better legibility Also adresss other requests by @am17an such as variable renaming * Make norm tests cover all cuda paths * Mark columns % WARP_SIZE !=0 as supported for RMS_NORM_BACK Unit-tests passed locally, let's see if they pass in the CI as well * Use `enum class` for `block_reduce_method` This is more type-safe than plain enum * Rename variables as suggested in code review by @am17an * Rename two_stage_warp_reduce -> block_reduce * Fix trailing whitespace in common.cuh * Make condition of static_assert type-dependent This delays evaluation until the template is actually instantiated. Otherwise, some compilers may evaluate the assert when parsing the template, resulting in build errors as observed here: https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785 * Inline definitions --------- Co-authored-by: Aman Gupta <amangupta052@gmail.com>	2026-01-15 10:44:54 +08:00
Adrien Gallouët	f709c7a33f	ci, tests : use cmake to download models and remove libcurl dependency (#18791 ) * ci, tests : use cmake to download models and remove libcurl dependency * llama_dl_model -> llama_download_model * use EXPECTED_HASH for robust model downloading * Move llama_download_model to cmake/common.cmake Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-01-14 07:46:27 +01:00


661 Commits