llama.cpp

Commit Graph

Author	SHA1	Message	Date
Georgi Gerganov	8544aba37f	sampling : generic ggml op support detection	2025-12-11 13:19:43 +02:00
Georgi Gerganov	d5d16651a8	cont : fix build	2025-12-11 11:27:47 +02:00
Georgi Gerganov	54e9054017	sampling : optimize logit_bias sampler	2025-12-11 11:14:39 +02:00
Georgi Gerganov	4dff236a52	ggml : remove GGML_KQ_MASK_PAD constant (#17910) * ggml : remove GGML_KQ_MASK_PAD constant * cont : remove comment	2025-12-10 20:53:16 +02:00
Georgi Gerganov	804e7e3795	graph : respect sampler order for graph reuse	2025-12-10 20:40:15 +02:00
Georgi Gerganov	44d5c4b592	batch : fix sequence id ownage	2025-12-10 20:35:58 +02:00
Georgi Gerganov	38882247d3	Merge branch 'master' into HEAD	2025-12-10 17:07:21 +02:00
Eric Zhang	b677721819	model : Qwen3-Next-80B-A3B has 48 layers (#17898) * model : Qwen3-Next-80B-A3B has 48 layers * model : Add 80B-A3B type name	2025-12-10 15:22:40 +01:00
Georgi Gerganov	c02654eb7d	graph : make the compute graph constant with respect to active samplers	2025-12-10 16:19:18 +02:00
Georgi Gerganov	81cb5783c8	Merge branch 'master' into HEAD	2025-12-10 13:41:32 +02:00
Georgi Gerganov	34b407b41c	sampling : use host buffer type for inputs	2025-12-09 17:53:17 +02:00
Georgi Gerganov	92ff767918	llama : require backend samplers to be of type llama_sampler_chain	2025-12-09 15:38:37 +02:00
Rhys-T	63908b631a	cmake: fix Mach-O current version number (#17877) PR #17091 set the VERSION of various libraries to 0.0.abcd, where abcd is the LLAMA_BUILD_NUMBER. That build number is too large to fit in the Mach-O 'current version' field's 'micro' part, which only goes up to 255. This just sets the Mach-O current version to 0 to get it building properly again. Fixes #17258.	2025-12-09 13:17:41 +02:00
Sigbjørn Skjæret	42b12b5608	model : nit, DeepSeek V1 MoE is 16B and GigaChat is 20B (#12652) * nit, DeepSeek V1 MoE is 16B * base type on n_ff_exp instead	2025-12-09 12:15:06 +01:00
Georgi Gerganov	560ac16f7d	server : handle unsupported cases	2025-12-09 10:55:11 +02:00
Aldehir Rojas	e39502e74b	llama : add token matching support to llama-grammar (#17816) * llama : add token support to llama-grammar * fix inverse token comment * refactor trigger_patterns to replay tokens instead of the entire string * add token documentation * fix test-llama-grammar * improve test cases for tokens	2025-12-09 00:32:57 -06:00
philip-essential	1d2a1ab73d	model : support Rnj-1 (#17811) * add support for rnj1 * refactor gemma3 to support rnj-1 * address review comments	2025-12-09 04:49:03 +01:00
Sigbjørn Skjæret	c8554b66e0	graph : use fill instead of scale_bias in grouped expert selection (#17867) * use fill instead of scale_bias in grouped expert selection * do not explicitly use _inplace	2025-12-08 21:29:59 +01:00
Georgi Gerganov	f3beb22b17	sampling : handle n_probs case	2025-12-08 21:30:10 +02:00
Georgi Gerganov	6d38db5dfe	Merge branch 'master' into HEAD	2025-12-08 17:55:24 +02:00
Piotr Wilkin (ilintar)	e4e9c4329c	Make graph_max_nodes vary by ubatch size (#17794) * Make graph_max_nodes vary by ubatch size for models where chunking might explode the graph * Update src/llama-context.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Add missing const --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-08 14:32:41 +01:00
Xuan-Son Nguyen	4d3726278b	model: add llama 4 scaling for mistral-large (deepseek arch) (#17744)	2025-12-07 22:29:54 +01:00
Georgi Gerganov	72e3681073	sampling : fix top-p	2025-12-07 17:11:50 +02:00
Georgi Gerganov	8ef5f900db	cont : fixes	2025-12-07 15:45:00 +02:00
Georgi Gerganov	fdac9686f7	Merge branch 'master' into HEAD	2025-12-06 16:55:33 +02:00
Georgi Gerganov	30742a6ff5	sampling : expand support (wip)	2025-12-06 16:51:56 +02:00
Daniel Bevenius	444f00b0ec	llama : remove quantization sanity check (#17788) * llama : remove quantization sanity check This commit removes the quantization sanity check for attention layers. The motivation for this is that there are hybrid models that have recurrent layers, expert layers, and attention layers. For these models the current check fails because the expert layers are not taken into account. After consideration, it was decided that this check is not strictly necessary and can be removed to allow for more flexible model architectures. * llama : remove unused pruned_attention_w and is_clip_model vars	2025-12-06 12:26:20 +01:00
Oliver Simons	7668999518	Merge branch 'master' into gpu-sampling Let's keep `master`'s cumsum implementation for its likely better AMD perf and add back the pure-CUB implementation in a follow-up commit	2025-12-05 14:41:08 +01:00
Georgi Gerganov	cf74b1a8ec	sampling : fix candidates logic	2025-12-05 14:24:28 +02:00
Pascal	1be97831e4	fix: prevent segfault in tokenizer on highly repetitive input (#17786) Add nosubs|optimize flags to std::regex constructors to prevent catastrophic backtracking when processing prompts with repeated identical characters (e.g., 'A' * 10000). The nosubs flag disables subgroup capture, significantly reducing memory usage and backtracking on uniform token sequences	2025-12-05 13:52:23 +02:00
Georgi Gerganov	7864074fdb	sampling : fix outputs and device checks	2025-12-04 19:33:01 +02:00
Georgi Gerganov	6958d41366	sampling : check backend support during init	2025-12-04 17:29:08 +02:00
Georgi Gerganov	1bde70785d	sampling : remove redundant calls to ggml_build_forward_expand	2025-12-04 14:25:28 +02:00
Georgi Gerganov	fce571ee51	sampling : simplify temp sampling	2025-12-04 14:23:02 +02:00
Daniel Bevenius	ac9e164714	sampling : fix backend temp sampling to use logits masking	2025-12-04 09:39:20 +01:00
Georgi Gerganov	a67ef0f47f	llama : fix sanity checks during quantization (#17721)	2025-12-04 10:33:42 +02:00
Daniel Bevenius	10bd640aae	Revert "sampling : stop short if backend sampler sampled a token" This reverts commit `87b2719eca`.	2025-12-04 08:26:33 +01:00
Daniel Bevenius	c0b182f4d6	Merge remote-tracking branch 'upstream/master' into backend-sampling	2025-12-04 08:17:50 +01:00
Daniel Bevenius	87b2719eca	sampling : stop short if backend sampler sampled a token This commit modifies the graph building logic to immediately continue when a token has already been sampled by the backend sampler. It also updates the test for backend temperature sampling to include top-k and distribution samplers in the chain to verify that they are not producing any logits (they are not run).	2025-12-04 08:13:49 +01:00
Georgi Gerganov	cce3b2a8ad	sampling : minor cleanup	2025-12-03 15:39:44 +02:00
Herman Semenoff	37adc9c6ba	ggml, llama : use defaulted constructors/destructors (#17649)	2025-12-03 07:12:18 +01:00
Daniel Bevenius	aad5a6afd7	sampling : implement temp_ext_backend sampling This commit implements the apply function for the extended temperature sampling.	2025-12-02 17:26:04 +01:00
Adrien Gallouët	f3a9674ae8	llama : fix signed comparison warning on FreeBSD (#17497) This ensures correct RLIM_INFINITY handling and compatibility on all platforms (32/64-bit). warning: comparison of integers of different signs: 'rlim_t' (aka 'long') and 'size_t' (aka 'unsigned long') [-Wsign-compare] 488 | if (suggest && (lock_limit.rlim_max > lock_limit.rlim_cur + size)) { | ~~~~~~~~~~~~~~~~~~~ ^ ~~~~~~~~~~~~~~~~~~~~~~~~~~ Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2025-12-02 12:05:38 +01:00
Daniel Bevenius	db8972e251	squash! sampling : fix backend temp sampler for zero temperature This modifies the parent commit to simply return the most probable token instead of masking the logits.	2025-12-02 11:53:29 +01:00
Daniel Bevenius	3e9a258c14	Merge remote-tracking branch 'upstream/master' into gpu-sampling	2025-12-02 09:26:04 +01:00
Daniel Bevenius	739b597804	sampling : fix backend temp sampler for zero temperature This commit fixes the implementation of the temperature-based sampler for the case when the temperature is set to zero. This now correctly selects the most probable token by masking out all other tokens in the logits.	2025-12-02 09:13:07 +01:00
Piotr Wilkin (ilintar)	746f9ee889	Override SSM_A op for Qwen3 Next to reduce splits (#17587) * Override SSM_A op for Qwen3 Next to reduce splits * New tensor mapping SSM_A_NOSCAN for SSM_A used outside of OP_SSM_SCAN context. * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-02 00:43:13 +01:00
Gilad S.	00c361fe53	fix: llama arch implementation (#17665)	2025-12-01 21:21:13 +01:00
Georgi Gerganov	88cca45bb8	sampling : fix top_p empty condition	2025-12-01 18:02:34 +02:00
Georgi Gerganov	04f2822a86	sampling : do not create empty samplers	2025-12-01 17:52:07 +02:00
Georgi Gerganov	4032ce2378	common : simplify sampler chain initialization	2025-12-01 17:11:11 +02:00
Oliver Simons	217469f07f	Make backend's top_p sampler inclusive In addition to matching the algorithm proposed in the original [paper](https://arxiv.org/abs/1904.09751), this resolves the edge case where `max_p > top_p` for a single logit, where the mask would otherwise be empty (and we would thus sample from the whole vocabulary with equal likelihood)	2025-12-01 15:28:06 +01:00
Oliver Simons	ae0bb6a6da	Factor out `ggml_sort` into its own function	2025-12-01 15:28:06 +01:00
Georgi Gerganov	16451d6bc3	Merge branch 'master' into HEAD	2025-12-01 14:47:50 +02:00
Xuan-Son Nguyen	cd3c118908	model: support Ministral3 (#17644) * conversion script * support ministral 3 * maybe this is better? * add TODO for rope_yarn_log_mul * better ppl (tested on 14B-Instruct) * Add Ministral3 support to Mistral format * improve arch handling * add sizes * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * nits --------- Co-authored-by: Julien Denize <julien.denize@mistral.ai> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-01 12:26:52 +01:00
Oliver Simons	8bee483c97	Fix backend_top_p_sampler softmax(softmax) will return uniform distribution, so we should not return the softmax but the logits instead.	2025-12-01 12:07:30 +01:00
Aman Gupta	6eea666912	llama-graph: avoid expand_forward for fusion (#17633)	2025-12-01 11:12:48 +02:00
Daniel Bevenius	cf0e1475c5	sampling : lower log level for output buffer reallocations [no ci] This commit changes the logging level for output buffer reallocations in the llama_context::output_reserve function from INFO to DEBUG. The motivation for this is that it currently logs to info and when enabling verbose logging for llama-cli this will get mixed with the output, for example: ```console What is the capital of Sweden?output_reserve: reallocating output buffer from size 0.58 MiB to 1.74 MiB 1. Stockholm 2. Helsinki Based are the options 1. Stockholm Explanation: Stockholm is the capital of ... ```	2025-12-01 09:13:47 +01:00
Georgi Gerganov	80742cbaeb	cont : naming	2025-11-30 11:24:30 +02:00
Georgi Gerganov	c187003d81	llama : naming	2025-11-30 00:05:47 +02:00
Georgi Gerganov	1760bd69b3	llama : reserve graphs with samplers	2025-11-29 23:57:25 +02:00
Georgi Gerganov	ff7b0bf632	llama : call backend_init once	2025-11-29 23:09:53 +02:00
Georgi Gerganov	d8d98bb4bb	Merge branch 'master' into HEAD	2025-11-29 22:38:44 +02:00
Georgi Gerganov	9028ebfea8	llama : cleanup + naming	2025-11-29 22:37:07 +02:00
Georgi Gerganov	fbc8f49f3c	llama : simplify	2025-11-29 17:01:00 +02:00
Diego Devesa	e072b2052e	ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched (#17276) * ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched Enabled in ggml-ci for testing. * llama : update worst-case graph for unified cache * ci : disable op offload in some tests * fix spelling --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-28 17:33:23 +02:00
Georgi Gerganov	2464d1b3fc	sampling : simplify	2025-11-28 17:21:12 +02:00
Daniel Bevenius	8cac9dee45	sampling : use logits directly for min-p filtering	2025-11-28 16:12:05 +01:00
Oliver Simons	333da805fe	Add initial version for top-p sampling As we only support static graphs for the time being and we don't know the size of the output of top-p, we have to do value-scaling, same as for the min-p operator. Further improvements can be applied to the unit test (i.e. check for equivalence of top_p happening on backend with top_p happening on cpu) and also by constructing candidates and sorting those as opposed to reversing the sort of the logits (this would be arange + get_rows instead of argsort + get_rows)	2025-11-28 15:16:20 +01:00
Georgi Gerganov	117e2079a9	refactor : simplify and improve memory management	2025-11-28 16:09:42 +02:00
Daniel Bevenius	459b7ae7b9	squash! sampling : support intermixed backend/cpu samplers Fix llama-save-load-state which currently fails by handling the case when batch.logits is nullptr (like when loading state) by allocating space for all outputs as CPU logits.	2025-11-28 13:50:47 +01:00
Piotr Wilkin (ilintar)	ff55414c42	model : Qwen3 Next (#16095) * Qwen3 Next - cleaned up version * Whitespaces and stuff * Correct minor errors * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Misc. fixes. * Clean up code, add missing hybrid qualifier * Did someone transpose the SOLVE_TRI result matrix? Perhaps... * Whitespace * Proper tensors for cb calls * Use llama-graph.h vertical alignment * BROKEN: chunking * Set new tensors as inputs. * Proper chunk logic * It's the circle of life... * More shenanigans for n_seq > 1 * Nail in the coffin? * Fix Windows build * Eh, one fails on Windows, the other fails on Mac... just use general capture. * quant : cleanup * model : cleanup * qwen3 : cleanup * cont : cleanup * cont : cleanup * ggml : revert change * qwen3 : cleanup * cont : cleanup * Readd cmath * qwen3 : fix typo * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Usual suspects * fix my bad suggestion --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-28 12:02:56 +01:00
Daniel Bevenius	9ad6522be6	squash! sampling : support intermixed backend/cpu samplers Add check that logits is not null, which can happen for embeddings.	2025-11-28 08:57:48 +01:00
Daniel Bevenius	74be332e24	sampling : support intermixed backend/cpu samplers This commit updates the backend sampling implementation to support intermixed usage of backend and CPU samplers within the same batch. The initial implementation was developed as an all-or-nothing solution: either perform backend sampling for the entire batch, or perform CPU sampling for the entire batch. The motivation for this change is to support batches with mixed sequences. For example, we may have a backend sampler configured for sequence 0, while sequence 1 in the same batch uses CPU sampling. This was not supported in the initial implementation. This issue manifested in llama-server with the webui: decoding with backend samplers would work initially, but after changing to CPU sampling, a slot (sequence) could still be using a backend sampler. This meant that logits in output_reserve would not be allocated, resulting in an error. The solution in this commit inspects the batch to determine which sampling modes are needed and allocates buffers accordingly. However, there is a known inefficiency: when we have intermixed backend/CPU samplers in the same batch, we currently copy all logits to the host, even for sequences using backend samplers. Added test_backend_cpu_mixed_batch to verify correct behavior with mixed backend/CPU samplers in a single batch, including dynamic sampler switching between decode calls.	2025-11-28 08:38:05 +01:00
Georgi Gerganov	c386114922	arch : add description about LLM_TENSOR_INFOS (#17550)	2025-11-27 16:34:13 +02:00
Georgi Gerganov	6783b11fb0	models : fix LFM2 tensors (#17548)	2025-11-27 16:04:29 +02:00
Daniel Bevenius	172208afbf	sampling : add comments about backend sampler [no ci] This commit adds a comment to llama_context's constructor explaining why backend samplers are initialized early in the process.	2025-11-27 14:59:52 +01:00
Daniel Bevenius	d9d736102b	sampling : use argmax for min-p sampling	2025-11-27 07:38:44 +01:00
Daniel Bevenius	b45d504e70	sampling : add min-p backend sampler	2025-11-26 10:50:58 +01:00
Daniel Bevenius	ec047e12ee	Merge remote-tracking branch 'upstream/master' into backend-sampling	2025-11-25 15:16:44 +01:00
Georgi Gerganov	583cb83416	ggml : add ggml_top_k (#17365) * ggml : add ggml_top_k * cont : add ggml_argsort_top_k * metal : add top_k support * ggml : cleanup * tests : add virtual err() function for test_case * ggml : add comments	2025-11-25 15:31:43 +02:00
Daniel Bevenius	2b4c7927ee	Merge remote-tracking branch 'upstream/master' into backend-sampling	2025-11-25 06:10:33 +01:00
Aaron Teo	877566d512	llama: introduce support for model-embedded sampling parameters (#17120)	2025-11-25 09:56:07 +08:00
Daniel Bevenius	134e6940ca	llama : skip output reordering for single token batches (#17466) This commit adds a check to skip the output reordering logic when n_outputs == 1. With a single output token, the data is trivially sorted and the reordering code is currently doing unnecessary work (resetting and rebuilding output_ids to the same values). The motivation for this change is improved code clarity and avoiding confusion when debugging. While the performance impact is probably negligible, this unnecessary work happens on every decode call in llama-server when processing batches with single-token outputs.	2025-11-24 21:06:17 +01:00
Daniel Bevenius	a02adf4211	sampling : add assertions for contiguous tensors in async copy functions	2025-11-24 21:01:06 +01:00
Georgi Gerganov	883a87043a	samplers : add missing cont	2025-11-24 21:46:57 +02:00
Daniel Bevenius	25f33806d3	sampling : add debug log when backend sampler selects token This commit adds a debug log statement in the llama_sampler_sample to indicate when a backend sampler has selected a token for a given index. The modification helps in tracing the sampling process and understanding the flow of control when backend samplers are used.	2025-11-24 15:03:41 +01:00
Daniel Bevenius	8eb9b4769d	sampling : remove redundant checks for stride and size [no ci]	2025-11-24 13:53:29 +01:00
Daniel Bevenius	4a90583d7d	sampling : cleanup and clarify output_reserve	2025-11-24 13:26:18 +01:00
Daniel Bevenius	7816f0bb56	Merge remote-tracking branch 'upstream/master' into backend-sampling	2025-11-24 07:44:06 +01:00
william pan	4902eebe33	models : Added support for RND1 Diffusion Language Model (#17433) * Converted RND1 model to GGUF weights * RND1 llama.cpp support v1 * RND1 llama.cpp support v2 non causal bug * RND1 llama.cpp support v3 documentation * RND1 llama.cpp support v4 clean code * linting issues * RND1 pr fixes v1 * RND1 pr fixes v2 Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Diffusion documentation edits --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-24 14:16:56 +08:00
Daniel Bevenius	9e273f7aa4	sampling : fix copying both sampled tokens and logits/probs from backend This commit fixes the issue where both sampled tokens and logits/probs were not being copied correctly from the backend to the host when multiple backend samplers were used. A test for this scenario has also been added to ensure that both types of data are copied correctly when different backend samplers are employed.	2025-11-23 13:12:01 +01:00
Daniel Bevenius	ae23d2d2c1	sampling: clarify candidate ids usage in comments	2025-11-23 11:28:19 +01:00
Daniel Bevenius	65500d05ab	sampling : add stride variable for clarity	2025-11-23 11:27:54 +01:00
Daniel Bevenius	79b8cf2a75	Merge remote-tracking branch 'upstream/master' into backend-sampling	2025-11-21 16:38:32 +01:00
ubergarm	23bc779a6e	model : detect GigaChat3-10-A1.8B as deepseek lite (#17420) * Detect GigaChat3-10-A1.8B as deepseek lite Hardcodes checking number of layers to detect if lite version of deepseek. * Add comment identifying deepseek lite variants deepseek lite variants include DeepSeek-V2-Lite, GigaChat3-10B-A1.8B	2025-11-21 14:51:38 +01:00
Daniel Bevenius	61ffe41dc1	sampling : use pinned memory for backend sampling buffers	2025-11-21 14:02:16 +01:00
Xuan-Son Nguyen	054a45c3d3	grammar: fix regression caused by #17381 (#17412) * grammar: fix regression caused by #17381 * more readable	2025-11-20 18:35:10 +01:00
Daniel Bevenius	0d28b16bdc	sampling : introduce sampling_info struct This commit introduces a sampling_info struct to encapsulate all backend sampling related data within the llama_context class. It also updates to use more descriptive names for sampled tokens and candidates in the backend sampler ggml data structure.	2025-11-20 14:45:56 +01:00
Piotr Wilkin (ilintar)	92c0b387a9	grammar : fix integer overflow (#17381) * Fix DoS / integer overflow * Remove optional, use INT64_MAX instead as placeholder value (it's technically -1, so it fits :) * White space * Actually, since it's unsigned, use UINT64_MAX	2025-11-20 14:47:04 +02:00
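
Note on commit 217469f07f (inclusive top_p): the edge case in that message can be illustrated with a small CPU-side sketch. This is not the backend sampler from the branch and does not use any ggml or llama.cpp API; `top_p_mask` and its `inclusive` flag are hypothetical names introduced only for illustration, and the input is assumed to be an already normalized probability vector. With an exclusive cutoff, a single token whose probability exceeds `top_p` leaves the mask empty; with an inclusive cutoff, the token that crosses the threshold is kept.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper (not part of llama.cpp): builds a keep-mask for
// nucleus (top-p) filtering over a normalized probability vector.
std::vector<bool> top_p_mask(const std::vector<float> & probs, float top_p, bool inclusive) {
    // Token indices ordered by descending probability.
    std::vector<std::size_t> order(probs.size());
    std::iota(order.begin(), order.end(), std::size_t{0});
    std::stable_sort(order.begin(), order.end(),
                     [&](std::size_t a, std::size_t b) { return probs[a] > probs[b]; });

    std::vector<bool> keep(probs.size(), false);
    float cum = 0.0f;
    for (std::size_t idx : order) {
        const float before = cum; // mass accumulated before this token
        cum += probs[idx];
        // Exclusive: keep a token only while the running mass stays within top_p.
        // Inclusive: also keep the token that crosses the threshold.
        const bool in_nucleus = inclusive ? (before < top_p) : (cum <= top_p);
        if (!in_nucleus) {
            break;
        }
        keep[idx] = true;
    }
    return keep;
}
```

For probabilities {0.05, 0.9, 0.05} and top_p = 0.5, the exclusive variant keeps nothing, while the inclusive variant keeps only the 0.9 token.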


818 Commits