Commit Graph

113 Commits

Author SHA1 Message Date
Daniel Bevenius c5d44b8525
llama : fix typo in comment [no ci] 2025-12-17 09:02:30 +01:00
Daniel Bevenius ad1b60abc4
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-12-16 09:45:08 +01:00
Georgi Gerganov c560316440
graph : reuse SSM graphs (#16490)
* graph : reuse hybrid graphs

* graph : reuse recurrent graphs

* graph : fix reuse check for recurrent inputs

* memory : move the recurrent state into the memory context

* Revert "memory : move the recurrent state into the memory context"

This reverts commit 00f115fe810815d4a22a6dee0acc346131e970e1.

* cont : fix build
2025-12-16 09:36:21 +02:00
Daniel Bevenius 2995341730
llama : add support for NVIDIA Nemotron 3 Nano (#18058)
* llama : add support for NVIDIA Nemotron Nano 3

This commit adds support for the NVIDIA Nemotron Nano 3 model, enabling
conversion and inference for this model.

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-16 07:19:26 +01:00
Georgi Gerganov 0086c246ee
Merge branch 'master' into HEAD 2025-12-14 16:44:30 +02:00
Xuan-Son Nguyen 0759b09c90
graph: add f_attn_temp_offset (#18025) 2025-12-14 13:05:59 +01:00
Georgi Gerganov 609a2d0268
models : fix YaRN regression + consolidate logic (#18006)
* models : fix YaRN regression + consolidate logic

* cont : fix the fix

* cont : remove header

* cont : add header
2025-12-14 08:34:56 +02:00
Georgi Gerganov 7bed317f53
models : fix the attn_factor for mistral3 graphs + improve consistency (#17945)
* models : fix the attn_factor for mistral3 graphs

* cont : rework attn_factor correction logic

* cont : make deepseek2 consistent

* cont : add TODO

* cont : special-case DSv2

* cont : revert Mistral 3 Large changes

* cont : fix DS2 to use the original attn_factor

* cont : minor comments
2025-12-12 17:12:40 +02:00
Georgi Gerganov 4d10b78e23
Merge branch 'master' into HEAD 2025-12-11 14:42:56 +02:00
Georgi Gerganov 4dff236a52
ggml : remove GGML_KQ_MASK_PAD constant (#17910)
* ggml : remove GGML_KQ_MASK_PAD constant

* cont : remove comment
2025-12-10 20:53:16 +02:00
Georgi Gerganov 804e7e3795
graph : respect sampler order for graph reuse 2025-12-10 20:40:15 +02:00
Georgi Gerganov c02654eb7d
graph : make the compute graph constant with respect to active samplers 2025-12-10 16:19:18 +02:00
Georgi Gerganov 81cb5783c8
Merge branch 'master' into HEAD 2025-12-10 13:41:32 +02:00
Sigbjørn Skjæret c8554b66e0
graph : use fill instead of scale_bias in grouped expert selection (#17867)
* use fill instead of scale_bias in grouped expert selection

* do not explicitly use _inplace
2025-12-08 21:29:59 +01:00
Georgi Gerganov 8ef5f900db
cont : fixes 2025-12-07 15:45:00 +02:00
Georgi Gerganov 30742a6ff5
sampling : expand support (wip) 2025-12-06 16:51:56 +02:00
Georgi Gerganov 7864074fdb
sampling : fix outputs and device checks 2025-12-04 19:33:01 +02:00
Daniel Bevenius 10bd640aae
Revert "sampling : stop short if backend sampler sampled a token"
This reverts commit 87b2719eca.
2025-12-04 08:26:33 +01:00
Daniel Bevenius 87b2719eca
sampling : stop short if backend sampler sampled a token
This commit modifies the graph building logic to immediately continue
when a token has already been sampled by the backend sampler.

It also updates the test for backend temperature sampling to include
top-k and distribution samplers in the chain, verifying that they do not
produce any logits (they are not run).
2025-12-04 08:13:49 +01:00
Georgi Gerganov 4032ce2378
common : simplify sampler chain initialization 2025-12-01 17:11:11 +02:00
Georgi Gerganov 16451d6bc3
Merge branch 'master' into HEAD 2025-12-01 14:47:50 +02:00
Xuan-Son Nguyen cd3c118908
model: support Ministral3 (#17644)
* conversion script

* support ministral 3

* maybe this is better?

* add TODO for rope_yarn_log_mul

* better ppl (tested on 14B-Instruct)

* Add Ministral3 support to Mistral format

* improve arch handling

* add sizes

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* nits

---------

Co-authored-by: Julien Denize <julien.denize@mistral.ai>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-01 12:26:52 +01:00
Aman Gupta 6eea666912
llama-graph: avoid expand_forward for fusion (#17633) 2025-12-01 11:12:48 +02:00
Georgi Gerganov c187003d81
llama : naming 2025-11-30 00:05:47 +02:00
Georgi Gerganov 1760bd69b3
llama : reserve graphs with samplers 2025-11-29 23:57:25 +02:00
Georgi Gerganov ff7b0bf632
llama : call backend_init once 2025-11-29 23:09:53 +02:00
Georgi Gerganov 9028ebfea8
llama : cleanup + naming 2025-11-29 22:37:07 +02:00
Daniel Bevenius ec047e12ee
Merge remote-tracking branch 'upstream/master' into backend-sampling 2025-11-25 15:16:44 +01:00
Georgi Gerganov 583cb83416
ggml : add ggml_top_k (#17365)
* ggml : add ggml_top_k

* cont : add ggml_argsort_top_k

* metal : add top_k support

* ggml : cleanup

* tests : add virtual err() function for test_case

* ggml : add comments
2025-11-25 15:31:43 +02:00
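A hedged sketch of using the new op; the helper name below is illustrative, but ggml_top_k itself is the function added by this commit:
```cpp
#include "ggml.h"

// Illustrative helper: per-row top-k selection over a scores matrix.
static struct ggml_tensor * select_top_k(
        struct ggml_context * ctx,
        struct ggml_tensor  * scores, // [n_expert, n_tokens], F32
        int k) {
    // ggml_top_k returns GGML_TYPE_I32 indices of the k largest values
    // in each row, shape [k, n_tokens]
    return ggml_top_k(ctx, scores, k);
}
```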
Daniel Bevenius 0d28b16bdc
sampling : introduce sampling_info struct
This commit introduces a sampling_info struct to encapsulate all
backend-sampling related data within the llama_context class.

It also adopts more descriptive names for the sampled tokens and
candidates in the backend sampler ggml data structure.
2025-11-20 14:45:56 +01:00
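The struct itself is not shown in the log; a rough sketch of what it might hold, with field names only inferred from the commit message:
```cpp
// Field names are hypothetical, inferred from the commit message above.
struct sampling_info {
    struct ggml_tensor * sampled_tokens = nullptr; // token ids already picked on the backend
    struct ggml_tensor * candidates     = nullptr; // filtered logits/probabilities left for CPU samplers
};
```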
Daniel Bevenius 0da7e7dccc
sampling : remove version from sampler chain
This commit removes the version field from the sampler chain and instead
uses the sampler pointer itself for change detection.
2025-11-19 06:59:03 +01:00
Georgi Gerganov 4b52e59903
graph : do not include llama-model.h 2025-11-18 13:53:25 +02:00
Daniel Bevenius 7884b0e0ac
sampling : add support for backend sampling
This commit adds support for performing sampling operations on the
backend (e.g. GPU) as part of the model computation graph.

The motivation for this feature is to allow some or all of the sampling
to be performed directly on the backend, as part of the computation
graph being executed.

For example, the backend sampler chain might select/sample a token
directly, in which case only the sampled token needs to be transferred
from device memory to host memory.

It is also possible for the backend samplers to perform filtering of
the logits, or to compute and filter the probability distribution, in
which case only the filtered logits or probabilities need to be
transferred back to system memory for further processing by CPU
samplers.

Currently, backend sampling works in a similar manner to pooling: it is
a function called by build_graph, and the sampler operations become part
of the model's computation graph.
2025-11-17 16:15:58 +01:00
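As a rough illustration of the idea (not the actual branch API), a greedy sampler can be appended to the graph so that only int32 token ids need to cross the device/host boundary; the helper name here is an assumption:
```cpp
#include "ggml.h"

// Hypothetical helper: append a greedy "sampler" to an existing graph.
static struct ggml_tensor * append_greedy_sampler(
        struct ggml_context * ctx0,
        struct ggml_cgraph  * gf,
        struct ggml_tensor  * logits) { // [n_vocab, n_tokens]
    // argmax over the vocab dimension yields one int32 id per token, so
    // only the ids need to be copied from device memory to host memory
    struct ggml_tensor * token = ggml_argmax(ctx0, logits);
    ggml_build_forward_expand(gf, token);
    return token;
}
```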
Aman Gupta a90eb94ca9
CUDA: fuse rope + set_rows (#16884)
* CUDA: add fused rope

* move k forward_expand up

* create helper function instead of re-using params

* make assert statement more in line with comment

* rope_norm: coalesced writes to global mem
2025-11-13 08:50:01 +08:00
Sigbjørn Skjæret 9008027aa3
hparams : add n_embd_inp() to support extended embed (#16928)
* add n_embd_full to support extended embed

* don't change output

* rename to n_embd_inp

* restore n_embd where applicable
2025-11-07 19:27:58 +01:00
Jan Boon d7395115ba
llama : use std::abs instead of abs (#16853) 2025-10-30 08:30:58 +02:00
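For context, the pitfall behind this one-liner: the C abs() takes an int, so passing a float silently truncates. A minimal illustration:
```cpp
#include <cstdlib> // in C, abs() takes an int
#include <cmath>   // std::abs overloads for floating point

int main() {
    float x = -0.25f;
    // abs(x) may bind to abs(int), converting x to 0 first; whether a
    // global floating-point overload exists is implementation-defined,
    // so the portable spelling is std::abs:
    float y = std::abs(x); // 0.25f
    return (int) y;
}
```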
Sigbjørn Skjæret f696428ce8
graph : add clamping to ffn_moe_weights_sum to avoid div-by-zero (#16655)
* add missing norm topk bias

* use clamping instead, update number and add comment
2025-10-26 17:20:32 +01:00
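A sketch of the guard under stated assumptions: weights holds the selected expert weights per token, and the clamp bound shown (smallest normal f16) may differ from the value chosen in the PR:
```cpp
#include <cmath>   // INFINITY
#include "ggml.h"

// Clamping the denominator away from zero avoids 0/0 when every
// selected expert weight underflows to 0.
static struct ggml_tensor * normalize_expert_weights(
        struct ggml_context * ctx0,
        struct ggml_tensor  * weights) { // [n_expert_used, n_tokens]
    struct ggml_tensor * weights_sum = ggml_sum_rows(ctx0, weights); // [1, n_tokens]
    weights_sum = ggml_clamp(ctx0, weights_sum, 6.103515625e-5f, INFINITY);
    return ggml_div(ctx0, weights, weights_sum); // broadcasts over the rows
}
```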
Aman Gupta f77c13b91f
CUDA: General GEMV fusion (#16715) 2025-10-26 19:28:04 +08:00
Sigbjørn Skjæret 84bf3c6778
model : add BailingMoeV2 support (#16063)
* add BailingMoeV2 support

* update llm types

* undo

* undo

* update llm types

* add model collection link

* update

* almost working

* correct group selection and rename n_group_exp

* avoid large top_k and use argmax instead for now

if we had something like argmax2 that would be equivalent, but this works fine until then

* poke

* skip group selection when there are no tokens

* fix 1T conversion

* hopefully fixed expert group selection

third time's the charm?

* make expert group selection generally available

The new LLaDA2Moe model uses this method too, so make it generally available regardless of architecture.

* allow n_expert_groups to be 1 (Kimi K2)

* address review suggestions
2025-10-20 21:38:20 +02:00
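A scalar reference of the selection scheme, assuming the DeepSeek-style rule where each group is scored by the sum of its top-2 expert scores; the graph version masks out all experts of the unselected groups before the final top-k:
```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Score each expert group by its top-2 expert scores (assumes at least
// two experts per group; unselected groups get their experts masked).
static std::vector<float> group_scores(const std::vector<float> & scores, int n_group) {
    const int per_group = (int) scores.size() / n_group;
    std::vector<float> out(n_group);
    for (int g = 0; g < n_group; ++g) {
        std::vector<float> grp(scores.begin() + (size_t) g * per_group,
                               scores.begin() + (size_t) (g + 1) * per_group);
        std::partial_sort(grp.begin(), grp.begin() + 2, grp.end(), std::greater<float>());
        out[g] = grp[0] + grp[1];
    }
    return out;
}
```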
Georgi Gerganov e60f241eac
metal : FA support F32 K and V and head size = 32 (#16531)
* metal : FA support F32 K and V and head size = 32

* graph : remove obsolete comment [no ci]
2025-10-13 23:07:57 +03:00
Georgi Gerganov e38b7c6e9e
graph : support cacheless embeddings with FA and iSWA (#16528)
* graph : support cacheless embeddings with FA and iSWA

* cont : deduplicate mask creation

* cont : fix name
2025-10-13 22:42:37 +03:00
Saba Fallah e08db42595
model: EmbeddingGemma Adding Support for SentenceTransformers Dense Modules (#16367)
* model: EmbeddingGemma sentence-transformers dense linear projections support

* model: add support for EmbeddingGemma SentenceTransformers dense linear projections

Adding support for the Dense modules used in EmbeddingGemma models.
EmbeddingGemma is a SentenceTransformers model with additional modules beyond the base Transformer backbone.

See: https://developers.googleblog.com/en/gemma-explained-embeddinggemma-architecture-and-recipe/

* model: add support for EmbeddingGemma SentenceTransformers dense linear projections

- converting model with dense-layers is optional
- introduced dense config params

* Update convert_hf_to_gguf.py

Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>

* fixed formatting issues

* Update src/llama-graph.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* - removed pooling_type_opt, always allow overriding pooling_type
- asserts checking dense features dims

* fix python lint

* fix ubuntu gcc build warning

* - fixed thread-safety test
- moved asserts to load_hparams

* - tidying up code
- simplifying graph-context expecting both dense weights

* minor : add TODO

---------

Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-10-09 09:39:18 +03:00
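A sketch of the added projections, assuming two Dense modules applied after pooling (tensor names hypothetical):
```cpp
#include "ggml.h"

// Apply the two SentenceTransformers Dense projections to the pooled
// embedding; in ggml, mul_mat(W, x) computes W^T x.
static struct ggml_tensor * apply_dense_modules(
        struct ggml_context * ctx0,
        struct ggml_tensor  * emb,       // [n_embd, n_seqs] pooled embeddings
        struct ggml_tensor  * dense_2,   // [n_embd, d_dense]
        struct ggml_tensor  * dense_3) { // [d_dense, n_embd]
    emb = ggml_mul_mat(ctx0, dense_2, emb); // n_embd  -> d_dense
    emb = ggml_mul_mat(ctx0, dense_3, emb); // d_dense -> n_embd
    return emb;
}
```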
Sigbjørn Skjæret 835b2b915c
model : add GroveMoE support (#15510)
* add GroveMoE support

* remove constexpr that fails on certain compilers

* revert crude scalar div implementation, use cast

* build_attn_inp_kv_unified -> build_attn_inp_kv

* fix build_attn

* re-apply ffn_exps regex changes
2025-09-25 19:50:28 +02:00
Aman Gupta 077c94d0ca
CUDA: add a fused top-K MoE kernel (#16130)
* CUDA: add a fused top-K MoE kernel

This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory

It is intended as fusion of softmax->top-k->get_rows pipeline for MoE models

* Refactor into ggml_cuda_should_use_topk_moe

* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before

* Review: format + micro-optimizations

* Fix bug: fix tie breakers

* Add optional norm + clean-up code

* Use smem for final write

* Add bounds check

* Use better memory pattern for writeback
2025-09-25 16:35:05 +02:00
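For reference, the unfused ggml pipeline this kernel replaces, roughly as built in build_moe_ffn (a sketch; names and shapes assumed):
```cpp
#include "ggml.h"

// softmax -> top-k -> get_rows: select n_expert_used experts per token
// and gather their routing weights.
static struct ggml_tensor * moe_select(
        struct ggml_context * ctx0,
        struct ggml_tensor  * logits, // [n_expert, n_tokens] router output
        int64_t n_expert, int64_t n_tokens, int n_expert_used) {
    struct ggml_tensor * probs = ggml_soft_max(ctx0, logits);
    struct ggml_tensor * ids   = ggml_top_k(ctx0, probs, n_expert_used); // I32 [n_expert_used, n_tokens]
    return ggml_get_rows(ctx0,
            ggml_reshape_3d(ctx0, probs, 1, n_expert, n_tokens), ids); // [1, n_expert_used, n_tokens]
}
```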
Douglas Hanley b5bd037832
llama : add support for qwen3 reranker (#15824) 2025-09-25 11:53:09 +03:00
Sigbjørn Skjæret b8e09f08b9
model : add grok-2 support (#15539)
* add grok-2 support

* type fix

* type fix

* type fix

* "fix" vocab for invalid sequences

* fix expert tensor mapping and spaces in vocab

* add chat template

* fix norm tensor mapping

* rename layer_out_norm to ffn_post_norm

* ensure ffn_post_norm is mapped

* fix experts merging

* remove erroneous FFN_GATE entry

* concatenate split tensors and add more metadata

* process all expert layers and try cat instead of hstack

* add support for community BPE vocab

* fix expert feed forward length and ffn_down concat

* commit this too

* add ffn_up/gate/down, unsure if sequence is right

* add ffn_gate/down/up to tensor names

* correct residual moe (still not working)

* mess--

* fix embedding scale being applied twice

* add built in chat template

* change beta fast for grok if default value

* remove spm vocab in favor of community bpe vocab

* change attention temp length metadata type to integer

* update attention temp length metadata

* remove comment

* replace M_SQRT2 with std::sqrt(2)

* add yarn metadata, move defaults to hparams
2025-09-14 23:00:59 +02:00
Sigbjørn Skjæret 6ab397e12b
graph : support non-contiguous Q in build_attn_mha (#15908)
* support non-contiguous Q in build_attn_mha

* Update src/llama-graph.cpp

ggml-ci

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-09-10 19:08:59 +02:00
Georgi Gerganov 663027fd54
context : fix n_outputs during reserve (#15858)
ggml-ci
2025-09-08 10:26:36 +03:00
Georgi Gerganov c610b6c11b
kv-cache : fix SWA checks + disable cacheless iSWA (#15811)
ggml-ci
2025-09-05 10:39:22 +03:00
Daniel Bevenius fb15d649ed
llama : add support for EmbeddingGemma 300m (#15798)
This commit adds support for EmbeddingGemma 300m. This model supports
sliding window attention (SWA), and a new swa_type is introduced to
support symmetric SWA masking.

This commit also extracts the masking logic into the function
llama_is_masked_swa in llama-impl.h, so that it can be shared by both
llm_graph_input_attn_no_cache::set_input and
llama_kv_cache::set_input_kq_mask.

With this commit the EmbeddingGemma 300m model can be converted to
GGUF and used with llama.cpp.

Once the model has been uploaded to HuggingFace, it can be used like
this:
```console
./build/bin/llama-cli -hf ggml-org/embeddinggemma-300m-GGUF:Q8_0
```
2025-09-04 18:10:29 +02:00
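A minimal sketch of the symmetric masking rule, assuming the n_swa window is split evenly around the current position (the exact bound in llama_is_masked_swa may differ by one):
```cpp
#include <cstdint>
#include <cstdlib>

// Positions farther than half the window in either direction are masked.
static bool is_masked_swa_symmetric(int32_t n_swa, int32_t p0, int32_t p1) {
    return std::abs(p1 - p0) > n_swa / 2;
}
```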