llama.cpp

Commit Graph

Author	SHA1	Message	Date
Daniel Bevenius	9e273f7aa4	sampling : fix copying both sampled tokens and logits/probs from backend This commit fixes the issue where both sampled tokens and logits/probs were not being copied correctly from the backend to the host when multiple backend samplers were used. A test for this scenario has also been added to ensure that both types of data are copied correctly when different backend samplers are employed.	2025-11-23 13:12:01 +01:00
Daniel Bevenius	ae23d2d2c1	sampling: clarify candidate ids usage in comments	2025-11-23 11:28:19 +01:00
Daniel Bevenius	65500d05ab	sampling : add stride variable for clarity	2025-11-23 11:27:54 +01:00
Daniel Bevenius	79b8cf2a75	Merge remote-tracking branch 'upstream/master' into backend-sampling	2025-11-21 16:38:32 +01:00
ubergarm	23bc779a6e	model : detect GigaChat3-10-A1.8B as deepseek lite (#17420 ) * Detect GigaChat3-10-A1.8B as deepseek lite Hardcodes checking number of layers to detect if lite version of deepseek. * Add comment identifying deepseek lite variants deepseek lite variants include DeepSeek-V2-Lite, GigaChat3-10B-A1.8B	2025-11-21 14:51:38 +01:00
Daniel Bevenius	61ffe41dc1	sampling : use pinned memory for backend sampling buffers	2025-11-21 14:02:16 +01:00
Xuan-Son Nguyen	054a45c3d3	grammar: fix regression caused by #17381 (#17412 ) * grammar: fix regression caused by #17381 * more readable	2025-11-20 18:35:10 +01:00
Daniel Bevenius	0d28b16bdc	sampling : introduce sampling_info struct This commit introduces a sampling_info struct to encapsulate all backend sampling related data within the llama_context class. It also updates to use more descriptive names for sampled tokens and candidates in the backend sampler ggml data structure.	2025-11-20 14:45:56 +01:00
Piotr Wilkin (ilintar)	92c0b387a9	grammar : fix integer overflow (#17381 ) * Fix DoS / integer overflow * Remove optional, use INT64_MAX instead as placeholder value (it's technically -1, so it fits :) * White space * Actually, since it's unsigned, use UINT64_MAX	2025-11-20 14:47:04 +02:00
Georgi Gerganov	196f5083ef	common : more accurate sampling timing (#17382 ) * common : more accurate sampling timing * eval-callback : minor fixes * cont : add time_meas impl * cont : fix log msg [no ci] * cont : fix multiple definitions of time_meas * llama-cli : exclude chat template init from time measurement * cont : print percentage of unaccounted time * cont : do not reset timings	2025-11-20 13:40:10 +02:00
Daniel Bevenius	ed4345bdd9	squash! common : fix regression caused by extra memory allocations during sampling Apply the same changes to llama-sampling.cpp, llama_sampler_sample as were applied in commit `38f408c25`.	2025-11-20 07:56:33 +01:00
Daniel Bevenius	0c660e7390	Merge remote-tracking branch 'upstream/master' into backend-sampling	2025-11-20 06:57:24 +01:00
Daniel Bevenius	18ed4d8f96	squash! sampling : simplify backend sampling logic decode The commit fixes a variable shadowing issue in the `llama_context::decode` function which was introduced in a previous refactoring.	2025-11-19 15:10:15 +01:00
Daniel Bevenius	d74eb61aa7	squash! sampling : simplify backend sampling logic decode Fix condition to check if backend actually sampled tokens, not just that backend samplers are available.	2025-11-19 11:29:26 +01:00
Daniel Bevenius	7e98ebcc6b	sampling : simplify backend sampling logic decode This commit tries to simplify the backend sampling logic in llama_context::decode.	2025-11-19 09:31:33 +01:00
Daniel Bevenius	51fee29822	sampling : always populate logits for sampled probs This commit updates common/sampler.cpp set_logits and src/llama-sampling.cpp llama_sampler_sample to always populate the logits field when backend sampled probabilities are available. The motivation for this is that it ensures that CPU samplers always have access to the logits values even when probabilities have been produced by backend samplers.	2025-11-19 07:14:11 +01:00
Daniel Bevenius	0da7e7dccc	sampling : remove version from sampler chain This commit removes the version field from the sampler chain and instead uses the sampler pointer itself for change detection.	2025-11-19 06:59:03 +01:00
Haiyue Wang	a045492088	vocab : call reserve() for building plamo-2-translate suffix (#17343 ) Test 'Q4_K_M' quantization on https://huggingface.co/pfnet/plamo-2-translate The 'suffix_to_score' size is 193510; if the memory is not reserved up front, it needs 19 memory allocations, with a final capacity of 262144, to hold the values. Signed-off-by: Haiyue Wang <haiyuewa@163.com>	2025-11-18 18:58:22 +01:00
Daniel Bevenius	311c1a347f	sampling : ensure at most one output token per seq This commit adds a check in the batch allocator to ensure that when backend sampling is enabled, at most one output token is specified per sequence.	2025-11-18 16:06:23 +01:00
Daniel Bevenius	82957a90f2	sampling : always expose sampled_ids This commit precomputes and caches the full-vocab token id list in llama_context's constructor, so llama_get_backend_sampled_token_ids_ith always returns a valid pointer. The motivation for this is that it allows both common/sampling.cpp and src/llama-sampling.cpp to simplify their logic. Not all backend samplers that process logits need to set the sampled_tokens_id as they may not change the order of the logits, for example the temperature sampler only scales the logits but does not change their order. Similarly, the logit bias sampler only adds bias to specific token ids but does not change the order of the logits. In these cases there will not be a device to host copy of the sampled token ids, and this is the use case where having this precomputed list is useful.	2025-11-18 15:11:59 +01:00
Georgi Gerganov	4b52e59903	graph : do not include llama-model.h	2025-11-18 13:53:25 +02:00
Daniel Bevenius	7884b0e0ac	sampling : add support for backend sampling This commit adds support for performing sampling operations on the backend (e.g. GPU) as part of the model computation graph. The motivation for this feature is to enable sampling to be performed directly on the backend as part of the computation graph being executed, allowing for some or all of the sampling to be done on the backend. For example, the backend sampler chain might select/sample a token directly in which case only the sampled token needs to be transferred from device memory to host memory. It is also possible for the backend samplers to perform filtering of the logits, or compute and filter the probability distribution, in which case only the filtered logits or probabilities need to be transferred back to system memory for further processing by CPU samplers. Currently the backend sampling works in a similar manner to how pooling works, it is a function that is called by build_graph and the sampler operations become part of the model's computation graph.	2025-11-17 16:15:58 +01:00
Bartowski	e1fcf8b09b	model : add AfmoeForCausalLM support (#16477 ) * Add AFMOE model support * Update to vocab * Add model sizing * Undo Rope change for ARCEE model * Address review comments * Update modeling code is_sliding -> use_rope, replace hard-coded logic * Fix AFMOE tokenizer * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update AFMoE tokenizer class identification to be more unique --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-14 13:54:10 +01:00
Marek Hradil jr.	6cd0cf72ce	fix : Dangling pointer for non-empty trigger words in lazy grammar construction (#17048 ) * fix : Dangling pointer for non-empty trigger words in llama_sampler_init_grammar_impl (#17047) * Replace 'static' workaround, with keeping variable in scope for longer * Create std::array directly and pass into llama_grammar_init_impl * Add back the trigger pattern * Missed array include	2025-11-14 14:35:26 +02:00
Aman Gupta	a90eb94ca9	CUDA: fuse rope + set_rows (#16884 ) * CUDA: add fused rope * move k forward_expand up * create helper function instead of re-using params * make assert statement more in line with comment * rope_norm: coalesced writes to global mem	2025-11-13 08:50:01 +08:00
o7si	ffb6f3d921	vocab : correct bounds check for UGM XCDA array access (#17215 )	2025-11-12 23:41:02 +01:00
Mike Abbott	4a5b8aff40	cmake : add version to all shared object files (#17091 ) When compiling llama.cpp in Yocto, it fails QA checks because the generated so files aren't versioned. This applies a version to all generated so files, allowing the package to build without errors.	2025-11-11 13:19:50 +02:00
Sigbjørn Skjæret	7bef684118	models : move build_inp_out_ids outside loop (#17151 ) * move build_inp_out_ids outside loop * realign	2025-11-10 22:55:30 +01:00
Gabe Goodhart	0c74f32632	memory: Hybrid context shift (#17009 ) * feat(memory): Only fail partial erasure of recurrent tail The recurrent state is always assumed to be the state as of the last update from the final token in the sequence. When doing a partial erasure, if the range does not include the final token, the erasure can be considered a success since any memory used for the sequence prior to the final token (which is no memory) has been successfully removed. There is one potential case that this doesn't address which is the pruning of cache to remove sensitive data from the context. This wouldn't work for attention cache partial removal (in the middle) either since the KV state is linearly-dependent and states in later sequence positions would still be based on the state from the sensitive data, even if that data is no longer cached, so I don't think this is relevant, but it is worth noting that the semantics of this change for a partial erasure in the middle of the cache are essentially "my context is already compressed" and not "all trace of the removed tokens has been removed." https://github.com/ggml-org/llama.cpp/issues/16768 Branch: HybridContextShift-16768 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(main): Check the output of seq_rm for prefix matching This prefix matching is explicitly attempting to remove the tokens at the end of the sequence that don't match. This is the operation that can't be performed on a recurrent cache due to the state being updated in place, so if this removal fails, we need to clear the whole cache. https://github.com/ggml-org/llama.cpp/issues/16768 Branch: HybridContextShift-16768 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix(memory): Fix condition for partial erasure failure if p0 > pos Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: compilade <git@compilade.net> * style: Fix extra parens Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix(main.cpp): Set n_matching_session_tokens to 0 on cache clear https://github.com/ggml-org/llama.cpp/issues/16768 Branch: HybridContextShift-16768 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: compilade <git@compilade.net> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-10 17:14:23 +02:00
Sigbjørn Skjæret	9008027aa3	hparams : add n_embd_inp() to support extended embed (#16928 ) * add n_embd_full to support extended embed * don't change output * rename to n_embd_inp * restore n_embd where applicable	2025-11-07 19:27:58 +01:00
Georgi Gerganov	16bcc1259d	kv-cache : pad the cache size to 256 for performance (#17046 ) * kv-cache : pad the size of the small SWA cache for performance * context : pad the total context to 256 * cont : future-proof the swa pad * server : adjust test params to new logic	2025-11-07 20:03:25 +02:00
Johannes Gäßler	aa374175c3	CUDA: fix crash on uneven context without FA (#16988 )	2025-11-06 14:05:47 +01:00
Li Pengzhan	9f052478c2	model : add openPangu-Embedded (#16941 ) * Model: add openPangu-Embedded * fixed according to reviewer's comments * fixed the chat template check condition * Apply suggestions from code review change the chat-template check condition and some formatting issue Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * whitespace cleanup --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-11-05 10:28:58 +01:00
Sigbjørn Skjæret	b164259bba	chore : fix models indent after refactor (#16992 )	2025-11-04 12:29:15 +01:00
Georgi Gerganov	cd5e3b5754	server : support unified cache across slots (#16736 ) * server : support unified context across slots * cont : fix speculative decoding initialization * context : fix n_ctx_per_seq computation * server : purge slots one by one * tests : add unified cache server tests * llama : update per-seq context computation * test-thread-safety : handle tiny training context of the input model * server : fix server_tokens clear() * server : use 4 slots + unified KV by default * llama : add note about context size queries * cont : update todos [no ci] * context : do not cap the size of the context * tests : adjust parameters to be CI friendlier * context : add warning	2025-11-02 18:14:04 +02:00
Piotr Wilkin (ilintar)	bea04522ff	refactor : llama-model.cpp (#16252 ) * Squashed: llama-model.cpp refactoring * Fix formatting of attn / ffn / ffn_moe calls * Fix import regression / unify spacing in models.h * totally DID NOT miss those! * Add missing qwen3vl(moe) models * Add missing new .cpp files to build * Remove extra semicolons * Editor checker * Update src/models/models.h Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-31 23:40:23 +01:00
Piotr Wilkin (ilintar)	0de0a01576	model : Minimax M2 (#16831 ) * Model: Minimax M2 * Cleanup * Cleanup pt. 2 * Cleanup pt. 3 * Update convert_hf_to_gguf_update.py - merge catch blocks Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Remove vocab models and test * Remove all redundant hparam settings covered by TextModel * Move super to start, don't set block_count * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/constants.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-31 21:20:47 +01:00
Giuseppe Scrivano	e58d585604	model : add Granite Hybrid nano types (#16896 ) Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>	2025-10-31 21:20:07 +01:00
Georgi Gerganov	8da3c0e200	batch : fix consistency checks for the input positions (#16890 )	2025-10-31 13:50:33 +02:00
JJJYmmm	d261223d24	model: add support for qwen3vl series (#16780 ) * support qwen3vl series. Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com> Co-authored-by: yairpatch <yairpatch@users.noreply.github.com> Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com> * bugfix: fix the arch check for qwen3vl-moe. * use build_ffn * optimize deepstack structure * optimize deepstack feature saving * Revert "optimize deepstack feature saving" for temporal fix This reverts commit `f321b9fdf1`. * code clean * use fused qkv in clip * clean up / rm is_deepstack_layers for simplification * add test model * move test model to "big" section * fix imrope check * remove trailing whitespace * fix rope fail * metal : add imrope support * add imrope support for sycl * vulkan: add imrope w/o check * fix vulkan * webgpu: add imrope w/o check * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix tensor mapping --------- Co-authored-by: Thireus ☠ <Thireus@users.noreply.github.com> Co-authored-by: yairpatch <yairpatch@users.noreply.github.com> Co-authored-by: LETS-BEE <LETS-BEE@users.noreply.github.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-10-30 16:19:14 +01:00
Tianyue-Zhao	bacddc049a	model: Add support for CogVLM model (#15002 ) * Added GGUF mappings for CogVLM model * Add tensor mapping for CogVLM visual encoder * Add CogVLM to conversion script, no vision part yet * Added CogVLM vision model to conversion script * Add graph for CogVLM CLIP model * Add graph for CogVLM * Fixes for CogVLM. Now compiles. * Model now runs * Fixes for cogvlm graph * Account for graph context change after rebase * Changes for whitespace * Changes in convert script according to comments * Switch CogVLM LLM graph to merged QKV tensor * Use rope_type variable instead of direct definition * Change CogVLM CLIP encoder to use SWIGLU * Switch CogVLM CLIP to use merged QKV * Apply rebase edits and remove ggml_cont call that is now unnecessary * clean up --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2025-10-30 12:18:50 +01:00
Jan Boon	d7395115ba	llama : use std::abs instead of abs (#16853 )	2025-10-30 08:30:58 +02:00
Xuan-Son Nguyen	3464bdac37	llama: fix ASAN error with M-RoPE (#16848 )	2025-10-29 20:11:39 +01:00
Xuan-Son Nguyen	e3af5563bd	llama: store mrope data in KV cell (#16825 ) * llama: store mrope data in KV cell * correct x,y ordering * address review comments * add consistency checks * Update src/llama-kv-cache.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add TODO * fix asan error * kv-cells : improve ext handling * cont : fix headers --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-10-29 18:09:18 +01:00
Georgi Gerganov	85a7d8677b	memory : remove KV cache size padding (#16812 ) * memory : remove KV cache size padding * cont : restore padding for n_kv tensor shape * server : use slot context size instead of training context size * server : simplify context limit logic	2025-10-28 20:19:44 +02:00
Johannes Gäßler	7a0e900e36	llama: consistent ctx <-> buf order for KV cache (#16746 )	2025-10-28 11:23:54 +01:00
Diego Devesa	5a4ff43e7d	llama : disable pipeline parallelism if compute buffer allocation fails (#16748 )	2025-10-27 21:51:28 +01:00
Johannes Gäßler	945501f5ea	llama: fix leaked buffers for mmap + split files (#16765 )	2025-10-27 09:17:31 +01:00
Sigbjørn Skjæret	73a48c9790	convert : enable expert group selection for all models with it (#16691 )	2025-10-26 17:21:23 +01:00
Sigbjørn Skjæret	f696428ce8	graph : add clamping to ffn_moe_weights_sum to avoid div-by-zero (#16655 ) * add missing norm topk bias * use clamping instead, update number and add comment	2025-10-26 17:20:32 +01:00
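
Note on commit 16bcc1259d (kv-cache : pad the cache size to 256): the message says the total context is padded to 256. As a rough illustration of what rounding a size up to a multiple of 256 means, here is a minimal sketch. The helper name `pad_up` is hypothetical and is not taken from the llama.cpp sources; the actual implementation may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (an assumption for illustration, not llama.cpp code):
// round `n` up to the next multiple of `pad`. Values that are already a
// multiple of `pad` are returned unchanged.
inline std::uint32_t pad_up(std::uint32_t n, std::uint32_t pad) {
    assert(pad > 0);                    // a zero pad would divide by zero
    return ((n + pad - 1) / pad) * pad; // integer round-up, then scale back
}
```

Under this sketch, a requested context of 1000 would become `pad_up(1000, 256)`, which is 1024, while 4096 would stay at 4096.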


677 Commits
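
Note on commit a045492088 (vocab : call reserve()): the message reports that holding 193510 entries takes 19 allocations with a final capacity of 262144 when memory is not reserved. Those figures are consistent with a container whose capacity starts at 1 and doubles each time it runs out. The sketch below simulates that policy; the doubling rule is an assumption (growth factors are implementation-defined), and the names `growth_stats` and `simulate_doubling_growth` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Result of the simulation: how many allocations happened and the capacity
// reached once `n` elements fit.
struct growth_stats {
    std::size_t allocations;
    std::size_t final_capacity;
};

// Simulates growing from empty to `n` elements, assuming capacity goes
// 1, 2, 4, ... (doubling). This is an assumed policy for illustration;
// real standard library implementations may grow differently.
inline growth_stats simulate_doubling_growth(std::size_t n) {
    growth_stats s{0, 0};
    while (s.final_capacity < n) {
        s.final_capacity = (s.final_capacity == 0) ? 1 : s.final_capacity * 2;
        ++s.allocations; // each capacity change costs one allocation
    }
    return s;
}
```

For 193510 elements the simulation steps through capacities 1 up to 262144 (2^18), which is 19 allocations. Calling `reserve(193510)` up front would replace that sequence with a single allocation.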