* add BailingMoeV2 support
* update llm types
* undo
* undo
* update llm types
* add model collection link
* update
* almost working
* correct group selection and rename n_group_exp
* avoid large top_k and use argmax instead for now
if we had something like argmax2, that would be equivalent; this works fine until then
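To make the idea concrete, here is a minimal standalone sketch of picking the top expert groups without a general top_k, by running argmax repeatedly and masking out each winner (plain C++ over precomputed per-group routing scores; illustrative only, not the actual ggml graph code):
```cpp
#include <cstdio>
#include <limits>
#include <vector>

// Select n_group_used groups by repeated argmax: for small k this is equivalent to a
// top-k selection, which is what a hypothetical "argmax2" op would give directly.
static std::vector<int> top_groups_by_repeated_argmax(std::vector<float> group_scores, int n_group_used) {
    std::vector<int> selected;
    for (int i = 0; i < n_group_used; ++i) {
        int best = 0;
        for (int g = 1; g < (int) group_scores.size(); ++g) {
            if (group_scores[g] > group_scores[best]) {
                best = g;
            }
        }
        selected.push_back(best);
        group_scores[best] = -std::numeric_limits<float>::infinity(); // mask out the winner
    }
    return selected;
}

int main() {
    // hypothetical per-group routing scores (e.g. sum of the top expert scores within each group)
    const std::vector<float> scores = { 0.1f, 0.7f, 0.3f, 0.9f };
    for (int g : top_groups_by_repeated_argmax(scores, 2)) {
        std::printf("selected group %d\n", g);
    }
    return 0;
}
```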
* poke
* skip group selection when there are no tokens
* fix 1T conversion
* hopefully fixed expert group selection
third time's the charm?
* make expert group selection generally available
The new LLaDA2Moe model uses this method too, so make it generally available regardless of architecture.
* allow n_expert_groups to be 1 (Kimi K2)
* address review suggestions
## Why it failed
When compiling with strict compiler flags (-Wmissing-braces -Werror=missing-braces),
the build fails with the following error:
```
cmake \
-S . \
-B ../llama.cpp.build \
--preset=x64-linux-gcc-debug \
-DCMAKE_INSTALL_PREFIX=/tmp/local \
-DCMAKE_CXX_FLAGS="-Wmissing-braces -Werror=missing-braces" && \
cmake --build ../llama.cpp.build/
...
In file included from /home/otegami/work/cpp/llama.cpp/src/llama-graph.h:4,
from /home/otegami/work/cpp/llama.cpp/src/llama-model.h:5,
from /home/otegami/work/cpp/llama.cpp/src/llama.cpp:8:
/home/otegami/work/cpp/llama.cpp/src/llama-batch.h:126:48: error: missing braces around initializer for 'std::__array_traits<int, 1>::_Type' {aka 'int [1]'} [-Werror=missing-braces]
126 | std::array<llama_seq_id, 1> seq_id_0 = { 0 }; // default sequence id
| ^
cc1plus: some warnings being treated as errors
```
The issue is that std::array is an aggregate wrapping a C-style array: single braces are still valid C++ thanks to brace elision, but -Wmissing-braces flags them, so fully braced initialization is needed to build cleanly with this warning enabled.
## How to fix
This PR changes `{ 0 }` to `{{ 0 }}` for std::array initialization.
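Concretely, the change amounts to the following (quoting the declaration from the error output above):
```cpp
// before: valid C++ thanks to brace elision, but rejected under -Werror=missing-braces
std::array<llama_seq_id, 1> seq_id_0 = { 0 };   // default sequence id

// after: fully braced, with outer braces for std::array and inner braces for its underlying C array
std::array<llama_seq_id, 1> seq_id_0 = {{ 0 }}; // default sequence id
```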
This is part of a series of commits to fix missing braces warnings across the codebase.
- src/llama-batch.h <- This PR is here.
- src/llama-context.cpp
- tests/test-backend-ops.cpp
- tests/test-gguf.cpp
- tools/mtmd/clip.cpp
Benefits:
- Fully braced initialization reflects that std::array is a struct containing a C-style array
- Enables building with stricter compiler warnings (-Wmissing-braces) to catch potential initialization issues
The unexpected pooling_type warning was incorrectly shown when users did not
specify the --pooling-type parameter. In this case, the parameter
defaults to `LLAMA_POOLING_TYPE_UNSPECIFIED (-1)`, and the code
automatically applies the model's default pooling type.
Example of spurious warning:
```
$ llama-embedding -hf ggml-org/bge-m3-Q8_0-GGUF -p "hello"
...
llama_init_from_model: model default pooling_type is [2], but [-1] was specified
...
```
This fix ensures the warning only appears when users explicitly specify
a pooling type that differs from the model's default (e.g., using
--pooling-type mean on a model that expects CLS pooling).
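A minimal, self-contained sketch of the intended check (illustrative names and values, not the exact llama.cpp code): warn only when the user explicitly requested a pooling type and it differs from the model's default.
```cpp
#include <cstdio>

// stand-in values mirroring llama.h, where LLAMA_POOLING_TYPE_UNSPECIFIED is -1
enum pooling_type_sketch {
    POOLING_UNSPECIFIED = -1,
    POOLING_MEAN        =  1,
    POOLING_CLS         =  2,
};

static void check_pooling(int requested, int model_default) {
    // no warning when the user left --pooling-type unset; the model default is applied silently
    if (requested != POOLING_UNSPECIFIED && requested != model_default) {
        std::printf("warning: model default pooling_type is [%d], but [%d] was specified\n",
                    model_default, requested);
    }
}

int main() {
    check_pooling(POOLING_UNSPECIFIED, POOLING_CLS); // unspecified: no warning
    check_pooling(POOLING_MEAN,        POOLING_CLS); // explicit mismatch: warning
    return 0;
}
```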
* llama-quant: add support for mmproj
* Update src/llama.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* check prefix instead
* small fix
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* hparams : add check for layer index in is_recurrent
This commit adds a check in the is_recurrent method to ensure that the
provided layer index is within the valid range.
The motivation for this change is to prevent potential out-of-bounds access
and to be consistent with other methods in the class that perform
similar checks, like is_swa.
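A rough sketch of the shape of that check (a simplified stand-in struct, not the actual llama.cpp declarations):
```cpp
#include <array>
#include <cassert>
#include <cstdint>

// simplified stand-in for the hparams struct; only the parts relevant to the check are shown
struct hparams_sketch {
    uint32_t n_layer = 0;
    std::array<bool, 512> recurrent_layer_arr = {{ false }};

    bool is_recurrent(uint32_t il) const {
        // reject out-of-range layer indices instead of silently reading past the array,
        // mirroring the bounds checks already done by methods like is_swa
        assert(il < n_layer && "layer index out of range");
        return recurrent_layer_arr[il];
    }
};

int main() {
    hparams_sketch hp;
    hp.n_layer = 2;
    hp.recurrent_layer_arr[1] = true;
    return hp.is_recurrent(1) ? 0 : 1;
}
```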
* minor : code style
* server : fix prompt similarity calculation
* server : initial host-memory prompt caching
* cont
* server : refactor
* cont
* cont : make the server task of the slot const
* cont : minor [no ci]
* server : cache prompts and checkpoints only for completion tasks
* server : improve prompt caching logic
* cont : fix check for number of cached prompts [no ci]
* server : improve caching logic, add -cram CLI arg
* server : print prompt mismatch info
* cont : better naming [no ci]
* server : improve prompt cache loading logic
* server : add option to debug the slot contents (#16482)
* server : add option to debug the slot contents
* Update tools/server/server.cpp
---------
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
* server : add option to disable prompt cache
---------
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
* model: EmbeddingGemma sentence-transformers dense linear projections support
* model: add support for EmbeddingGemma SentenceTransformers dense linear projections
Adding support for the Dense modules used in EmbeddingGemma models.
EmbeddingGemma is a SentenceTransformers model with additional modules beyond the base Transformer backbone.
See: https://developers.googleblog.com/en/gemma-explained-embeddinggemma-architecture-and-recipe/
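As a toy illustration of what these extra Dense modules amount to at inference time (made-up sizes and weights, no activation, plain C++; not the conversion or graph code): linear projections applied to the pooled sentence embedding after the backbone.
```cpp
#include <cstdio>
#include <vector>

using mat = std::vector<std::vector<float>>; // row-major [rows][cols]

// y = W * x (a single Dense/linear projection without bias or activation)
static std::vector<float> linear(const mat & w, const std::vector<float> & x) {
    std::vector<float> y(w.size(), 0.0f);
    for (size_t r = 0; r < w.size(); ++r) {
        for (size_t c = 0; c < x.size(); ++c) {
            y[r] += w[r][c] * x[c];
        }
    }
    return y;
}

int main() {
    const std::vector<float> pooled = { 0.5f, -1.0f, 0.25f };                    // pooled embedding (dim 3)
    const mat dense_1 = { { 1, 0, 0 }, { 0, 1, 0 }, { 0, 0, 1 }, { 1, 1, 1 } };  // projects 3 -> 4
    const mat dense_2 = { { 1, 0, 0, 0 }, { 0, 1, 0, 0 }, { 0, 0, 1, 1 } };      // projects 4 -> 3

    const std::vector<float> out = linear(dense_2, linear(dense_1, pooled));     // chained Dense modules
    for (float v : out) {
        std::printf("%.2f ", v);
    }
    std::printf("\n");
    return 0;
}
```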
* model: add support for EmbeddingGemma SentenceTransformers dense linear projections
- converting the model with dense layers is optional
- introduced dense config params
* Update convert_hf_to_gguf.py
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
* fixed formatting issues
* Update src/llama-graph.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* - removed pooling_type_opt, always allow overriding pooling_type
- asserts checking dense features dims
* fix python lint
* fix ubuntu gcc build warning
* - fixed thread-safety test
- moved asserts to load_hparams
* - tidying up code
- simplifying graph-context expecting both dense weights
* minor : add TODO
---------
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* implement --no-host to disable host buffer
* fix equal_mparams
* move no-host enumeration order together with other model params
---------
Co-authored-by: slaren <slarengh@gmail.com>
* fix: Fix duplicate fake image before token on first slice
Branch: GraniteDoclingStopping
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use double-newline before overview image
Branch: GraniteDoclingStopping
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove incorrect newline at the end of granite chat template gen prompt
There should not be one, even for the language models.
Branch: GraniteDoclingStopping
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* tests: Remove bad newline from granite chat template test (legacy)
Branch: GraniteDoclingStopping
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add granite-docling conversion using trillion pretokenizer
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add granite-docling vocab pre enum
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use granite-docling pre
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add clip_is_idefics3
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Allow multi-token boundary sequences for image templating
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add tiling support for idefics3 in clip.cpp
This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Partial support for full templating for idefics3 in mtmd
There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately then concatenated.
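A toy illustration of that concatenation effect (greedy longest-match over a made-up vocabulary, not the real tokenizer): tokenizing the blocks separately and then concatenating the ids is not the same as tokenizing the joined text, because only the joined text can merge into the single double-newline token.
```cpp
#include <cstdio>
#include <string>
#include <vector>

// made-up vocabulary, longest entries first so greedy matching prefers "\n\n" over "\n"
static const std::vector<std::string> vocab = { "\n\n", "\n", "<img>" };

static std::vector<int> tokenize(const std::string & text) {
    std::vector<int> out;
    size_t pos = 0;
    while (pos < text.size()) {
        bool matched = false;
        for (size_t id = 0; id < vocab.size(); ++id) {
            if (text.compare(pos, vocab[id].size(), vocab[id]) == 0) {
                out.push_back((int) id);
                pos += vocab[id].size();
                matched = true;
                break;
            }
        }
        if (!matched) {
            ++pos; // skip unknown bytes; good enough for this toy example
        }
    }
    return out;
}

int main() {
    std::vector<int> separate = tokenize("\n");          // block 1 ends with "\n"
    for (int t : tokenize("\n")) separate.push_back(t);  // block 2 starts with "\n"
    const std::vector<int> joined = tokenize("\n\n");    // same text tokenized in one pass

    std::printf("separate blocks: %zu tokens, joined text: %zu token(s)\n",
                separate.size(), joined.size());
    return 0;
}
```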
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Fully working image preprocessing for idefics3 w/ resize and slicing
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use the longest side instead of size * scale_factor
For Granite Docling, these come out to the same value, but that was just a
coincidence.
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Allow batch encoding and remove clip_is_idefics3
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Remove unnecessary conditionals for empty token vectors
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Use image_manipulation util
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* add test model
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* initial commit for branch 3
* generalize `swa_checkpoint` to `ctx_checkpoint`
this extends `llama-server`'s SWA checkpointing logic to include
hybrid/recurrent models such as Jamba and Granite
* oops
* disable debug prints
* keep backwards compat with `--swa-checkpoints`
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* update prompt re-processing message
* fix off-by-one error per GG
* keep `seq_rm` log per GG
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server : fix checkpoint logic to support recurrent caches
* server : cleanup and fixes
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>