llama.cpp

Commit Graph

Author	SHA1	Message	Date
Aaron Lee	d10a5a4a5b	clean up mtp sample typing after rebase	2025-12-21 17:53:27 -05:00
samuel	fe2baf5e2d	Squashed commit of the following: commit 912ed2cd9339d1b2875d98744ca5b51fa62e581e Author: samuel <samueloliveira32df@gmail.com> Date: Sun Dec 7 23:00:29 2025 -0300 speculative (feat): implement recursive MTP drafting for GLM-4.5 commit bdf72d9552e3da64ffc85f175664713388752914 Author: samuel <samueloliveira32df@gmail.com> Date: Sat Dec 6 16:10:16 2025 -0300 sampling (feat): optimize speculative drafting with fast-path selection commit a91980a8f3475a6bbac0a64d8be06dd4b613020e Author: samuel <samueloliveira32df@gmail.com> Date: Sat Dec 6 15:18:19 2025 -0300 mtp (chore): clean old code commit 6de0ecf55db8567db4faa99b0152b72c9e854548 Author: samuel <samueloliveira32df@gmail.com> Date: Sat Dec 6 14:40:13 2025 -0300 mtp (feat): add mtp arg commit ea77394183b8e6c368af969b8274039a54b11486 Author: samuel <samueloliveira32df@gmail.com> Date: Sat Dec 6 13:47:54 2025 -0300 mtp-graph (fix): move llama_get_logits_ith outside the loop commit 15dff208958fb66802f20ec53ce5fcaff133edb7 Merge: 171346c74 cae85fe53 Author: samuel <samueloliveira32df@gmail.com> Date: Thu Oct 16 13:44:41 2025 -0300 Merge branch 'glm4-mtp-batch' of https://github.com/SamuelOliveirads/llama.cpp into glm4-mtp-graph-cache commit cae85fe531 Author: samuel <samueloliveira32df@gmail.com> Date: Thu Oct 16 13:42:31 2025 -0300 mtp-batch(fix): avoid logits for mtp kv cache operations commit 171346c742c310bbcfbd786b61250638ccf8b44d Author: samuel <samueloliveira32df@gmail.com> Date: Sun Oct 12 16:33:01 2025 -0300 mtp-graph(feat): Reactivate graph reuse only for main model path commit 0127c6beeb Author: samuel <samueloliveira32df@gmail.com> Date: Sat Oct 11 22:20:54 2025 -0300 mtp-batch(chore): Remove final MTP debug logs and dead code commit 4bcc9e261e Author: samuel <samueloliveira32df@gmail.com> Date: Sat Oct 11 18:51:22 2025 -0300 mtp-batch(fix): Correctly advance cache head and add MTP documentation commit b4cbe030ac Author: samuel <samueloliveira32df@gmail.com> Date: Sat Oct 11 18:37:40 2025 -0300 mtp-batch(chore): Fix logit flags for speculative sampling and remove debug logs commit a99709d0c1 Author: samuel <samueloliveira32df@gmail.com> Date: Fri Oct 10 17:24:34 2025 -0300 mtp-batch(refactor): Extract decode context and MTP input logic into helper methods commit 913af8f48d Author: samuel <samueloliveira32df@gmail.com> Date: Fri Oct 10 16:44:28 2025 -0300 mtp-batch(refactor): Replace MTP boolean flags with an explicit operation enum commit 6f74ba3807 Author: samuel <samueloliveira32df@gmail.com> Date: Thu Oct 9 22:27:18 2025 -0300 mtp-batch (fix): prevent mtp draft from polluting the cache commit 5e1d719bef Author: samuel <samueloliveira32df@gmail.com> Date: Thu Oct 9 15:21:23 2025 -0300 mtp-batch (feat): Create and manage sinfo for MTP commit febd8235d2 Author: samuel <samueloliveira32df@gmail.com> Date: Sun Oct 5 14:43:40 2025 -0300 mtp-batch (wip): fix how to warmup kv cache for MTP commit 67c6c069e0 Author: samuel <samueloliveira32df@gmail.com> Date: Sat Sep 27 19:42:32 2025 -0300 mtp-batch (wip): Isolate MTP graph to prevent host embedding buffer corruption commit 75dc25e6fe Author: samuel <samueloliveira32df@gmail.com> Date: Sat Sep 27 17:17:00 2025 -0300 mtp-batch (wip): organize batch for mtp cache commit 3da7e7f330 Author: samuel <samueloliveira32df@gmail.com> Date: Tue Sep 23 22:45:11 2025 -0300 mtp-batch (fix): warm mtp cache for small batch size commit df64508b93 Author: samuel <samueloliveira32df@gmail.com> Date: Sun Sep 21 21:55:41 2025 -0300 mtp-batch (wip): merge glm graphs commit 042eb8a829 Author: samuel <samueloliveira32df@gmail.com> Date: Sun Sep 21 21:29:00 2025 -0300 mtp-batch (wip): merge mtp and model graph commit 1318b2de82 Author: samuel <samueloliveira32df@gmail.com> Date: Sun Sep 14 10:22:59 2025 -0300 mtp-batch (wip): move mtp execution to batch format commit c6237c71ff Merge: 9fab53e43 8742ce0e3 Author: Aaron Lee <lee.aaron.65@gmail.com> Date: Sat Sep 13 02:57:01 2025 -0400 Merge pull request #1 from SamuelOliveirads/glm4-moe-mtp feat: implemented sampling for MTP commit 8742ce0e39 Author: samuel <samueloliveira32df@gmail.com> Date: Sat Sep 6 00:21:18 2025 -0300 feat: apply logits + greedy sampler commit 5a5bce8577 Author: samuel <samueloliveira32df@gmail.com> Date: Wed Sep 3 17:56:14 2025 -0300 fix: add sample acceptance commit 07670a22c6 Author: samuel <samueloliveira32df@gmail.com> Date: Wed Sep 3 13:25:21 2025 -0300 feat: implemented sampling for MTP commit 9fab53e438 Author: Aaron Lee <lee.aaron.65@gmail.com> Date: Tue Sep 2 17:14:09 2025 -0400 fixed mtp kv cache update step in cases where prompt size > n_batch and n_ubatch commit 98bc0c6bf2 Author: Aaron Lee <lee.aaron.65@gmail.com> Date: Tue Aug 26 01:26:51 2025 -0400 replace standard sampler with greedy sampler for mtp draft commit 471e026327 Author: Aaron Lee <lee.aaron.65@gmail.com> Date: Tue Aug 19 23:10:56 2025 -0400 fixed vram leak commit d72f9d5691 Author: Aaron Lee <lee.aaron.65@gmail.com> Date: Tue Aug 19 01:50:34 2025 -0400 kludge-y kv cache management of mtp layer commit 382135aa36 Author: Aaron Lee <lee.aaron.65@gmail.com> Date: Sun Aug 17 21:54:45 2025 -0400 fixed mtp kv cache update sequencing after prompt processing commit 6870f9790c Author: Aaron Lee <lee.aaron.65@gmail.com> Date: Sun Aug 17 04:59:36 2025 -0400 added proper KV cache management for MTP layers and slightly refactored commit 6e9bafc7a7 Author: Aaron Lee <lee.aaron.65@gmail.com> Date: Fri Aug 15 23:13:56 2025 -0400 failed attempt to implement MTP; outputs tokens but KV cache management is unreasonable commit cf0f7c0448 Author: Aaron Lee <lee.aaron.65@gmail.com> Date: Wed Aug 13 02:21:17 2025 -0400 broad thrust of the mtp implementation commit 03231da69e Author: Aaron Lee <lee.aaron.65@gmail.com> Date: Tue Aug 12 01:03:59 2025 -0400 add model member function to build mtp graph, to be called from speculative.cpp commit 1f477b3755 Author: Aaron Lee <lee.aaron.65@gmail.com> Date: Mon Aug 11 20:54:45 2025 -0400 make nextn weights loadable without a crash commit e434f87cc7 Author: Aaron Lee <lee.aaron.65@gmail.com> Date: Mon Aug 11 01:21:47 2025 -0400 some work towards building mtp layer graph commit db60623e79 Author: Aaron Lee <lee.aaron.65@gmail.com> Date: Sun Aug 10 23:52:54 2025 -0400 added getter for nextn layer count and server slot has_mtp property	2025-12-21 17:23:35 -05:00
Georgi Gerganov	4301e27319	common : restore grammar-based rejection sampling (#18137) * common : restart grammar-based rejection sampling * sampling : allow null samplers	2025-12-17 19:46:00 +02:00
Georgi Gerganov	254098a279	common : refactor common_sampler + grammar logic changes (#17937) * common : refactor common_sampler + grammar logic changes * tests : increase max_tokens to get needed response * batched : fix uninitialized samplers	2025-12-14 10:11:13 +02:00
Georgi Gerganov	e92d53b29e	sampling : optimize samplers by reusing bucket sort (#15665) * sampling : optimize sorting using bucket sort in more places ggml-ci * sampling : do not sort in dist sampler ggml-ci * sampling : avoid heap allocations for sort buffers ggml-ci * common : add option to sort sampling candidates by probability ggml-ci * sampling : revert the change for preserving sort buffers * sampling : use std::copy instead of memcpy * sampling : clarify purpose of partial sort helpers ggml-ci * cont : remove wrong comment [no ci] * common : update comment Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-08-31 20:41:02 +03:00
g2mt	94933c8c2e	server : implement universal assisted decoding (#12635) * llama-server : implement universal assisted decoding * Erase prompt tail for kv-cache * set vocab_dft_compatible in common_speculative * rename ctx_main to ctx_tgt * move vocab_dft_compatible to spec struct * clear mem_dft, remove mem * detokenize id_last for incompatible models * update comment * add --spec-replace flag * accept special tokens when translating between draft/main models * Escape spec-replace * clamp draft result to size to params.n_draft * fix comment * clean up code * restore old example * log common_speculative_are_compatible in speculative example * fix * Update common/speculative.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update common/speculative.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update common/speculative.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-07-31 14:25:23 +02:00
Georgi Gerganov	745aa5319b	llama : deprecate llama_kv_self_ API (#14030) * llama : deprecate llama_kv_self_ API ggml-ci * llama : allow llama_memory_(nullptr) ggml-ci * memory : add flag for optional data clear in llama_memory_clear ggml-ci	2025-06-06 14:11:15 +03:00
Georgi Gerganov	e0dbec0bc6	llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181) * llama : refactor llama_context, llama_kv_cache, llm_build_context ggml-ci * graph : don't mutate the KV cache during defrag ggml-ci * context : reduce virtuals + remove test function ggml-ci * context : move interface implementation to source file + factory ggml-ci * graph : move KV cache build functions to llama_context impl ggml-ci * graph : remove model reference from build_pooling ggml-ci * graph : remove llama_model reference ggml-ci * kv_cache : provide rope factors ggml-ci * graph : rework inputs to use only unique_ptr, remove attn input abstraction ggml-ci * context : remove llama_context_i abstraction ggml-ci * context : clean-up ggml-ci * graph : clean-up ggml-ci * llama : remove redundant keywords (struct, enum) ggml-ci * model : adapt gemma3 ggml-ci * graph : restore same attention ops as on master ggml-ci * llama : remove TODO + fix indent ggml-ci	2025-03-13 12:35:44 +02:00
mgroeber9110	5bbe6a9fe9	ggml : portability fixes for VS 2017 (#12150) * Add include files for std::min/max and std::toupper/tolower * win32: move _USE_MATH_DEFINES before includes to ensure M_PI is defined * Use GGML_RESTRICT instead of "restrict" keyword everywhere, and use "__restrict" in MSVC plain C mode * win32: only use __restrict in MSVC if C11/C17 support is not enabled --------- Co-authored-by: Marcus Groeber <Marcus.Groeber@cerence.com>	2025-03-04 18:53:26 +02:00
Georgi Gerganov	abd4d0bc4f	speculative : update default params (#11954) * speculative : update default params * speculative : do not discard the last drafted token	2025-02-19 13:29:42 +02:00
Georgi Gerganov	afa8a9ec9b	llama : add `llama_vocab`, functions -> methods, naming (#11110) * llama : functions -> methods (#11110) * llama : add struct llama_vocab to the API (#11156) ggml-ci * hparams : move vocab params to llama_vocab (#11159) ggml-ci * vocab : more pimpl (#11165) ggml-ci * vocab : minor tokenization optimizations (#11160) ggml-ci Co-authored-by: Diego Devesa <slarengh@gmail.com> * lora : update API names (#11167) ggml-ci * llama : update API names to use correct prefix (#11174) * llama : update API names to use correct prefix ggml-ci * cont ggml-ci * cont ggml-ci * minor [no ci] * vocab : llama_vocab_add_[be]os -> llama_vocab_get_add_[be]os (#11174) ggml-ci * vocab : llama_vocab_n_vocab -> llama_vocab_n_tokens (#11174) ggml-ci --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-01-12 11:32:42 +02:00
Georgi Gerganov	c2a16c0bdb	server : fix free of spec context and batch (#10651) ggml-ci	2024-12-07 11:52:44 +02:00
Georgi Gerganov	9fd8c2687f	server : add more information about error (#10455)	2024-11-25 22:28:59 +02:00
Georgi Gerganov	d9d54e498d	speculative : refactor and add a simpler example (#10362) * speculative : refactor and add a simpler example ggml-ci * speculative : clean-up and add comments and TODOs [no ci] * speculative : manage context in common_speculative ggml-ci * speculative : simplify ggml-ci * speculative : simplify (cont) ggml-ci * speculative : add --draft-min CLI arg * speculative : minor fixup * make : build fixes * speculative : do not redraft previous drafts ggml-ci * speculative : fix the draft sampling ggml-ci * speculative : fix compile warning * common : refactor args ggml-ci * common : change defaults [no ci] * common : final touches ggml-ci	2024-11-25 09:58:41 +02:00

14 Commits