samuel
4bcc9e261e
mtp-batch (fix): Correctly advance cache head and add MTP documentation
2025-10-11 18:51:22 -03:00
samuel
b4cbe030ac
mtp-batch (chore): Fix logit flags for speculative sampling and remove debug logs
2025-10-11 18:37:40 -03:00
samuel
913af8f48d
mtp-batch (refactor): Replace MTP boolean flags with an explicit operation enum
2025-10-10 16:44:28 -03:00
samuel
6f74ba3807
mtp-batch (fix): prevent MTP draft from polluting the cache
2025-10-09 22:27:18 -03:00
samuel
5e1d719bef
mtp-batch (feat): Create and manage sinfo for MTP
2025-10-09 15:21:23 -03:00
samuel
febd8235d2
mtp-batch (wip): fix KV cache warmup for MTP
2025-10-05 14:43:40 -03:00
samuel
67c6c069e0
mtp-batch (wip): Isolate MTP graph to prevent host embedding buffer corruption
2025-09-27 19:42:32 -03:00
samuel
75dc25e6fe
mtp-batch (wip): organize batch for MTP cache
2025-09-27 17:17:00 -03:00
samuel
3da7e7f330
mtp-batch (fix): warm MTP cache for small batch sizes
2025-09-23 22:45:11 -03:00
samuel
df64508b93
mtp-batch (wip): merge GLM graphs
2025-09-21 21:55:41 -03:00
samuel
1318b2de82
mtp-batch (wip): move MTP execution to batch format
2025-09-14 10:22:59 -03:00
samuel
8742ce0e39
feat: apply logits + greedy sampler
2025-09-06 00:21:18 -03:00
samuel
5a5bce8577
fix: add sample acceptance
2025-09-03 17:56:14 -03:00
samuel
07670a22c6
feat: implemented sampling for MTP
2025-09-03 13:25:21 -03:00
Aaron Lee
9fab53e438
fixed MTP KV cache update step in cases where prompt size > n_batch and n_ubatch
2025-09-02 17:14:09 -04:00
Aaron Lee
98bc0c6bf2
replace standard sampler with greedy sampler for MTP draft
2025-08-26 01:26:51 -04:00
Aaron Lee
6870f9790c
added proper KV cache management for MTP layers and slightly refactored
2025-08-17 04:59:36 -04:00
Aaron Lee
6e9bafc7a7
failed attempt to implement MTP; outputs tokens but KV cache management is unreasonable
2025-08-15 23:13:56 -04:00
Aaron Lee
cf0f7c0448
broad thrust of the mtp implementation
2025-08-13 02:21:17 -04:00
g2mt
94933c8c2e
server : implement universal assisted decoding (#12635)
* llama-server : implement universal assisted decoding
* Erase prompt tail for kv-cache
* set vocab_dft_compatible in common_speculative
* rename ctx_main to ctx_tgt
* move vocab_dft_compatible to spec struct
* clear mem_dft, remove mem
* detokenize id_last for incompatible models
* update comment
* add --spec-replace flag
* accept special tokens when translating between draft/main models
* Escape spec-replace
* clamp draft result to size to params.n_draft
* fix comment
* clean up code
* restore old example
* log common_speculative_are_compatible in speculative example
* fix
* Update common/speculative.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/speculative.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/speculative.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-31 14:25:23 +02:00
Georgi Gerganov
745aa5319b
llama : deprecate llama_kv_self_ API (#14030)
* llama : deprecate llama_kv_self_ API
ggml-ci
* llama : allow llama_memory_(nullptr)
ggml-ci
* memory : add flag for optional data clear in llama_memory_clear
ggml-ci
2025-06-06 14:11:15 +03:00
Georgi Gerganov
e0dbec0bc6
llama : refactor llama_context, llama_kv_cache, llm_build_context (#12181)
* llama : refactor llama_context, llama_kv_cache, llm_build_context
ggml-ci
* graph : don't mutate the KV cache during defrag
ggml-ci
* context : reduce virtuals + remove test function
ggml-ci
* context : move interface implementation to source file + factory
ggml-ci
* graph : move KV cache build functions to llama_context impl
ggml-ci
* graph : remove model reference from build_pooling
ggml-ci
* graph : remove llama_model reference
ggml-ci
* kv_cache : provide rope factors
ggml-ci
* graph : rework inputs to use only unique_ptr, remove attn input abstraction
ggml-ci
* context : remove llama_context_i abstraction
ggml-ci
* context : clean-up
ggml-ci
* graph : clean-up
ggml-ci
* llama : remove redundant keywords (struct, enum)
ggml-ci
* model : adapt gemma3
ggml-ci
* graph : restore same attention ops as on master
ggml-ci
* llama : remove TODO + fix indent
ggml-ci
2025-03-13 12:35:44 +02:00
mgroeber9110
5bbe6a9fe9
ggml : portability fixes for VS 2017 (#12150)
* Add include files for std::min/max and std::toupper/tolower
* win32: move _USE_MATH_DEFINES before includes to ensure M_PI is defined
* Use GGML_RESTRICT instead of "restrict" keyword everywhere, and use "__restrict" in MSVC plain C mode
* win32: only use __restrict in MSVC if C11/C17 support is not enabled
---------
Co-authored-by: Marcus Groeber <Marcus.Groeber@cerence.com>
2025-03-04 18:53:26 +02:00
Georgi Gerganov
abd4d0bc4f
speculative : update default params (#11954)
* speculative : update default params
* speculative : do not discard the last drafted token
2025-02-19 13:29:42 +02:00
Georgi Gerganov
afa8a9ec9b
llama : add `llama_vocab`, functions -> methods, naming (#11110)
* llama : functions -> methods (#11110)
* llama : add struct llama_vocab to the API (#11156)
ggml-ci
* hparams : move vocab params to llama_vocab (#11159)
ggml-ci
* vocab : more pimpl (#11165)
ggml-ci
* vocab : minor tokenization optimizations (#11160)
ggml-ci
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* lora : update API names (#11167)
ggml-ci
* llama : update API names to use correct prefix (#11174)
* llama : update API names to use correct prefix
ggml-ci
* cont
ggml-ci
* cont
ggml-ci
* minor [no ci]
* vocab : llama_vocab_add_[be]os -> llama_vocab_get_add_[be]os (#11174)
ggml-ci
* vocab : llama_vocab_n_vocab -> llama_vocab_n_tokens (#11174)
ggml-ci
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-01-12 11:32:42 +02:00
Georgi Gerganov
c2a16c0bdb
server : fix free of spec context and batch (#10651)
ggml-ci
2024-12-07 11:52:44 +02:00
Georgi Gerganov
9fd8c2687f
server : add more information about error (#10455)
2024-11-25 22:28:59 +02:00
Georgi Gerganov
d9d54e498d
speculative : refactor and add a simpler example (#10362)
* speculative : refactor and add a simpler example
ggml-ci
* speculative : clean-up and add comments and TODOs [no ci]
* speculative : manage context in common_speculative
ggml-ci
* speculative : simplify
ggml-ci
* speculative : simplify (cont)
ggml-ci
* speculative : add --draft-min CLI arg
* speculative : minor fixup
* make : build fixes
* speculative : do not redraft previous drafts
ggml-ci
* speculative : fix the draft sampling
ggml-ci
* speculative : fix compile warning
* common : refactor args
ggml-ci
* common : change defaults [no ci]
* common : final touches
ggml-ci
2024-11-25 09:58:41 +02:00