llama.cpp/tools
Sascha Rogmann 72d3b1898a
spec : add self‑speculative decoding (no draft model required) + refactor (#18471)
* server: introduce self-speculative decoding

* server: moved self-call into speculative.cpp

* can_speculate() includes self-speculation

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server: can_speculate() tests self-spec

* server: replace can_speculate() with slot.can_speculate()

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* common: use %zu format specifier for size_t in logging

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* server: can_speculate() requires a task instance

* common: ngram map, config self-speculative decoding

* common: add enum common_speculative_type

* common: add vector of speculative states

* common: add option --spec-draftless

* server: cleanup (remove slot.batch_spec, rename)

* common: moved self-spec impl to ngram-map

* common: cleanup (use common_speculative_state_draft)

* spec : refactor

* cont : naming

* spec: remove --spec-config

* doc: (draftless) speculative decoding

* common: print performance in spec decoding

* minor : cleanup

* common : better names

* minor : cleanup + fix build

* minor: comments

* CODEOWNERS: add common/ngram-map.* (#18471)

* common : rename speculative.draftless_type -> speculative.type

* ngram-map : fix uninitialized values

* ngram-map : take into account the input can become shorter

* ngram-map : revert len check for now

* arg : change `--spec-draftless` -> `--spec-type`

* spec : add common_speculative_state::accept()

* spec : refactor + add common_speculative_begin()

* spec : fix begin() call with mtmd

* spec : additional refactor + remove common_speculative_params

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-28 19:42:42 +02:00
| Name | Last commit | Date |
| --- | --- | --- |
| batched-bench | tool/ex/tests: consistently free ctx, then model (#18168) | 2025-12-22 11:00:37 +01:00 |
| cli | common : use two decimal places for float arg help messages (#19048) | 2026-01-25 07:31:42 +01:00 |
| completion | completion : fix prompt cache for recurrent models (#19045) | 2026-01-25 09:12:50 +02:00 |
| cvector-generator | common : refactor common_sampler + grammar logic changes (#17937) | 2025-12-14 10:11:13 +02:00 |
| export-lora | cmake : Do not install tools on iOS targets (#15903) | 2025-09-16 09:54:44 +07:00 |
| fit-params | llama-fit-params: keep explicit --ctx-size 0 (#19070) | 2026-01-24 22:13:08 +01:00 |
| gguf-split | cli: new CLI experience (#17824) | 2025-12-10 15:28:59 +01:00 |
| imatrix | common : refactor common_sampler + grammar logic changes (#17937) | 2025-12-14 10:11:13 +02:00 |
| llama-bench | Setting mmap and direct_io to false as default in llama-bench.cpp (#18841) | 2026-01-16 09:46:51 +01:00 |
| mtmd | mtmd : update docs to use llama_model_n_embd_inp (#18999) | 2026-01-22 14:36:32 +01:00 |
| perplexity | common : refactor common_sampler + grammar logic changes (#17937) | 2025-12-14 10:11:13 +02:00 |
| quantize | quantize: prevent input/output file collision (#18451) | 2025-12-31 23:29:03 +08:00 |
| rpc | Install rpc-server when GGML_RPC is ON. (#17149) | 2025-11-11 10:53:59 +00:00 |
| server | spec : add self‑speculative decoding (no draft model required) + refactor (#18471) | 2026-01-28 19:42:42 +02:00 |
| tokenize | cmake : Do not install tools on iOS targets (#15903) | 2025-09-16 09:54:44 +07:00 |
| tts | refactor : remove libcurl, use OpenSSL when available (#18828) | 2026-01-14 18:02:47 +01:00 |
| CMakeLists.txt | cmake: only build cli when server is enabled (#18670) | 2026-01-09 16:43:26 +01:00 |