llama.cpp/examples/lookup
Sascha Rogmann 72d3b1898a
spec : add self‑speculative decoding (no draft model required) + refactor (#18471)
* server: introduce self-speculative decoding

* server: moved self-call into speculative.cpp

* can_speculate() includes self-speculation

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server: can_speculate() tests self-spec

* server: replace can_speculate() with slot.can_speculate()

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* common: use %zu format specifier for size_t in logging

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* server: can_speculate() requires a task instance

* common: ngram map, config self-speculative decoding

* common: add enum common_speculative_type

* common: add vector of speculative states

* common: add option --spec-draftless

* server: cleanup (remove slot.batch_spec, rename)

* common: moved self-spec impl to ngram-map

* common: cleanup (use common_speculative_state_draft)

* spec : refactor

* cont : naming

* spec: remove --spec-config

* doc: (draftless) speculative decoding

* common: print performance in spec decoding

* minor : cleanup

* common : better names

* minor : cleanup + fix build

* minor: comments

* CODEOWNERS: add common/ngram-map.* (#18471)

* common : rename speculative.draftless_type -> speculative.type

* ngram-map : fix uninitialized values

* ngram-map : take into account the input can become shorter

* ngram-map : revert len check for now

* arg : change `--spec-draftless` -> `--spec-type`

* spec : add common_speculative_state::accept()

* spec : refactor + add common_speculative_begin()

* spec : fix begin() call with mtmd

* spec : additional refactor + remove common_speculative_params

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-28 19:42:42 +02:00
..
CMakeLists.txt ggml : move AMX to the CPU backend (#10570) 2024-11-29 21:54:58 +01:00
README.md repo : update links to new url (#11886) 2025-02-15 16:40:57 +02:00
lookup-create.cpp spec : add self‑speculative decoding (no draft model required) + refactor (#18471) 2026-01-28 19:42:42 +02:00
lookup-merge.cpp common : use common_ prefix for common library functions (#9805) 2024-10-10 22:57:42 +02:00
lookup-stats.cpp spec : add self‑speculative decoding (no draft model required) + refactor (#18471) 2026-01-28 19:42:42 +02:00
lookup.cpp spec : add self‑speculative decoding (no draft model required) + refactor (#18471) 2026-01-28 19:42:42 +02:00

README.md

llama.cpp/examples/lookup

Demonstration of Prompt Lookup Decoding

https://github.com/apoorvumang/prompt-lookup-decoding

The key parameters for lookup decoding are ngram_min, ngram_max and n_draft. The first two determine the size of the ngrams to search for in the prompt for a match. The latter specifies how many subsequent tokens to draft if a match is found.

More info:

https://github.com/ggml-org/llama.cpp/pull/4484 https://github.com/ggml-org/llama.cpp/issues/4226