# Speculative Decoding

llama.cpp supports speculative decoding, a technique that can significantly accelerate token generation by predicting multiple tokens ahead of the main model.

[Speculative decoding](https://en.wikipedia.org/wiki/Transformer_(deep_learning)#Speculative_decoding) leverages the fact that computing n tokens in a batch (as in prompt processing) is more efficient than computing n tokens sequentially (as in response generation). By generating draft tokens quickly and then verifying them with the target model in a single batch, this approach can achieve substantial speedups when the draft predictions are frequently correct.
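
As a minimal illustration (a toy sketch, not llama.cpp's actual code), the verification step can be thought of as follows: the target model scores every draft position in one batch, the longest matching prefix of the draft is accepted, and at the first mismatch the target's own token is kept while the rest of the draft is discarded.

```cpp
#include <cstdio>
#include <vector>

// Toy sketch of the verification step. `target` holds the token the target
// model would emit at each draft position; in reality these come from a
// single batched forward pass over all draft positions.
static std::vector<int> verify(const std::vector<int> & draft,
                               const std::vector<int> & target) {
    std::vector<int> accepted;
    for (size_t i = 0; i < draft.size(); ++i) {
        if (draft[i] == target[i]) {
            accepted.push_back(draft[i]);   // draft token confirmed
        } else {
            accepted.push_back(target[i]);  // first mismatch: keep the
            break;                          // target's token, drop the rest
        }
    }
    return accepted;
}

int main() {
    const std::vector<int> draft  = {5, 9, 2, 7};
    const std::vector<int> target = {5, 9, 4, 7};
    for (int t : verify(draft, target)) {
        printf("%d ", t);                   // prints: 5 9 4
    }
    printf("\n");
}
```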

## Implementations

The `llama-server` application supports several implementations of speculative decoding. An implementation that uses a draft model can be combined with an implementation that does not.

### Draft Model (`draft`)

A much smaller model (called the _draft model_) generates the drafts. This is the most widely used approach to speculative decoding.
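
A typical invocation might look like this (the model files are placeholders; `-md`/`--model-draft` selects the draft model):

```
llama-server -m target-model-7b.gguf -md draft-model-0.5b.gguf --draft-max 16
```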

### n-gram Cache (`ngram-cache`)

An n-gram is a sequence of n tokens. The n-gram cache implementation maintains statistics about short n-gram sequences.

A draft is computed using probabilities derived from these statistics. External statistics can also be loaded from files for improved accuracy.
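
A minimal sketch of the underlying idea, assuming a simple count table (the actual data structures and file format are described in the PRs listed below): count which token tends to follow each short n-gram, and draft the most frequent continuation.

```cpp
#include <cstdio>
#include <map>
#include <vector>

using ngram = std::vector<int>;

int main() {
    // Toy history; count which token follows each 2-gram.
    const std::vector<int> history = {1, 2, 3, 1, 2, 3, 1, 2, 4};
    std::map<ngram, std::map<int, int>> stats;
    for (size_t i = 2; i < history.size(); ++i) {
        ngram key = {history[i - 2], history[i - 1]};
        stats[key][history[i]]++;
    }
    // Draft the most frequent continuation of the current 2-gram.
    const ngram cur = {1, 2};
    int best = -1, best_count = 0;
    for (auto [tok, count] : stats[cur]) {
        if (count > best_count) { best = tok; best_count = count; }
    }
    printf("draft token: %d (seen %d times)\n", best, best_count); // 3 (2 times)
}
```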

See:

- #5479, #6828, #6848

### n-gram Map (`ngram-simple`, `ngram-map-*`)

These implementations search the token history for patterns and use matching sequences as draft candidates. They require no additional model but rely on patterns that have already appeared in the generated text. A typical use case is rewriting source code with an LLM.

#### n-gram Map (`ngram-simple`)

This implementation looks for the last n-gram in the history that matches the current n-gram and creates a draft using the m tokens following the matched n-gram. It is the simplest self-speculative approach, with minimal overhead.

```
llama-server [...] --spec-type ngram-simple --draft-max 64
```
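
A minimal sketch of the lookup, using greatly simplified data structures (the real implementation differs): scan the history backwards for the most recent earlier occurrence of the trailing n-gram and propose the tokens that followed it.

```cpp
#include <cstdio>
#include <vector>

// Toy sketch of an ngram-simple style lookup (not the actual implementation):
// find the most recent earlier match of the trailing n-gram and draft up to
// m tokens that followed it.
static std::vector<int> draft_ngram_simple(const std::vector<int> & hist,
                                           size_t n, size_t m) {
    if (hist.size() < n + 1) return {};
    const size_t cur = hist.size() - n; // start of the current n-gram
    for (size_t pos = cur; pos-- > 0; ) {
        bool match = true;
        for (size_t i = 0; i < n; ++i) {
            if (hist[pos + i] != hist[cur + i]) { match = false; break; }
        }
        if (match) {
            std::vector<int> draft;
            for (size_t i = pos + n; i < hist.size() && draft.size() < m; ++i) {
                draft.push_back(hist[i]);
            }
            return draft;
        }
    }
    return {};
}

int main() {
    const std::vector<int> hist = {7, 8, 9, 1, 2, 3, 4, 7, 8, 9};
    for (int t : draft_ngram_simple(hist, 3, 4)) {
        printf("%d ", t); // prints: 1 2 3 4
    }
    printf("\n");
}
```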

#### n-gram Map Key (`ngram-map-k`)

This implementation looks for the current n-gram of size n (called the _key_) in the token history. If the key n-gram is followed by the same m tokens (called the _mgram_) multiple times, it creates a draft using these m tokens. This approach requires a minimum number of occurrences (argument `--spec-ngram-min-hits`, default 1) before generating drafts.

The number of accepted tokens is stored for each used n-gram.

**Example:**

```
llama-server [...] --spec-type ngram-map-k --draft-max 64
```
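
A minimal sketch of the bookkeeping, under simplified assumptions (the real implementation stores more, e.g. accepted-token counts): a map from key n-gram to its last observed continuation plus a hit counter, drafting only once the same continuation has been seen `min_hits` times.

```cpp
#include <cstdio>
#include <map>
#include <vector>

using tokens = std::vector<int>;

struct entry {
    tokens mgram;    // continuation last seen after this key
    int    hits = 0; // how often the same continuation occurred
};

int main() {
    const int n = 2, m = 3, min_hits = 2;
    const tokens hist = {1, 2, 8, 8, 8, 1, 2, 8, 8, 8, 1, 2};
    // Build the key -> mgram map over the history.
    std::map<tokens, entry> map;
    for (size_t i = 0; i + n + m <= hist.size(); ++i) {
        tokens key(hist.begin() + i, hist.begin() + i + n);
        tokens val(hist.begin() + i + n, hist.begin() + i + n + m);
        entry & e = map[key];
        if (e.mgram == val) e.hits++;
        else { e.mgram = val; e.hits = 1; }
    }
    // Draft only if the current key was followed by the same mgram often enough.
    const tokens cur(hist.end() - n, hist.end());
    auto it = map.find(cur);
    if (it != map.end() && it->second.hits >= min_hits) {
        for (int t : it->second.mgram) printf("%d ", t); // prints: 8 8 8
        printf("\n");
    }
}
```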

#### n-gram Map Key-4-Values (`ngram-map-k4v`)

This experimental implementation looks for the current n-gram of size n (called the _key_) in the token history. For each key, up to four _values_ (n-grams of size m, called _mgrams_) are tracked. An internal statistic counts the occurrences of each mgram after the key n-gram. If one mgram is significantly more frequent than the others, it is used as the draft.

The number of accepted tokens is stored for each used n-gram.

**Example:** Server options suited to text containing many long repetitions.

```
llama-server [...] --spec-type ngram-map-k4v --spec-ngram-size-n 8 --spec-ngram-size-m 8 --spec-ngram-min-hits 2 --draft-max 64
```
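
A sketch of the selection rule alone, with an assumed "more than half of all sightings" dominance threshold (the actual criterion is an internal detail of the implementation):

```cpp
#include <cstdio>
#include <vector>

// Toy sketch of the k4v selection rule: each key tracks up to four candidate
// mgrams with occurrence counts; a draft is produced only if one candidate
// clearly dominates the others.
struct slot { int mgram_id; int count; };

int main() {
    // Candidate continuations observed after the current key n-gram.
    const std::vector<slot> slots = {{101, 7}, {102, 1}, {103, 1}, {104, 0}};
    size_t best = 0;
    int total = 0;
    for (size_t i = 0; i < slots.size(); ++i) {
        total += slots[i].count;
        if (slots[i].count > slots[best].count) best = i;
    }
    // Assumed dominance threshold: more than half of all sightings.
    if (2 * slots[best].count > total) {
        printf("draft mgram id: %d\n", slots[best].mgram_id); // 101
    }
}
```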

### n-gram Mod (`ngram-mod`)

This implementation uses a basic n-gram hasher (see the sketch after this list):

- For each n-gram, compute a hash using an LCG
- For each computed hash, store the next token
- During speculation, iteratively compute the rolling hash of the last n tokens and pick the next token from the storage
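
A minimal sketch of the scheme, with illustrative LCG constants and a naive linear-scan pool (the actual implementation uses a fixed-size shared hash pool and a rolling hash update):

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Toy LCG-based n-gram hash; the constants here are illustrative only.
static uint64_t hash_ngram(const std::vector<int> & hist, size_t pos, size_t n) {
    uint64_t h = 0;
    for (size_t i = 0; i < n; ++i) {
        h = h * 6364136223846793005ULL + (uint64_t) hist[pos + i] + 1442695040888963407ULL;
    }
    return h;
}

int main() {
    const size_t n = 3;
    const std::vector<int> hist = {1, 2, 3, 4, 1, 2, 3};
    // Fill the pool: hash of each n-gram -> the token that followed it.
    std::vector<std::pair<uint64_t, int>> pool;
    for (size_t i = 0; i + n < hist.size(); ++i) {
        pool.push_back({hash_ngram(hist, i, n), hist[i + n]});
    }
    // Speculate: extend the context token by token while the pool has a hit;
    // the draft length is variable (it stops at the first miss).
    std::vector<int> ctx(hist.end() - n, hist.end());
    for (int step = 0; step < 4; ++step) {
        const uint64_t h = hash_ngram(ctx, ctx.size() - n, n);
        bool hit = false;
        for (auto & [ph, tok] : pool) {
            if (ph == h) { printf("%d ", tok); ctx.push_back(tok); hit = true; break; }
        }
        if (!hit) break;
    }
    printf("\n"); // prints: 4 1 2 3
}
```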

Some characteristics:

- Lightweight (~16 MB)
- Constant memory and complexity
- Can generate variable draft lengths (i.e. m is not fixed)

Currently, a single hash pool is shared across all server slots, so different requests can benefit from each other.

**Sample usage:**

```
# notes:
# - small `n` are not recommended
# - MoEs require long drafts
# - dense models: can reduce `--draft-min` and `--draft-max`

llama-server ... --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64
```

Applications:

- Iterating over a block of text/code (e.g. in llama.vim)
- Reasoning models (when they have to repeat their thinking in the final answer)
- Summarization

Example video:

- See #19164

### Differences between ngram-simple, ngram-map and ngram-mod

- ngram-simple looks for a previous matching n-gram and inserts the following m-gram.
- ngram-map-k also looks for a previous matching n-gram and inserts the following m-gram, but uses an internal hash map of the n-grams in the current context window.
- ngram-mod uses a hash pool which is shared across all server slots. The hash pool maps an n-gram hash to the next token (not to the next m-gram as in ngram-map).

## Command-Line Options

If a draft model is combined with draftless decoding, the draftless decoding takes precedence.

```
--draft, --draft-n, --draft-max N        number of tokens to draft for speculative decoding (default: 16)
                                         (env: LLAMA_ARG_DRAFT_MAX)
--draft-min, --draft-n-min N             minimum number of draft tokens to use for speculative decoding
                                         (default: 0)
                                         (env: LLAMA_ARG_DRAFT_MIN)
[...]
--spec-type [none|ngram-cache|ngram-simple|ngram-map-k|ngram-map-k4v|ngram-mod]
                                         type of speculative decoding to use when no draft model is provided
                                         (default: none)
--spec-ngram-size-n N                    ngram size N for ngram-simple/ngram-map speculative decoding, length
                                         of lookup n-gram (default: 12)
--spec-ngram-size-m N                    ngram size M for ngram-simple/ngram-map speculative decoding, length
                                         of draft m-gram (default: 48)
--spec-ngram-check-rate N                ngram check rate for ngram-simple/ngram-map speculative decoding
                                         (default: 1)
--spec-ngram-min-hits N                  minimum hits for ngram-map speculative decoding (default: 1)
```

### `--spec-type TYPE`

Selects a type of speculative decoding that works without a draft model.

| Type | Description |
|------|-------------|
| `none` | No speculative decoding (default) |
| `ngram-cache` | Use n-gram cache lookup |
| `ngram-simple` | Use simple n-gram pattern matching |
| `ngram-map-k` | Use n-gram pattern matching with n-gram keys |
| `ngram-map-k4v` | Use n-gram pattern matching with n-gram keys and up to four m-gram values (experimental) |
| `ngram-mod` | Use a basic n-gram hasher with a shared pool |

**Example:** A server instance used to refactor source code.

```bash
./llama-server [...] --spec-type ngram-simple
```

### `--spec-ngram-size-n N`

Sets the size N of the lookup n-gram for n-gram-map-based speculative decoding. The n-gram size N determines how many consecutive tokens to look back when searching for matching patterns.

### `--spec-ngram-size-m M`

Sets the size M of the draft m-gram for n-gram-map-based speculative decoding. The m-gram size determines how many tokens to draft when a match is found. Larger values can provide more speedup but may reduce the acceptance rate.
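
**Example:** Matching longer patterns and drafting longer continuations (the values here are illustrative, not a recommendation):

```
llama-server [...] --spec-type ngram-map-k --spec-ngram-size-n 16 --spec-ngram-size-m 32
```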

### `--spec-ngram-check-rate R`

This option targets performance in cases where the n-gram lookup in the history is too costly: a lookup is executed only every R tokens (default: 1, i.e. every token).
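
For example, to run the lookup only every fourth token:

```
llama-server [...] --spec-type ngram-simple --spec-ngram-check-rate 4
```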

### `--spec-ngram-min-hits H`

This option defines how often a key n-gram has to appear in the token history before its continuation is used as a draft (default: 1).

## Statistics

Each speculative decoding implementation prints statistics.

```
draft acceptance rate = 0.57576 ( 171 accepted / 297 generated)
statistics ngram_simple: #calls = 15, #gen drafts = 5, #acc drafts = 5, #gen tokens = 187, #acc tokens = 73
statistics draft: #calls = 10, #gen drafts = 10, #acc drafts = 10, #gen tokens = 110, #acc tokens = 98
```

```
draft acceptance rate = 0.70312 ( 90 accepted / 128 generated)
statistics ngram_mod: #calls = 810, #gen drafts = 15, #acc drafts = 15, #gen tokens = 960, #acc tokens = 730, dur(b,g,a) = 0.149, 0.347, 0.005 ms
```

- `#calls`: number of calls of this implementation
- `#gen drafts`: number of drafts generated by this implementation
- `#acc drafts`: number of drafts accepted (at least partially) by the main model
- `#gen tokens`: number of tokens generated by this implementation (including rejected tokens)
- `#acc tokens`: number of tokens accepted by the main model
- `dur(b,g,a)`: durations of begin (new prompt), generation and accumulation (processing acceptance)