121 lines
5.9 KiB
Markdown
121 lines
5.9 KiB
Markdown
# Speculative Decoding
|
|
|
|
llama.cpp supports speculative decoding, a technique that can significantly accelerate token generation by predicting multiple tokens ahead of the main model.
|
|
|
|
[Speculative decoding](https://en.wikipedia.org/wiki/Transformer_(deep_learning)#Speculative_decoding) leverages the fact that computing n tokens in a batch (as in prompt processing) is more efficient than computing n sequentially (as in response generation). By generating draft tokens quickly and then verifying them with the target model in a single batch, this approach can achieve substantial speedups when the draft predictions are frequently correct.
|
|
|
|
## Implementations
|
|
|
|
The `llama-server` application supports several implementations of speculative decoding:
|
|
|
|
### Draft Model (`draft`)
|
|
|
|
A much smaller model (called the _draft model_) generates drafts.
|
|
A draft model is the most used approach in speculative decoding.
|
|
|
|
### n-gram Cache (`ngram-cache`)
|
|
|
|
An n-gram is a sequence of n tokens. The n-gram cache implementation maintains statistics about short n-gram sequences.
|
|
A draft is computed using probabilities derived from these statistics. External statistics can also be loaded from files for improved accuracy.
|
|
|
|
See:
|
|
|
|
- #5479, #6828, #6848
|
|
|
|
### n-gram Map (`ngram-simple`, `ngram-map-*`)
|
|
|
|
These implementations search the token history for patterns and use matching sequences as draft candidates.
|
|
They require no additional model but rely on patterns that have already appeared in the generated text.
|
|
An example to use this approach can be the rewriting of source code by a LLM.
|
|
|
|
#### n-gram Map (`ngram-simple`)
|
|
|
|
This implementation looks for the last n-gram in history that matches the current n-gram and creates a draft using the m tokens following the matched n-gram. It is the simplest self-speculative approach with minimal overhead.
|
|
|
|
#### n-gram Map Key (`ngram-map-k`)
|
|
|
|
This implementation looks for the current n-gram of size n (called the _key_) in the token history. If the key n-gram is followed by the same m tokens (called the _mgram_) multiple times, it creates a draft using these m tokens. This approach requires a minimum number of occurrences (argument `--spec-ngram-min-hits`) before generating drafts.
|
|
|
|
The number of accepted tokens is stored for each used n-gram.
|
|
|
|
#### n-gram Map Key-4-Values (`ngram-map-k4v`)
|
|
|
|
This experimental implementation looks for the current n-gram of size n (called the _key_) in the token history. For each key, up to four _values_ (n-grams of size m, called _mgrams_) are tracked. An internal statistic counts the occurrences of each mgram after the key n-gram. If one mgram is significantly more frequent than the others, it is used as the draft.
|
|
|
|
The number of accepted tokens is stored for each used n-gram.
|
|
|
|
**Example:** Server options to be used if there are a lot of longer repetitions.
|
|
```bash
|
|
llama-server [...] --spec-type ngram-map-k4v --spec-ngram-size-n 8 --spec-ngram-size-m 8 --spec-ngram-min-hits 2
|
|
```
|
|
|
|
|
|
## Command-Line Options
|
|
|
|
If a draft model is combined with a draftless decoding the draftless decoding has higher precedence.
|
|
|
|
```
|
|
--spec-type [none|ngram-cache|ngram-simple|ngram-map-k|ngram-map-k4v]
|
|
type of speculative decoding to use when no draft model is provided
|
|
(default: none)
|
|
--spec-ngram-size-n N ngram size N for ngram-simple/ngram-map speculative decoding, length
|
|
of lookup n-gram (default: 12)
|
|
--spec-ngram-size-m N ngram size M for ngram-simple/ngram-map speculative decoding, length
|
|
of draft m-gram (default: 48)
|
|
--spec-ngram-check-rate N ngram check rate for ngram-simple/ngram-map speculative decoding
|
|
(default: 1)
|
|
--spec-ngram-min-hits N minimum hits for ngram-map speculative decoding (default: 1)
|
|
```
|
|
|
|
### `--spec-type TYPE`
|
|
|
|
Specifies a type of speculative decoding without draft model.
|
|
|
|
| Type | Description |
|
|
|------|-------------|
|
|
| `none` | No speculative decoding (default) |
|
|
| `ngram-cache` | Use n-gram cache lookup |
|
|
| `ngram-simple` | Use simple n-gram pattern matching |
|
|
| `ngram-map-k` | Use n-gram pattern matching with n-gram-keys |
|
|
| `ngram-map-k4v` | Use n-gram pattern matching with n-gram-keys and up to four m-gram values (experimental) |
|
|
|
|
**Example:** Server-instance used to refactor source code.
|
|
```bash
|
|
./llama-server [...] --spec-type ngram-simple
|
|
```
|
|
|
|
### `--spec-ngram-size-n N`
|
|
|
|
Sets the size N of the lookup n-gram for n-gram map based speculative decoding.
|
|
The n-gram size N determines how many tokens in a row to look back when searching for matching patterns.
|
|
|
|
### `--spec-ngram-size-m M`
|
|
|
|
Sets the size M of the draft m-gram for n-gram map based speculative decoding.
|
|
The m-gram size determines how many tokens to draft when a match is found.
|
|
Larger values can provide more speedup but may reduce acceptance rate.
|
|
|
|
### `--spec-ngram-check-rate R`
|
|
|
|
This option aims at performance if the n-gram lookup in history is to costly. A lookup will be executed at every R tokens (default is 1, every token).
|
|
|
|
### `--spec-ngram-min-hits H`
|
|
|
|
This option defines how often a key has to appear in the token history to be used as a draft (default is 1).
|
|
|
|
## Statistics
|
|
Each speculative decoding implementation prints statistics.
|
|
|
|
```
|
|
draft acceptance rate = 0.57576 ( 171 accepted / 297 generated)
|
|
statistics ngram_simple: #calls = 15, #gen drafts = 5, #acc drafts = 5, #gen tokens = 187, #acc tokens = 73
|
|
statistics draft: #calls = 10, #gen drafts = 10, #acc drafts = 10, #gen tokens = 110, #acc tokens = 98
|
|
```
|
|
|
|
- `#calls`: number of calls of this implementations
|
|
- `#gen drafts`: number of drafts generated by this implementation
|
|
- `#acc drafts`: number of drafts accepted (partially) by the main model
|
|
- `#gen tokens`: number of tokens generated by this implementation (including rejected tokens)
|
|
- `#acc tokens`: number of tokens accepted by the main model
|
|
|