# llama.cpp/example/simple-token-healing
This example extends [simple](../simple/README.md) with token healing (a.k.a. token alignment).
`usage: ./simple-token-healing MODEL_PATH [PROMPT] [TOKEN_HEALING 0|1|d1|d|r[N]]`
## Examples
`0`: Without token healing (same as running `./simple ...`):
```bash
./simple-token-healing ./models/phi-2/ggml-model-q4_0.gguf "print('Hel" 0
...
main: n_len = 32, n_ctx = 2048, n_kv_req = 32
print('Helping the customer')
...
```
`1`: Roll back the last token and constrain the next token to start with the bytes of the chopped-off token [0, 2]:
```bash
./simple-token-healing ./models/phi-2/ggml-model-q4_0.gguf "print('Hel" 1
...
token_healing: prefix = 'Hel' (1 tokens)
[ 12621] 'Hel'
[ 15496] 'Hello'
[ 22087] 'Help'
[ 28254] 'Hell'
[ 47429] 'Helper'
main: n_len = 32, n_ctx = 2048, n_kv_req = 32
print('Hello, World!')
...
```
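The filtering step of mode `1` can be sketched in Python. This is a toy illustration with a made-up vocabulary, not the actual llama.cpp implementation:

```python
# Single-token healing (mode 1), toy sketch: drop the last prompt token,
# then restrict sampling to vocabulary pieces that start with its text.
# TOY_VOCAB is illustrative, not the real model vocabulary.
TOY_VOCAB = ["Hel", "Hello", "Help", "Hell", "Helper", "He", "world"]

def heal_candidates(vocab, removed_piece):
    # Keep only tokens whose text begins with the removed token's bytes.
    return [t for t in vocab if t.startswith(removed_piece)]

print(heal_candidates(TOY_VOCAB, "Hel"))
# ['Hel', 'Hello', 'Help', 'Hell', 'Helper']
```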
`d1`: Roll back multiple tokens, until no token can cover the prompt's suffix, and do a single constrained decoding step [2]:
```bash
./simple-token-healing ./models/phi-2/ggml-model-q4_0.gguf "print('Hello, worl" d1
...
token_healing: prefix = ' worl' (2 tokens)
[ 995] ' world'
[ 8688] ' worldwide'
[ 11621] ' worlds'
[ 29081] ' worldview'
[ 43249] ' worldly'
main: n_len = 32, n_ctx = 2048, n_kv_req = 32
print('Hello, world!')
...
```
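The dynamic rollback used by `d1` (and `d`) can be sketched as a loop that keeps removing trailing prompt tokens while some vocabulary token could still extend the removed text. A minimal Python sketch, assuming a hypothetical toy vocabulary and tokenization:

```python
# Dynamic rollback (modes d1/d), toy sketch: remove trailing prompt tokens
# while some vocabulary token strictly extends the removed suffix.
def dynamic_rollback(vocab, prompt_tokens):
    prefix = ""
    while prompt_tokens:
        candidate = prompt_tokens[-1] + prefix
        # Stop once no vocabulary token can cover the prompt's suffix.
        if not any(t.startswith(candidate) and t != candidate for t in vocab):
            break
        prefix = candidate
        prompt_tokens = prompt_tokens[:-1]
    return prompt_tokens, prefix

# Hypothetical tokenization of "print('Hello, worl":
vocab = [" world", " worldwide", " worlds", "lo", "Hello"]
print(dynamic_rollback(vocab, ["print", "('", "Hello", ",", " wor", "l"]))
# (['print', "('", 'Hello', ','], ' worl')
```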
`d`: Roll back multiple tokens, until no token can cover the prompt's suffix, but allow multiple decoding steps:
```bash
./simple-token-healing ./models/phi-2/ggml-model-q4_0.gguf "print('Hello, worl" d
...
token_healing: prefix = ' worl' (2 tokens)
main: n_len = 32, n_ctx = 2048, n_kv_req = 32
print('Hello,
token_healing: prefix = ' worl'
[ 220] ' '
[ 266] ' w'
[ 476] ' wor'
[ 995] ' world'
[ 8688] ' worldwide'
[ 11621] ' worlds'
[ 24486] ' wo'
[ 29081] ' worldview'
[ 43249] ' worldly'
world!')
...
```
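Note that the candidate list above contains tokens that are themselves prefixes of the remaining text (`' '`, `' w'`, `' wor'`) as well as tokens that complete it (`' world'`, ...). A toy Python sketch of that multi-step constraint, not the actual implementation:

```python
# Multi-step healing constraint (modes d / r[N]), toy sketch.
def allowed_tokens(vocab, prefix):
    # A token is allowed if it extends the remaining prefix, or if it is
    # itself a prefix of it (consuming part of it over several steps).
    return [t for t in vocab if t.startswith(prefix) or prefix.startswith(t)]

def consume(prefix, token):
    # Constraint left over after emitting `token`; empty means healed.
    return prefix[len(token):] if prefix.startswith(token) else ""

vocab = [" ", " w", " wor", " world", " worldwide", "Hello"]
print(allowed_tokens(vocab, " worl"))
# [' ', ' w', ' wor', ' world', ' worldwide']
print(consume(" worl", " wor"))
# 'l'
```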
`r[N]`: Roll back `N` tokens and constrain decoding to the bytes of those tokens (multiple decoding steps) [1]. The paper [1] recommends `N=3`:
```bash
./simple-token-healing ./models/phi-2/ggml-model-q4_0.gguf "print('Hello, worl" r3
...
token_healing: prefix = ', worl' (3 tokens)
main: n_len = 32, n_ctx = 2048, n_kv_req = 32
print('Hello
token_healing: prefix = ', worl'
[ 11] ','
,
token_healing: prefix = ' worl'
[ 220] ' '
[ 266] ' w'
[ 476] ' wor'
[ 995] ' world'
[ 8688] ' worldwide'
[ 11621] ' worlds'
[ 24486] ' wo'
[ 29081] ' worldview'
[ 43249] ' worldly'
world!')
...
```
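The `r[N]` rollback itself is simpler than `d`: it unconditionally removes the last `N` tokens and heals against their concatenated text. A toy sketch with hypothetical token strings, not the real tokenizer:

```python
# Fixed rollback (mode r[N]), toy sketch: always remove the last n tokens;
# their concatenated text becomes the healing prefix for decoding.
def fixed_rollback(prompt_tokens, n):
    assert 0 < n <= len(prompt_tokens)
    return prompt_tokens[:-n], "".join(prompt_tokens[-n:])

print(fixed_rollback(["print", "('", "Hello", ",", " wor", "l"], 3))
# (['print', "('", 'Hello'], ', worl')
```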
## Sources
- [0] https://github.com/guidance-ai/guidance/blob/main/notebooks/art_of_prompt_design/prompt_boundaries_and_token_healing.ipynb
- [1] https://arxiv.org/abs/2403.08688
- [2] https://arxiv.org/abs/2402.01035