## duo
This is a demo of an approach to distributed evaluation/speculation using RPC.

It is a fairly minimal app, and many more improvements could be made.
### Idea
The idea comes from the discussion here: https://github.com/ggerganov/llama.cpp/discussions/6853#discussioncomment-9473494.

When we run a large model and distribute the evaluation across multiple devices, they still evaluate the model sequentially: while one device is computing its part of the layers, the other sits idle. With two identical devices and an equal model split, this leaves half of the compute on the table, assuming an individual use case (e.g. personal chat).

We can use this idle compute to speculate with a smaller draft model and then evaluate the larger speculated sequence of tokens in a single pass.
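Below is a minimal, self-contained sketch of a speculate-then-verify loop of this kind. The two "models" are toy stand-ins (hypothetical helper functions, not the duo or llama.cpp API): the draft greedily proposes a few tokens, the main model checks the whole proposal, and the longest matching prefix (plus the main model's token at the first mismatch) is accepted.

```python
# Toy sketch of speculate-then-verify; both "models" are deterministic stand-ins.
# In the real demo both are llama.cpp models evaluated over RPC backends.

def draft_next(ctx):
    # hypothetical cheap draft model: next token is a fixed function of the last one
    return (ctx[-1] * 31 + 7) % 100

def main_next(ctx):
    # hypothetical main model: mostly agrees with the draft, occasionally diverges
    t = (ctx[-1] * 31 + 7) % 100
    return t if len(ctx) % 4 else (t + 1) % 100

def generate(prompt, n_new, k=4):
    out = list(prompt)
    while len(out) < len(prompt) + n_new:
        # 1) speculate: draft k tokens greedily (linear speculation)
        spec = []
        for _ in range(k):
            spec.append(draft_next(out + spec))
        # 2) verify: the main model predicts the token following out, out + spec[:1], ...
        #    (a single batched pass in the real demo); keep the matching prefix and
        #    the main model's own token at the first mismatch
        accepted = []
        for i in range(k):
            target = main_next(out + spec[:i])
            accepted.append(target)      # greedy: the main model's token is final
            if spec[i] != target:        # draft diverged: drop the rest of the proposal
                break
        out += accepted                  # 1..k tokens per main-model pass
    return out[len(prompt):len(prompt) + n_new]

print(generate([1, 2, 3], 12))
```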
This demo is fairly limited, more like a proof of concept:

1. It expects exactly two instances running the main model.
2. Only one of these instances speculates while the main model is idle, so we still waste 25% of the compute (see the accounting sketch below).
3. Speculation is linear (a single draft sequence).
4. Sampling is greedy.

Improving on the above points is probably easier to do in separate changes, to make reviewing easier.
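To see where the 25% in point 2 comes from, here is a back-of-the-envelope accounting (assuming two identical devices, an equal split, and a single stream of requests):

```python
# Compute accounting in device-time per unit of wall clock. With a sequential
# 2-way split each device is busy half the time; only one of the two idle
# halves is currently used for drafting.
devices = 2
total = devices * 1.0          # available compute
main  = devices * 0.5          # main model keeps each device busy half the time
draft = 1 * 0.5                # one device drafts during its idle half
wasted = total - main - draft
print(f"wasted: {wasted / total:.0%}")  # -> 25%
```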
### Setup
Devices:

* Apple M1 16GB
* Apple M2 24GB
* Connected with a Thunderbolt 4 cable, using TCP/IP over Thunderbolt.

Models:

* Meta-Llama-3-8B-Instruct-fp16 as the main model
* Meta-Llama-3-8B-Instruct-v2.Q2_K as the speculation (draft) model

We could use different models as well.

On M1, start a single RPC server (it will serve part of the main model):

```
# -p: port to listen on, -m: memory budget for the backend, in MB
bin/rpc-server -p 10001 -m 10000
```
On M2, start two RPC servers: one for the other part of the main model and one, with a smaller memory budget, for the draft model:

```
# serves the other part of the main model
bin/rpc-server -p 10001 -m 10000
# serves the draft (speculation) model
bin/rpc-server -p 20002 -m 4000
```
Also on M2, run `duo`, pointing `--rpc` at the two main-model servers and `--rpcd` at the draft server:

```
./bin/duo -m ../../llms/gguf/Meta-Llama-3-8B-Instruct-fp16.gguf -md ../../llms/gguf/Meta-Llama-3-8B-Instruct-v2.Q2_K.gguf --rpc "localhost:10001,169.254.77.16:10001" -p "Please illustrate the difference between concurrency and parallelism in python." -n 256 -ngl 99 -t 1 --rpcd "localhost:20002"

...
llama_print_timings: load time = 42068.04 ms
...
llama_print_timings: total time = 42792.74 ms / 302 tokens
```
The reported eval timings seem a little off (the load time is nearly equal to the total time).

Compare that with running `main` with the same two RPC servers:

```
./bin/main -m ../../llms/gguf/Meta-Llama-3-8B-Instruct-fp16.gguf --rpc "localhost:10001,169.254.77.16:10001" -p "Please illustrate the difference between concurrency and parallelism in python." -n 256 -ngl 99
...
llama_print_timings: load time = 42305.61 ms
...
llama_print_timings: total time = 58555.49 ms / 268 tokens
```
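From the totals reported above (keeping in mind that the timing breakdown looks off), the end-to-end throughput works out roughly to:

```python
# Tokens per second computed from the two totals printed above.
duo_tps  = 302 / 42.79274   # duo with speculation
main_tps = 268 / 58.55549   # plain main over the same two RPC servers
print(f"duo: {duo_tps:.1f} tok/s, main: {main_tps:.1f} tok/s, "
      f"speedup: {duo_tps / main_tps:.2f}x")   # ~7.1 vs ~4.6 tok/s, ~1.5x
```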
Extra:

GPU utilization for both devices