Move the loop over the tokens inside the attention block and
then create kHeads * num_tokens threads.
This improves multi-threaded speed only for the 2B Gemma model, but for
consistency we also move the loop over the tokens inside the Griffin
recurrent layer and the FFW layer. It also prepares for using the
MatMul operation later.
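To make the change concrete, here is a minimal, self-contained sketch of the idea under illustrative assumptions: ParallelFor and AttendOneHead are hypothetical stand-ins rather than gemma.cpp APIs, and kHeads is a placeholder value. The point is only that folding the token loop into the parallel task index turns kHeads tasks per token into kHeads * num_tokens independent tasks.
```
// Sketch only: illustrates folding the token loop into the task index so
// there are kHeads * num_tokens parallel tasks instead of kHeads per token.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: runs fn(task) for task in [0, num_tasks) across threads.
void ParallelFor(size_t num_tasks, const std::function<void(size_t)>& fn) {
  const size_t hw = std::thread::hardware_concurrency();
  const size_t num_threads = std::min(num_tasks, hw ? hw : size_t{1});
  std::vector<std::thread> workers;
  for (size_t t = 0; t < num_threads; ++t) {
    workers.emplace_back([&, t] {
      for (size_t task = t; task < num_tasks; task += num_threads) fn(task);
    });
  }
  for (auto& w : workers) w.join();
}

constexpr size_t kHeads = 8;  // illustrative value, not a real model config

// Hypothetical per-head, per-token attention kernel.
void AttendOneHead(size_t head, size_t token) { /* ... */ }

// BEFORE: token loop outside; only kHeads parallel tasks per token.
void AttentionBefore(size_t num_tokens) {
  for (size_t token = 0; token < num_tokens; ++token) {
    ParallelFor(kHeads, [&](size_t head) { AttendOneHead(head, token); });
  }
}

// AFTER: token loop inside; kHeads * num_tokens tasks keep more threads busy.
void AttentionAfter(size_t num_tokens) {
  ParallelFor(kHeads * num_tokens, [&](size_t task) {
    AttendOneHead(/*head=*/task % kHeads, /*token=*/task / kHeads);
  });
}
```
With a long prefill (e.g. 1600 tokens), this exposes far more independent tasks than worker threads, which is where the prefill speedups below come from.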
Benchmark results (summarization with 1600 tokens for prefill
and essay writing with 500 tokens for generation):
```
Prefill speed

Num threads   BEFORE       AFTER
32            61.76 t/s    65.08 t/s
64            89.46 t/s    98.62 t/s
```
Files changed:
- benchmark.cc
- compress_weights.cc
- configs.h
- gemma.cc
- gemma.h
- gemma_test.cc
- ops.h
- ops_test.cc
- run.cc