gemma.cpp/gemma
Zoltan Szabadka 3d72f17261 Use more parallelism in attention block in prefill mode.
Move the loop over the tokens inside the attention block and
then run kHeads * num_tokens tasks in parallel.

This improves multi-threaded speed only for the 2B Gemma model,
but for consistency we also move the loop over the tokens inside
the Griffin recurrent layer and the FFW layer. This is also a
preparation for using the MatMul operation later.
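For illustration, here is a minimal sketch of the flattened task layout described above. It assumes a Highway-style `hwy::ThreadPool` whose `Run(begin, end, func)` dispatches one task index per invocation; `PrefillAttention` and `AttendOneHead` are hypothetical names for this sketch, not the actual gemma.cpp functions.

```cpp
// Minimal sketch, not the actual gemma.cpp implementation: the per-token loop
// is folded into the parallel index space, so during prefill the pool sees
// kHeads * num_tokens independent tasks instead of only kHeads per token.
#include <cstddef>
#include <cstdint>

#include "hwy/contrib/thread_pool/thread_pool.h"

// Hypothetical stand-in for the real per-(token, head) attention kernel.
void AttendOneHead(size_t token, size_t head, size_t thread) {
  // QK dot products, softmax, and the weighted sum over V for one head
  // would go here.
  (void)token; (void)head; (void)thread;
}

void PrefillAttention(size_t num_tokens, size_t kHeads, hwy::ThreadPool& pool) {
  // One flat index space of kHeads * num_tokens tasks keeps more threads
  // busy than parallelizing over heads alone for each token.
  pool.Run(0, kHeads * num_tokens, [&](uint64_t task, size_t thread) {
    const size_t token = static_cast<size_t>(task) / kHeads;
    const size_t head = static_cast<size_t>(task) % kHeads;
    AttendOneHead(token, head, thread);
  });
}
```

Folding both dimensions into one index is what lets a 32- or 64-thread pool stay saturated even when kHeads alone is small, as is the case for the 2B model.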

Benchmark results (summarization with 1600 tokens for prefill
and essay writing with 500 tokens for generation):

```
                   Prefill speed
Num threads      BEFORE       AFTER
32               61.76 t/s    65.08 t/s
64               89.46 t/s    98.62 t/s
```
2024-05-03 13:23:07 +00:00
benchmark.cc Simplify threading: remove the use of inner_pool. 2024-04-29 16:07:30 +00:00
compress_weights.cc Improve documentation for compress_weights flags 2024-04-29 06:49:50 -07:00
configs.h Add per-thread even_odd storage for #166. 2024-04-30 10:42:23 -07:00
gemma.cc Use more parallelism in attention block in prefill mode. 2024-05-03 13:23:07 +00:00
gemma.h Use more parallelism in the QKV projections in MQA mode. 2024-04-30 13:10:14 +00:00
gemma_test.cc Simplify threading: remove the use of inner_pool. 2024-04-29 16:07:30 +00:00
ops.h Merge pull request #166 from samkaufman:deinterleave-vecs 2024-05-03 05:23:31 -07:00
ops_test.cc Use more parallelism in the QKV projections of the MHA block. 2024-05-02 13:46:45 +00:00
run.cc Simplify threading: remove the use of inner_pool. 2024-04-29 16:07:30 +00:00