gemma.cpp/gemma
Zoltan Szabadka 9a2682d544 Use more parallelism in the QKV projections of the MHA block.
We compute all three projections with one MatVec and then copy
the kv part to the cache.

Benchmark results for the 7b-it model, which uses MHA blocks (summarization
with 1600 tokens for prefill and essay writing with 500 tokens for generation):

```
                 Prefill speed               Generation speed
Num threads      BEFORE       AFTER          BEFORE       AFTER
32               13.75 t/s    14.80 t/s       9.22 t/s     9.77 t/s
64               19.89 t/s    24.83 t/s      12.46 t/s    13.66 t/s
```
2024-05-02 13:46:45 +00:00
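
The approach described in the commit message above (one MatVec for all three projections, then a copy of the KV part into the cache) can be sketched roughly as follows. This is a minimal single-threaded illustration and not the code from gemma.cc: `AttentionDims`, `FusedQKV`, the simplified `MatVec` signature, the stacked weight layout (all Q rows, then all K rows, then all V rows) and the flat cache layout are assumptions made for this example. Presumably the benefit is that a single larger product gives a thread pool more rows to distribute than several smaller ones.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dimensions for one MHA block; names are illustrative only.
struct AttentionDims {
  std::size_t model_dim;  // length of the input activation vector
  std::size_t heads;      // number of attention heads (MHA: K/V per head)
  std::size_t qkv_dim;    // per-head size of each of Q, K and V
};

// Plain row-major matrix-vector product: out[r] = dot(mat row r, vec).
// A parallel version would split the rows across threads; the more rows
// there are in one call, the more work there is to distribute.
inline void MatVec(const float* mat, const float* vec, std::size_t rows,
                   std::size_t cols, float* out) {
  for (std::size_t r = 0; r < rows; ++r) {
    float sum = 0.0f;
    for (std::size_t c = 0; c < cols; ++c) sum += mat[r * cols + c] * vec[c];
    out[r] = sum;
  }
}

// Computes Q, K and V for one token with a single MatVec over the stacked
// weights, then copies the K and V part into the cache slot for `pos`.
// Assumed layouts (not taken from the repository):
//   w_qkv:    (3 * heads * qkv_dim) rows x model_dim cols, Q rows first,
//             then K rows, then V rows.
//   kv_cache: one slot of 2 * heads * qkv_dim floats per position.
inline void FusedQKV(const AttentionDims& d, const std::vector<float>& w_qkv,
                     const std::vector<float>& x, std::size_t pos,
                     std::vector<float>& qkv, std::vector<float>& kv_cache) {
  const std::size_t q_rows = d.heads * d.qkv_dim;
  const std::size_t kv_rows = 2 * q_rows;
  assert(x.size() == d.model_dim);
  assert(w_qkv.size() == (q_rows + kv_rows) * d.model_dim);
  assert(qkv.size() == q_rows + kv_rows);
  assert(kv_cache.size() >= (pos + 1) * kv_rows);

  // One product yields [Q | K | V] in a single output buffer.
  MatVec(w_qkv.data(), x.data(), q_rows + kv_rows, d.model_dim, qkv.data());

  // K and V are contiguous after Q, so one copy moves both into the cache.
  std::copy(qkv.begin() + q_rows, qkv.end(),
            kv_cache.begin() + pos * kv_rows);
}
```

With `heads = 1`, `qkv_dim = 2` and `model_dim = 2`, for instance, `qkv` holds six values after the call, and the last four of them are also written to the cache slot for `pos`.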
benchmark.cc Simplify threading: remove the use of inner_pool. 2024-04-29 16:07:30 +00:00
compress_weights.cc Improve documentation for compress_weights flags 2024-04-29 06:49:50 -07:00
configs.h Add per-thread even_odd storage for #166. 2024-04-30 10:42:23 -07:00
gemma.cc Use more parallelism in the QKV projections of the MHA block. 2024-05-02 13:46:45 +00:00
gemma.h Use more parallelism in the QKV projections in MQA mode. 2024-04-30 13:10:14 +00:00
gemma_test.cc Simplify threading: remove the use of inner_pool. 2024-04-29 16:07:30 +00:00
ops.h Use more parallelism in the QKV projections of the MHA block. 2024-05-02 13:46:45 +00:00
ops_test.cc Use more parallelism in the QKV projections of the MHA block. 2024-05-02 13:46:45 +00:00
run.cc Simplify threading: remove the use of inner_pool. 2024-04-29 16:07:30 +00:00