gemma.cpp/gemma
Zoltan Szabadka 0afa480d90 Use more parallelism in the final output of the attention block.
We use MatVec instead of MatVecLoop for the per-head dense layers,
because the rows of the matrix offer more parallelism than the
number of heads. This will become even more efficient once we
rearrange the weights and can use a single MatVec operation.

Benchmark results (summarization with 1600 tokens for prefill
and essay writing with 500 tokens for generation):

```
                     Prefill speed              Generation speed
Num threads      BEFORE       AFTER          BEFORE       AFTER
32               58.24 t/s    61.79 t/s      32.11 t/s    32.62 t/s
64               83.62 t/s    92.00 t/s      41.10 t/s    41.80 t/s
```
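
A rough sketch of the idea, in plain C++ rather than gemma.cpp's actual code: MatVec and MatVecLoop are named in the commit, but the signatures, the ParallelFor helper, the weight layouts and the example dimensions below are hypothetical. The first function parallelizes over heads (at most kHeads tasks); the second is the fused, row-parallel form the commit says becomes possible once the weights are rearranged.

```
// Illustrative sketch only: not the actual gemma.cpp API.
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical parallel-for, standing in for the thread pool used in gemma.cpp.
template <class Func>
void ParallelFor(size_t begin, size_t end, size_t num_threads, Func func) {
  std::vector<std::thread> workers;
  const size_t chunk = (end - begin + num_threads - 1) / num_threads;
  for (size_t t = 0; t < num_threads; ++t) {
    const size_t lo = begin + t * chunk;
    const size_t hi = std::min(end, lo + chunk);
    if (lo >= hi) break;
    workers.emplace_back([=] {
      for (size_t i = lo; i < hi; ++i) func(i);
    });
  }
  for (auto& w : workers) w.join();
}

// Per-head tasks: at most kHeads independent work items. Each head owns a
// [kModelDim x kQKVDim] weight block and writes its partial result into its
// own slice of `scratch`; the partials are summed afterwards.
void AttentionOutPerHead(const float* w, const float* att, float* out,
                         size_t kHeads, size_t kModelDim, size_t kQKVDim,
                         size_t num_threads) {
  std::vector<float> scratch(kHeads * kModelDim);
  ParallelFor(0, kHeads, num_threads, [&](size_t head) {
    const float* wh = w + head * kModelDim * kQKVDim;  // this head's weights
    const float* xh = att + head * kQKVDim;            // this head's output
    float* oh = scratch.data() + head * kModelDim;
    for (size_t r = 0; r < kModelDim; ++r) {
      float sum = 0.0f;
      for (size_t c = 0; c < kQKVDim; ++c) sum += wh[r * kQKVDim + c] * xh[c];
      oh[r] = sum;
    }
  });
  for (size_t r = 0; r < kModelDim; ++r) {
    float acc = 0.0f;
    for (size_t h = 0; h < kHeads; ++h) acc += scratch[h * kModelDim + r];
    out[r] = acc;
  }
}

// Fused mat-vec: with the per-head weights rearranged so that each of the
// kModelDim rows holds all heads' columns contiguously, the whole projection
// becomes one [kModelDim x kHeads*kQKVDim] mat-vec, and the parallelism is
// over kModelDim rows (e.g. 2048) instead of kHeads (e.g. 8).
void AttentionOutFused(const float* w_fused, const float* att, float* out,
                       size_t kHeads, size_t kModelDim, size_t kQKVDim,
                       size_t num_threads) {
  const size_t cols = kHeads * kQKVDim;
  ParallelFor(0, kModelDim, num_threads, [&](size_t r) {
    float sum = 0.0f;
    for (size_t c = 0; c < cols; ++c) sum += w_fused[r * cols + c] * att[c];
    out[r] = sum;  // each output row is written by exactly one thread
  });
}
```

Row-parallelism also means each output element is owned by a single thread, so no cross-head accumulation or synchronization is needed.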
2024-05-02 09:30:07 +00:00
benchmark.cc Simplify threading: remove the use of inner_pool. 2024-04-29 16:07:30 +00:00
compress_weights.cc Improve documentation for compress_weights flags 2024-04-29 06:49:50 -07:00
configs.h Add per-thread even_odd storage for #166. 2024-04-30 10:42:23 -07:00
gemma.cc Use more parallelism in the final output of the attention block. 2024-05-02 09:30:07 +00:00
gemma.h Use more parallelism in the QKV projections in MQA mode. 2024-04-30 13:10:14 +00:00
gemma_test.cc Simplify threading: remove the use of inner_pool. 2024-04-29 16:07:30 +00:00
ops.h Add per-thread even_odd storage for #166. 2024-04-30 10:42:23 -07:00
ops_test.cc Add per-thread even_odd storage for #166. 2024-04-30 10:42:23 -07:00
run.cc Simplify threading: remove the use of inner_pool. 2024-04-29 16:07:30 +00:00