gemma.cpp

Zoltan Szabadka 9a2682d544	2024-05-02 13:46:45 +00:00

Use more parallelism in the QKV projections of the MHA block.

We compute all three projections with one MatVec and then copy the kv part to the cache.

Benchmark results for 7b-it model that uses MHA blocks (summarization with 1600 tokens for prefill and essay writing with 500 tokens for generation):

```
               Prefill speed            Generation speed
Num threads    BEFORE       AFTER       BEFORE       AFTER
32             13.75 t/s    14.80 t/s    9.22 t/s     9.77 t/s
64             19.89 t/s    24.83 t/s   12.46 t/s    13.66 t/s
```
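The idea in the commit message (one MatVec for all three projections, then a copy of the kv part into the cache) can be illustrated with a minimal sketch. This is not the code from gemma.cc or ops.h: the names `MatVecSketch`, `KVCacheSketch` and `FusedQKVSketch` are hypothetical, the weight layout (Wq rows, then Wk rows, then Wv rows, row-major) is an assumption, and the loop is serial where the real code would hand rows to a thread pool.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not the gemma.cpp API): out = W * x, where W is
// row-major with `rows` rows and x.size() columns.
std::vector<float> MatVecSketch(const std::vector<float>& w,
                                const std::vector<float>& x, size_t rows) {
  const size_t cols = x.size();
  assert(w.size() == rows * cols);
  std::vector<float> out(rows, 0.0f);
  // Rows are independent, so a thread pool could split this loop. Stacking
  // Wq, Wk and Wv gives one loop with three times as many rows to distribute.
  for (size_t r = 0; r < rows; ++r) {
    float sum = 0.0f;
    for (size_t c = 0; c < cols; ++c) sum += w[r * cols + c] * x[c];
    out[r] = sum;
  }
  return out;
}

// Hypothetical cache: qkv_dim floats of K and of V per token position.
struct KVCacheSketch {
  std::vector<float> k;
  std::vector<float> v;
};

// Computes q, k and v with a single MatVec over the stacked weights
// [Wq; Wk; Wv] (assumed layout), copies k and v into the cache slot for
// `pos`, and returns q.
std::vector<float> FusedQKVSketch(const std::vector<float>& w_qkv,
                                  const std::vector<float>& x, size_t qkv_dim,
                                  size_t pos, KVCacheSketch& cache) {
  const std::vector<float> qkv = MatVecSketch(w_qkv, x, 3 * qkv_dim);
  assert((pos + 1) * qkv_dim <= cache.k.size());
  assert((pos + 1) * qkv_dim <= cache.v.size());
  // The kv part of the fused output goes to the cache.
  std::copy(qkv.begin() + qkv_dim, qkv.begin() + 2 * qkv_dim,
            cache.k.begin() + pos * qkv_dim);
  std::copy(qkv.begin() + 2 * qkv_dim, qkv.end(),
            cache.v.begin() + pos * qkv_dim);
  return std::vector<float>(qkv.begin(), qkv.begin() + qkv_dim);
}
```

The extra copy is cheap next to the MatVec, and under these assumptions a single larger MatVec gives the threads more rows to share, which is consistent with the larger gain reported at 64 threads.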
..
benchmark.cc	Simplify threading: remove the use of inner_pool.	2024-04-29 16:07:30 +00:00
compress_weights.cc	Improve documentation for compress_weights flags	2024-04-29 06:49:50 -07:00
configs.h	Add per-thread even_odd storage for #166.	2024-04-30 10:42:23 -07:00
gemma.cc	Use more parallelism in the QKV projections of the MHA block.	2024-05-02 13:46:45 +00:00
gemma.h	Use more parallelism in the QKV projections in MQA mode.	2024-04-30 13:10:14 +00:00
gemma_test.cc	Simplify threading: remove the use of inner_pool.	2024-04-29 16:07:30 +00:00
ops.h	Use more parallelism in the QKV projections of the MHA block.	2024-05-02 13:46:45 +00:00
ops_test.cc	Use more parallelism in the QKV projections of the MHA block.	2024-05-02 13:46:45 +00:00
run.cc	Simplify threading: remove the use of inner_pool.	2024-04-29 16:07:30 +00:00