gemma.cpp/gemma
Zoltan Szabadka 0afa480d90 Use more parallelism in the final output of the attention block.
We use MatVec instead of MatVecLoop for the per-head dense layers,
because the rows of the matrix offer more parallelism than the
number of heads. This will become even more efficient once we
rearrange the weights and can use a single MatVec operation.

Benchmark results (summarization with 1600 tokens for prefill
and essay writing with 500 tokens for generation):

```
                     Prefill speed              Generation speed
Num threads      BEFORE       AFTER          BEFORE       AFTER
32               58.24 t/s    61.79 t/s      32.11 t/s    32.62 t/s
64               83.62 t/s    92.00 t/s      41.10 t/s    41.80 t/s
```
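
A rough sketch of the idea, in plain C++ rather than gemma.cpp's actual code: MatVec and MatVecLoop are named in the commit, but the signatures, the ParallelFor helper, the weight layouts and the example dimensions below are hypothetical. The first function parallelizes over heads (at most kHeads tasks); the second is the fused, row-parallel form the commit says becomes possible once the weights are rearranged.

```
// Illustrative sketch only: not the actual gemma.cpp API.
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical parallel-for, standing in for the thread pool used in gemma.cpp.
template <class Func>
void ParallelFor(size_t begin, size_t end, size_t num_threads, Func func) {
  std::vector<std::thread> workers;
  const size_t chunk = (end - begin + num_threads - 1) / num_threads;
  for (size_t t = 0; t < num_threads; ++t) {
    const size_t lo = begin + t * chunk;
    const size_t hi = std::min(end, lo + chunk);
    if (lo >= hi) break;
    workers.emplace_back([=] {
      for (size_t i = lo; i < hi; ++i) func(i);
    });
  }
  for (auto& w : workers) w.join();
}

// Per-head tasks: at most kHeads independent work items. Each head owns a
// [kModelDim x kQKVDim] weight block and writes its partial result into its
// own slice of `scratch`; the partials are summed afterwards.
void AttentionOutPerHead(const float* w, const float* att, float* out,
                         size_t kHeads, size_t kModelDim, size_t kQKVDim,
                         size_t num_threads) {
  std::vector<float> scratch(kHeads * kModelDim);
  ParallelFor(0, kHeads, num_threads, [&](size_t head) {
    const float* wh = w + head * kModelDim * kQKVDim;  // this head's weights
    const float* xh = att + head * kQKVDim;            // this head's output
    float* oh = scratch.data() + head * kModelDim;
    for (size_t r = 0; r < kModelDim; ++r) {
      float sum = 0.0f;
      for (size_t c = 0; c < kQKVDim; ++c) sum += wh[r * kQKVDim + c] * xh[c];
      oh[r] = sum;
    }
  });
  for (size_t r = 0; r < kModelDim; ++r) {
    float acc = 0.0f;
    for (size_t h = 0; h < kHeads; ++h) acc += scratch[h * kModelDim + r];
    out[r] = acc;
  }
}

// Fused mat-vec: with the per-head weights rearranged so that each of the
// kModelDim rows holds all heads' columns contiguously, the whole projection
// becomes one [kModelDim x kHeads*kQKVDim] mat-vec, and the parallelism is
// over kModelDim rows (e.g. 2048) instead of kHeads (e.g. 8).
void AttentionOutFused(const float* w_fused, const float* att, float* out,
                       size_t kHeads, size_t kModelDim, size_t kQKVDim,
                       size_t num_threads) {
  const size_t cols = kHeads * kQKVDim;
  ParallelFor(0, kModelDim, num_threads, [&](size_t r) {
    float sum = 0.0f;
    for (size_t c = 0; c < cols; ++c) sum += w_fused[r * cols + c] * att[c];
    out[r] = sum;  // each output row is written by exactly one thread
  });
}
```

Row-parallelism also means each output element is owned by a single thread, so no cross-head accumulation or synchronization is needed.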
2024-05-02 09:30:07 +00:00
benchmark.cc Simplify threading: remove the use of inner_pool. 2024-04-29 16:07:30 +00:00
compress_weights.cc Improve documentation for compress_weights flags 2024-04-29 06:49:50 -07:00
configs.h Add per-thread even_odd storage for #166. 2024-04-30 10:42:23 -07:00
gemma.cc Use more parallelism in the final output of the attention block. 2024-05-02 09:30:07 +00:00
gemma.h Use more parallelism in the QKV projections in MQA mode. 2024-04-30 13:10:14 +00:00
gemma_test.cc Simplify threading: remove the use of inner_pool. 2024-04-29 16:07:30 +00:00
ops.h Add per-thread even_odd storage for #166. 2024-04-30 10:42:23 -07:00
ops_test.cc Add per-thread even_odd storage for #166. 2024-04-30 10:42:23 -07:00
run.cc Simplify threading: remove the use of inner_pool. 2024-04-29 16:07:30 +00:00