Move the loop over the tokens inside the attention block and
then create kHeads * num_tokens threads.
This improves multi-threaded speed only for the 2B Gemma model, but for
consistency we also move the loop over the tokens inside the Griffin
recurrent layer and the FFW layer. It also prepares for using the
MatMul operation later.
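To make the change concrete, here is a minimal, self-contained sketch of the idea under illustrative assumptions: ParallelFor and AttendOneHead are hypothetical stand-ins rather than gemma.cpp APIs, and kHeads is a placeholder value. The point is only that folding the token loop into the parallel task index turns kHeads tasks per token into kHeads * num_tokens independent tasks.
```
// Sketch only: illustrates folding the token loop into the task index so
// there are kHeads * num_tokens parallel tasks instead of kHeads per token.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper: runs fn(task) for task in [0, num_tasks) across threads.
void ParallelFor(size_t num_tasks, const std::function<void(size_t)>& fn) {
  const size_t hw = std::thread::hardware_concurrency();
  const size_t num_threads = std::min(num_tasks, hw ? hw : size_t{1});
  std::vector<std::thread> workers;
  for (size_t t = 0; t < num_threads; ++t) {
    workers.emplace_back([&, t] {
      for (size_t task = t; task < num_tasks; task += num_threads) fn(task);
    });
  }
  for (auto& w : workers) w.join();
}

constexpr size_t kHeads = 8;  // illustrative value, not a real model config

// Hypothetical per-head, per-token attention kernel.
void AttendOneHead(size_t head, size_t token) { /* ... */ }

// BEFORE: token loop outside; only kHeads parallel tasks per token.
void AttentionBefore(size_t num_tokens) {
  for (size_t token = 0; token < num_tokens; ++token) {
    ParallelFor(kHeads, [&](size_t head) { AttendOneHead(head, token); });
  }
}

// AFTER: token loop inside; kHeads * num_tokens tasks keep more threads busy.
void AttentionAfter(size_t num_tokens) {
  ParallelFor(kHeads * num_tokens, [&](size_t task) {
    AttendOneHead(/*head=*/task % kHeads, /*token=*/task / kHeads);
  });
}
```
With a long prefill (e.g. 1600 tokens), this exposes far more independent tasks than worker threads, which is where the prefill speedups below come from.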
Benchmark results (summarization with 1600 tokens for prefill
and essay writing with 500 tokens for generation):
```
Prefill speed

Num threads   BEFORE       AFTER
32            61.76 t/s    65.08 t/s
64            89.46 t/s    98.62 t/s
```
Files changed:
- benchmark.cc
- compress_weights.cc
- configs.h
- gemma.cc
- gemma.h
- gemma_test.cc
- ops.h
- ops_test.cc
- run.cc