gemma.cpp

Zoltan Szabadka afaca4efa8 Use more parallelism in the QKV projections in MQA mode.		2024-04-30 13:10:14 +00:00

Instead of MatVecLoop, we use MatVec and we combine k and v into one 2 * kQKVDim long vector so that K and V projections can be combined into one MatVec operation.

Benchmark results (summarization with 1600 tokens for prefill and essay writing with 500 tokens for generation):

```
                  Prefill speed             Generation speed
Num threads     BEFORE       AFTER        BEFORE       AFTER
  4            9.81 t/s     9.96 t/s     8.39 t/s     8.46 t/s
 18           31.50 t/s    36.67 t/s    23.10 t/s    25.83 t/s
 32           45.36 t/s    58.91 t/s    27.60 t/s    31.25 t/s
 64           57.72 t/s    80.64 t/s    35.40 t/s    39.76 t/s
```
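The K/V fusion described in the commit message can be illustrated with a minimal sketch. It is not the code in gemma.cc: `MatVecSketch` and `FusedKV` are hypothetical names, the row-major `std::vector<float>` layout is an assumption, and the real `MatVec` works on compressed weights and a thread pool. The sketch only shows why stacking the K and V weight rows lets one matrix-vector product produce a single vector of length 2 * qkv_dim holding k followed by v, which presumably gives the thread pool twice as many rows to split in one call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: plain row-major matrix-vector product,
// out[r] = sum over c of mat[r * cols + c] * x[c].
std::vector<float> MatVecSketch(const std::vector<float>& mat, size_t rows,
                                size_t cols, const std::vector<float>& x) {
  assert(mat.size() == rows * cols);
  assert(x.size() == cols);
  std::vector<float> out(rows, 0.0f);
  for (size_t r = 0; r < rows; ++r) {
    float sum = 0.0f;
    for (size_t c = 0; c < cols; ++c) {
      sum += mat[r * cols + c] * x[c];
    }
    out[r] = sum;
  }
  return out;
}

// Hypothetical helper: stacks the K weights (qkv_dim x model_dim) on top of
// the V weights (same shape) and runs ONE product. The result has length
// 2 * qkv_dim: the first half is k, the second half is v.
// Assumption: a real implementation would store the stacked matrix once
// instead of rebuilding it on every call.
std::vector<float> FusedKV(const std::vector<float>& w_k,
                           const std::vector<float>& w_v, size_t qkv_dim,
                           size_t model_dim, const std::vector<float>& x) {
  std::vector<float> w_kv(w_k);
  w_kv.insert(w_kv.end(), w_v.begin(), w_v.end());
  return MatVecSketch(w_kv, 2 * qkv_dim, model_dim, x);
}
```

Calling `FusedKV(w_k, w_v, qkv_dim, model_dim, x)` returns the same values as two separate `MatVecSketch` calls on `w_k` and `w_v`, concatenated.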
benchmark.cc	Simplify threading: remove the use of inner_pool.	2024-04-29 16:07:30 +00:00
compress_weights.cc	Improve documentation for compress_weights flags	2024-04-29 06:49:50 -07:00
configs.h	Support absolute positional embeddings from vanilla transformer	2024-04-25 09:32:14 -07:00
gemma.cc	Use more parallelism in the QKV projections in MQA mode.	2024-04-30 13:10:14 +00:00
gemma.h	Use more parallelism in the QKV projections in MQA mode.	2024-04-30 13:10:14 +00:00
gemma_test.cc	Simplify threading: remove the use of inner_pool.	2024-04-29 16:07:30 +00:00
ops.h	Move code to gemma/ so we can remove error-prone copybara: comments.	2024-04-09 04:45:42 -07:00
ops_test.cc	Move code to gemma/ so we can remove error-prone copybara: comments.	2024-04-09 04:45:42 -07:00
run.cc	Simplify threading: remove the use of inner_pool.	2024-04-29 16:07:30 +00:00