mirror of https://github.com/google/gemma.cpp.git
Use MatVec instead of MatVecLoop, and store k and v together in one
vector of length 2 * kQKVDim so that the K and V projections are
computed by a single MatVec operation.
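The idea of merging the two projections can be sketched as follows. This is a minimal illustration, not the code from gemma.cc or ops.h: `NaiveMatVec`, `StackKV`, and `ProjectKV` are hypothetical helpers, the dimensions are made-up small values, and the assumption that the K weight rows sit directly above the V weight rows is ours, not something the commit message states.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical dimensions for illustration; not the values from configs.h.
constexpr size_t kModelDim = 4;
constexpr size_t kQKVDim = 2;

// Naive row-major matrix-vector product:
//   out[r] = sum over c of mat[r * cols + c] * vec[c].
// A stand-in for the idea only; the real MatVec has a different signature.
std::vector<float> NaiveMatVec(const std::vector<float>& mat, size_t rows,
                               size_t cols, const std::vector<float>& vec) {
  assert(mat.size() == rows * cols);
  assert(vec.size() == cols);
  std::vector<float> out(rows, 0.0f);
  for (size_t r = 0; r < rows; ++r) {
    for (size_t c = 0; c < cols; ++c) {
      out[r] += mat[r * cols + c] * vec[c];
    }
  }
  return out;
}

// Assumed layout: the K weight rows are followed directly by the V weight
// rows, giving one (2 * kQKVDim) x kModelDim matrix.
std::vector<float> StackKV(const std::vector<float>& wk,
                           const std::vector<float>& wv) {
  std::vector<float> wkv(wk);
  wkv.insert(wkv.end(), wv.begin(), wv.end());
  return wkv;
}

// One product over the stacked weights. In the result, elements
// [0, kQKVDim) hold k and elements [kQKVDim, 2 * kQKVDim) hold v.
std::vector<float> ProjectKV(const std::vector<float>& wkv,
                             const std::vector<float>& x) {
  return NaiveMatVec(wkv, 2 * kQKVDim, kModelDim, x);
}
```

Calling `ProjectKV(StackKV(wk, wv), x)` gives the same values as two separate `NaiveMatVec` calls on `wk` and `wv`, but in a single pass that a threaded implementation can split across more rows at once.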
Benchmark results (summarization with 1600 tokens for prefill
and essay writing with 500 tokens for generation):
```
             Prefill speed            Generation speed
Num threads  BEFORE      AFTER        BEFORE      AFTER
 4            9.81 t/s    9.96 t/s     8.39 t/s    8.46 t/s
18           31.50 t/s   36.67 t/s    23.10 t/s   25.83 t/s
32           45.36 t/s   58.91 t/s    27.60 t/s   31.25 t/s
64           57.72 t/s   80.64 t/s    35.40 t/s   39.76 t/s
```
| File |
|---|
| benchmark.cc |
| compress_weights.cc |
| configs.h |
| gemma.cc |
| gemma.h |
| gemma_test.cc |
| ops.h |
| ops_test.cc |
| run.cc |