This is still in progress / experimental: it is currently implemented
only for the standard Gemma MQA attention layers, and the backward
pass is not yet parallelized.
Because the backward pass needs the activations of every layer, the
forward pass was also reimplemented with a new activation data
structure.
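As a rough sketch of the idea (all type and field names below are hypothetical, not the actual gemma.cpp types), the new structure keeps one activation record per layer alive until the backward pass:
```
#include <cstddef>
#include <vector>

// One layer's saved activations for a single forward pass.
struct LayerActivations {
  std::vector<float> attention_out;  // output of the attention block
  std::vector<float> ffw_hidden;     // hidden state of the FFW block
};

// Activations for all layers and all tokens, kept until the backward pass.
struct ForwardPassActivations {
  ForwardPassActivations(size_t num_layers, size_t model_dim, size_t ffw_dim,
                         size_t num_tokens)
      : layers(num_layers) {
    for (LayerActivations& layer : layers) {
      layer.attention_out.resize(num_tokens * model_dim);
      layer.ffw_hidden.resize(num_tokens * ffw_dim);
    }
  }
  std::vector<LayerActivations> layers;
};
```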
Using a restricted Kaggle account, this code:
- Adds an Ubuntu 20.04 build (required for glibc compat with Kaggle infra)
- Uploads the ubuntu-20.04 build and supporting library to a Kaggle dataset using a fork of `push-kaggle-dataset`
- Creates a new version of a Kaggle notebook that loads artifacts from the Kaggle Model Hub, along with the newly updated dataset, and validates a 2b-it-sfp model.
- Runs the notebook and fails with an error if the process does not complete, raises an exception, or produces an invalid response.
TODO: add tests / capabilities to the smoke tests used by the notebook.
Remove extra Dot() overload
MatVecAdd now always adds; use MatVecT<kAdd> when the add is conditional (sketched below).
Remove unused MatVecAddLoop and MatVecLoop
No longer tsan-verify even_odd
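A sketch of the kAdd pattern (the signature here is an assumption for illustration, not the actual gemma.cpp declaration): a compile-time template flag chooses between overwriting and accumulating, so one implementation covers both behaviors:
```
#include <cstddef>

template <bool kAdd>
void MatVecT(const float* mat, const float* vec, size_t rows, size_t cols,
             float* out) {
  for (size_t r = 0; r < rows; ++r) {
    float dot = 0.0f;
    for (size_t c = 0; c < cols; ++c) {
      dot += mat[r * cols + c] * vec[c];
    }
    if constexpr (kAdd) {
      out[r] += dot;  // accumulate into the existing output
    } else {
      out[r] = dot;   // overwrite the output
    }
  }
}
```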
Move the loop over the tokens inside the attention block and
create kHeads * num_tokens parallel tasks (see the sketch below).
This improves multi-threaded speed only for the 2b Gemma model,
but for consistency we also move the token loop inside the
griffin recurrent layer and the FFW layer. This is also a
preparation for using the MatMul operation later.
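A hedged sketch of the flattened parallelism (AttendOneHead is a hypothetical stand-in for the per-head attention work; hwy::ThreadPool is the Highway thread pool): each (head, token) pair becomes one task, so the pool sees kHeads * num_tokens units of work instead of kHeads per token:
```
#include <cstddef>
#include <cstdint>
#include "hwy/contrib/thread_pool/thread_pool.h"

constexpr size_t kHeads = 8;  // example value; the real count is model-dependent

// Hypothetical per-(head, token) unit of attention work; stubbed out here.
void AttendOneHead(size_t head, size_t token) { (void)head; (void)token; }

void Attention(size_t num_tokens, hwy::ThreadPool& pool) {
  // One task per (head, token) pair instead of a serial outer token loop.
  pool.Run(0, kHeads * num_tokens, [](uint64_t task, size_t /*thread*/) {
    const size_t head = static_cast<size_t>(task % kHeads);
    const size_t token = static_cast<size_t>(task / kHeads);
    AttendOneHead(head, token);
  });
}
```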
Benchmark results (summarization with 1600 tokens for prefill
and essay writing with 500 tokens for generation):
```
              Prefill speed
Num threads   BEFORE      AFTER
32            61.76 t/s   65.08 t/s
64            89.46 t/s   98.62 t/s
```
We compute all three projections (q, k and v) with one MatVec and
then copy the kv part to the cache, as sketched below.
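A sketch under assumptions (the layout and names are illustrative, not the exact gemma.cpp code): the q, k and v weight matrices are stacked row-wise so a single matrix-vector product yields all three projections, after which the k/v rows are copied into the cache slot for the current token:
```
#include <cstddef>
#include <cstring>

// qkv_weights has (q_dim + 2 * kv_dim) rows of model_dim columns:
// the q, k and v projection matrices stacked on top of each other.
void QKVProjection(const float* qkv_weights, const float* x, size_t model_dim,
                   size_t q_dim, size_t kv_dim, float* qkv_out,
                   float* kv_cache_pos) {
  const size_t out_dim = q_dim + 2 * kv_dim;
  // One matrix-vector product covers all three projections.
  for (size_t r = 0; r < out_dim; ++r) {
    float dot = 0.0f;
    for (size_t c = 0; c < model_dim; ++c) {
      dot += qkv_weights[r * model_dim + c] * x[c];
    }
    qkv_out[r] = dot;
  }
  // Copy the k and v parts into the KV cache slot for this token.
  std::memcpy(kv_cache_pos, qkv_out + q_dim, 2 * kv_dim * sizeof(float));
}
```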
Benchmark results for the 7b-it model, which uses MHA blocks (summarization with
1600 tokens for prefill and essay writing with 500 tokens for generation):
```
              Prefill speed           Generation speed
Num threads   BEFORE      AFTER       BEFORE      AFTER
32            13.75 t/s   14.80 t/s    9.22 t/s    9.77 t/s
64            19.89 t/s   24.83 t/s   12.46 t/s   13.66 t/s
```
We use MatVec instead of MatVecLoop for the per-head dense layers
because the rows of the matrix offer more parallelism than the
number of heads. This will become even more efficient once we
rearrange the weights so that a single MatVec operation covers all
heads; see the sketch below.
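A sketch of the difference (names and signatures are assumptions for illustration): a row-parallel MatVec exposes all rows to the thread pool, far more parallelism than one task per head, and once the per-head weights are stacked into one matrix a single call covers every head:
```
#include <cstddef>
#include <cstdint>
#include "hwy/contrib/thread_pool/thread_pool.h"

// Row-parallel matrix-vector product: one pool task per output row.
void MatVecRows(const float* mat, const float* vec, size_t rows, size_t cols,
                float* out, hwy::ThreadPool& pool) {
  pool.Run(0, rows, [&](uint64_t r, size_t /*thread*/) {
    float dot = 0.0f;
    for (size_t c = 0; c < cols; ++c) {
      dot += mat[r * cols + c] * vec[c];
    }
    out[r] = dot;
  });
}

// All heads' projection weights stacked row-wise: (num_heads * out_dim) rows.
// One call parallelizes over every row of every head at once, instead of
// offering the pool only num_heads units of work.
void PerHeadDense(const float* stacked_weights, const float* x,
                  size_t num_heads, size_t out_dim, size_t in_dim, float* out,
                  hwy::ThreadPool& pool) {
  MatVecRows(stacked_weights, x, num_heads * out_dim, in_dim, out, pool);
}
```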
Benchmark results (summarization with 1600 tokens for prefill
and essay writing with 500 tokens for generation):
```
              Prefill speed           Generation speed
Num threads   BEFORE      AFTER       BEFORE      AFTER
32            58.24 t/s   61.79 t/s   32.11 t/s   32.62 t/s
64            83.62 t/s   92.00 t/s   41.10 t/s   41.80 t/s
```