Split common and weights into separate lib
Remove common-inl (does not have to be SIMD code), activations.cc
Centralize switch(Model) to avoid duplication
Move CompressWeightsT to compress_weights.cc
Move LoadWeights to weights.cc
PiperOrigin-RevId: 640869202
This is still in progress / experimental, currently it is only
implemented for normal gemma MQA attention layers, and no
parallelism is added yet for backward pass.
Since we need to remember all activations from all layers, the
forward pass was also reimplemented with a new activation data
structure.
We only used inner_pool in the prefill FFW function, and there we
can achieve sufficient parallelism on the rows of the matrix-vector
multiplications.
Benchmark results on a 1600-token summarization task:
```
Prefill speed
Num threads BEFORE AFTER
4 9.24 t/s 9.76 t/s
18 31.41 t/s 31.16 t/s
32 31.41 t/s 45.13 t/s
64 31.03 t/s 57.85 t/s
```