gemma.cpp/ops
Jan Wassenberg 301dc8067a Major MatMul update, 1.9-2.3x speedup on Zen4 via bf16 mul
Supports converting all weight/activation formats to native MulT (bf16/f32)
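
For readers unfamiliar with bf16, here is a minimal sketch of what converting f32 inputs to bf16 and multiplying with f32 accumulation looks like in scalar code. The names `F32ToBF16`, `BF16ToF32`, and `DotBF16` are hypothetical and chosen for illustration only. They are not the gemma.cpp or Highway API, and the actual kernels are vectorized.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helpers for illustration; NOT the gemma.cpp/Highway API.
// bf16 is the upper 16 bits of an IEEE-754 binary32 value:
// 1 sign bit, 8 exponent bits, 7 mantissa bits.
inline uint16_t F32ToBF16(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  // Round to nearest, ties to even. NaN inputs are not special-cased here.
  const uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);
  return static_cast<uint16_t>((bits + rounding) >> 16);
}

// Widening bf16 to f32 is exact: append 16 zero bits.
inline float BF16ToF32(uint16_t b) {
  const uint32_t bits = static_cast<uint32_t>(b) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}

// Scalar dot product with bf16 inputs and an f32 accumulator.
inline float DotBF16(const uint16_t* a, const uint16_t* b, size_t n) {
  float sum = 0.0f;
  for (size_t i = 0; i < n; ++i) {
    sum += BF16ToF32(a[i]) * BF16ToF32(b[i]);
  }
  return sum;
}
```

Small integers such as 1 to 6 are exactly representable in bf16, so a dot product of `{1, 2, 3}` and `{4, 5, 6}` converted this way returns exactly 32.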

Also:
- ConstMat/MutableMat for const correctness
- Move RowVectorBatch to allocator.h so it can be used from Matmul
- Add matmul.h so MatMulEnv can be used from Activations
- Remove kMaxThreads, detect from PerClusterPools
- Build fix: -inl.h files must be textual_hdrs, and highway.h should precede -inl.h

```
zen4 new
64, 24576, 3072, add=0, MatTA=bf16, MatTB=sfp:   616.6 GFLOPS.
64, 3072, 24576, add=0, MatTA=bf16, MatTB=sfp:   460.7 GFLOPS.
64, 24576, 3072, add=0, MatTA=f32, MatTB=sfp:    598.6 GFLOPS.
64, 3072, 24576, add=0, MatTA=f32, MatTB=sfp:    435.6 GFLOPS.

zen4 old
64, 24576, 3072, add=0, MatTA=f32, MatTB=sfp:    257.5 GFLOPS.
64, 3072, 24576, add=0, MatTA=f32, MatTB=sfp:    231.9 GFLOPS.
```
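
The speedup range in the commit title can be reproduced from the numbers above. The sketch below assumes the three leading integers are the matrix dimensions and that GFLOPS counts 2·M·K·N operations per MatMul (one multiply and one add per term). Both are assumptions, not stated in the commit, and the helper names are hypothetical.

```cpp
#include <cstdint>

// Hypothetical helpers; the 2*M*K*N convention is an assumption.
// Total floating-point operations for one MatMul with the given dimensions.
constexpr double MatMulFlops(uint64_t m, uint64_t k, uint64_t n) {
  return 2.0 * m * k * n;
}

// Ratio of new to old throughput.
constexpr double Speedup(double new_gflops, double old_gflops) {
  return new_gflops / old_gflops;
}

// Milliseconds per MatMul call implied by a throughput figure.
constexpr double ElapsedMs(double flops, double gflops) {
  return flops / (gflops * 1e9) * 1e3;
}

// f32 activations, same shapes as the "zen4 old" rows:
//   Speedup(598.6, 257.5) is about 2.32
//   Speedup(435.6, 231.9) is about 1.88
```

With matching f32 activations the ratios are roughly 1.88 and 2.32, consistent with the quoted 1.9-2.3x. Comparing the new bf16-activation rows against the old f32 rows gives roughly 1.99 and 2.39.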

PiperOrigin-RevId: 663729812
2024-08-16 07:52:20 -07:00
gemma_matvec_test.cc Fix build issues when tests are enabled 2024-08-12 18:50:23 +02:00
matmul-inl.h Major MatMul update, 1.9-2.3x speedup on Zen4 via bf16 mul 2024-08-16 07:52:20 -07:00
matmul.h Major MatMul update, 1.9-2.3x speedup on Zen4 via bf16 mul 2024-08-16 07:52:20 -07:00
matmul_test.cc Major MatMul update, 1.9-2.3x speedup on Zen4 via bf16 mul 2024-08-16 07:52:20 -07:00
matvec-inl.h Split matmul into matvec; add large matrix benchmark 2024-07-30 08:29:11 -07:00
ops-inl.h 1.03-1.08x decode speedup: precompute Rope theta, fuse 2024-08-09 01:23:24 -07:00
ops_test.cc Split up ops.h into ops/ops-inl and matmul-inl 2024-07-19 11:21:48 -07:00