Commit Graph

48 Commits

Jan Wassenberg 85cac13fb1 Split up ops.h into ops/ops-inl and matmul-inl
PiperOrigin-RevId: 654068303
2024-07-19 11:21:48 -07:00
Daniel Keysers e87e65ca45 Add scale parameter to MatMul.
Add accessor to CompressedArray that asserts the scale is 1 and use it.

PiperOrigin-RevId: 653604840
2024-07-18 06:58:56 -07:00
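
A minimal sketch of what such a scale-1 accessor could look like; the actual CompressedArray interface in gemma.cpp may differ, and the names here are illustrative:

```
#include <cassert>
#include <cstddef>

// Hypothetical accessor: callers that cannot apply a scale get the raw data,
// but only after asserting that no scaling is actually required.
template <typename T, size_t kCapacity>
class CompressedArray {
 public:
  const T* WithScale1() const {
    assert(scale_ == 1.0f);  // callers must not silently drop the scale
    return data_;
  }
  float scale() const { return scale_; }

 private:
  T data_[kCapacity] = {};
  float scale_ = 1.0f;
};
```
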
Jan Wassenberg 992a2cbbc0 De-templatize Activations, add RowVectorBatch class
Also remove most kBatchSize args.

PiperOrigin-RevId: 653185525
2024-07-17 04:38:15 -07:00
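
The point of a class like RowVectorBatch is to carry the batch size at run time instead of threading a kBatchSize template parameter through every signature. A minimal sketch with illustrative names:

```
#include <cstddef>
#include <vector>

// Owns batch_size rows of len values each; replaces kBatchSize-templated
// storage with a runtime-sized allocation.
template <typename T>
class RowVectorBatch {
 public:
  RowVectorBatch(size_t batch_size, size_t len)
      : batch_size_(batch_size), len_(len), data_(batch_size * len) {}

  T* Batch(size_t i) { return data_.data() + i * len_; }  // row i
  size_t BatchSize() const { return batch_size_; }
  size_t Len() const { return len_; }

 private:
  size_t batch_size_;
  size_t len_;
  std::vector<T> data_;
};
```
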
Jan Wassenberg edaf61b983 SVE build fix: avoid capturing vectors directly.
Also prefer the V typedef over auto.

PiperOrigin-RevId: 651423685
2024-07-11 08:43:56 -07:00
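
For context: SVE vectors are sizeless types, so they cannot be stored in a lambda capture (captures become class members, and members need a size). An illustrative before/after, not the actual gemma.cpp code:

```
#include "hwy/highway.h"
namespace hn = hwy::HWY_NAMESPACE;

void Example() {
  const hn::ScalableTag<float> d;
  using V = hn::Vec<decltype(d)>;  // prefer the V typedef over auto
  const V mul = hn::Set(d, 2.0f);

  // Fails to compile on SVE targets: captures the sizeless vector `mul`.
  // auto scale = [mul](V x) { return hn::Mul(x, mul); };

  // OK: pass the vector as an argument instead of capturing it.
  auto scale = [](V x, V mul) { return hn::Mul(x, mul); };
  (void)scale;
  (void)mul;
}
```
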
Jan Wassenberg be765afce2 Simplify matmul: only 2 overloads
Also add StoreHorizontalSumsMaybeAdd wrapper function,
move MatMulSlowBatch into test.

1.02-1.06x speedup.

PiperOrigin-RevId: 651394791
2024-07-11 06:58:42 -07:00
Andrey Vlasov 3e92088595 Remove allocation from GEMM_4x4_Tile when decoding compressed weights by implementing SfpCodec::Dec2F and CompressTraits<T>::Decompress2 for all supported types (sketched after this entry). This also allows removing one of the specializations of GEMM_4x4_Tile, handling compressed MatB with one function. As before, even when MatA is bf16, computations use 32-bit registers.

Measurements for a 2b-it sfp-encoded model on an AMD Ryzen Threadripper PRO 3945WX 12-Cores:
baseline:
```
32.6254 prefill tokens / sec
8.91429 tokens / sec
115 milliseconds time to first token
```
this change:
```
54.3045 prefill tokens / sec
16.8191 tokens / sec
56 milliseconds time to first token
```
PiperOrigin-RevId: 651369694
2024-07-11 05:13:39 -07:00
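
The inner-loop idea described above: instead of decompressing a whole row of B into a heap-allocated scratch buffer, each step decodes just two register-width chunks and consumes them immediately, so no allocation is needed. The Decompress2 signature below is assumed for illustration, not taken from the actual code:

```
#include <cstddef>
#include "hwy/highway.h"
namespace hn = hwy::HWY_NAMESPACE;

// Dot product of one f32 row of A with one compressed row of B, decoding
// two vectors per iteration directly into registers.
template <typename Traits, typename Packed>
float DotCompressed(const float* HWY_RESTRICT a, const Packed* HWY_RESTRICT b,
                    size_t cols) {
  const hn::ScalableTag<float> df;
  using VF = hn::Vec<decltype(df)>;
  const size_t N = hn::Lanes(df);
  VF sum0 = hn::Zero(df), sum1 = hn::Zero(df);
  for (size_t c = 0; c < cols; c += 2 * N) {
    VF b0, b1;
    Traits::Decompress2(df, b, c, b0, b1);  // assumed API: decode 2 vectors
    sum0 = hn::MulAdd(hn::LoadU(df, a + c), b0, sum0);
    sum1 = hn::MulAdd(hn::LoadU(df, a + c + N), b1, sum1);
  }
  return hn::ReduceSum(df, hn::Add(sum0, sum1));
}
```
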
Jan Wassenberg e588a7f45d Add config for att/final cap, skip max-subtract. Fixes #278
Also update includes/deps for backprop/.

PiperOrigin-RevId: 648399222
2024-07-01 09:45:26 -07:00
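
For reference, attention/final-logit soft-capping squashes values into (-cap, cap). That bound is what makes the max-subtraction skippable: exp() of a capped value cannot overflow in the following softmax. A scalar sketch (cap values come from the per-model config):

```
#include <cmath>
#include <cstddef>

// Soft-cap: each output lies in (-cap, cap). The real code is vectorized.
void SoftCap(float* x, size_t n, float cap) {
  for (size_t i = 0; i < n; ++i) x[i] = cap * std::tanh(x[i] / cap);
}
```
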
Jan Wassenberg 48ebba8b7a Code cleanup
- Simplify template arg list, enable deduction
- missing hn:: on "Lanes"
- 1.0f suffix
- move RMSNormBatched into ops.h
- static constexpr -> constexpr
- concrete type instead of LayerT, WeightArrayT
- inline GetWeights
- remove if (runtime_config.verbosity
- merge AllocatePrefill and AllocateDecode
- remove bf_ffw_hidden

PiperOrigin-RevId: 644931277
2024-06-20 01:10:24 -07:00
Jan Wassenberg a07f60c9a1 1.15x 7b sfp prefill speedup: Matmul in attention
```
2b bf16:
prefill 114.456 -> 115.222
decode  16.8847 -> 16.9987

7b sfp:
prefill 18.8575 -> 21.7325
decode  5.68428 -> 5.79791
```

PiperOrigin-RevId: 644283676
2024-06-18 01:00:51 -07:00
Ray Smith e0afdfa8fb Added bias vector addition to MatMul
PiperOrigin-RevId: 643385381
2024-06-14 10:25:16 -07:00
Jan Wassenberg 29c0c574e6 Integrate matmul into FFW: 4.3x prefill speedup
```
before, bf16:
27.2929 prefill tokens / sec
17.2114 tokens / sec

after, bf16:
116.496 prefill tokens / sec
17.5391 tokens / sec
```

PiperOrigin-RevId: 643328437
2024-06-14 06:32:26 -07:00
Ray Smith 198326a682 Removed the now-redundant non-batch matmul
PiperOrigin-RevId: 643317187
2024-06-14 05:13:36 -07:00
Andrey Vlasov b17631c95f Implement a missing (bf16, f32) tiled MatMul kernel.
PiperOrigin-RevId: 643313676
2024-06-14 04:54:40 -07:00
Jan Wassenberg d3c6a45b59 Major duplicated code reduction in test/benchmarks
Helper functions to tokenize/wrap
Move LayersOutputFunc into RuntimeConfig
AcceptFunc passes the probability
Implement StringFromType using the parser, and verify results match

PiperOrigin-RevId: 643255119
2024-06-14 00:16:25 -07:00
Ray Smith ea525da967 Added MatMul_4x4_Batch, which is MatMul_4x4 with the first template arg moved to the first function arg so that the batch size (the number of A rows) can vary at run time.
PiperOrigin-RevId: 643017973
2024-06-13 09:05:40 -07:00
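
An illustrative before/after of that signature change (argument lists abbreviated; the real declarations carry more parameters):

```
#include <cstddef>

// Before: batch size (number of A rows) fixed at compile time.
template <size_t kBatchSize, typename TA, typename TB>
void MatMul_4x4(const TA* a, const TB* b, float* out);

// After: batch size is the first *function* argument, variable at run time.
template <typename TA, typename TB>
void MatMul_4x4_Batch(size_t batch_size, const TA* a, const TB* b, float* out);
```
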
Andrey Vlasov 38eb452b94 Support mixed (bf16, sfp) tiled MatMul. Same sfp-decompress strategy as in the (f32, sfp) tiled MatMul.

PiperOrigin-RevId: 642901844
2024-06-13 02:07:21 -07:00
The gemma.cpp Authors 2a0e6ee976 Fix numerical issue in Softcap by subtracting max.
Also update test threshold.

PiperOrigin-RevId: 642587468
2024-06-12 05:42:16 -07:00
The gemma.cpp Authors f467670de7 Implement float * SfpStream matmul by decompressing 4 * kColsA_RowsB-sized chunks of the second matrix.
PiperOrigin-RevId: 642533996
2024-06-12 01:11:59 -07:00
Jan Wassenberg 3e2396f98c Use Loader/AppArgs to construct the gemma_test model, simplify AcceptFunc
accept_token: allow a default, check whether it is empty before use
Allow mixing sample_func and stream_func; call the latter after the former.
Also fix missing includes/deps.
PiperOrigin-RevId: 642240012
2024-06-11 05:53:10 -07:00
Daniel Keysers c557ad23a8 Adds simple-loop versions of missing batched functions.
PiperOrigin-RevId: 642189741
2024-06-11 02:14:02 -07:00
Paul Chang 6c0be20fa6 Fix Softmax on SVE
PiperOrigin-RevId: 640947138
2024-06-06 10:39:30 -07:00
The gemma.cpp Authors 39d4115717 Implement mixed mode matmul: f32 * bf16
PiperOrigin-RevId: 640940962
2024-06-06 10:21:46 -07:00
Jan Wassenberg 57c2cd8b52 Simplifications: remove GemmaInterface and GemmaImpl
Split common and weights into separate lib
Remove common-inl (does not have to be SIMD code), activations.cc
Centralize switch(Model) to avoid duplication
Move CompressWeightsT to compress_weights.cc
Move LoadWeights to weights.cc

PiperOrigin-RevId: 640869202
2024-06-06 05:54:21 -07:00
Paul Chang 175e389c3c Revert to HWY_ASSERT for lane constraints, qualify hn::Add
PiperOrigin-RevId: 640193239
2024-06-04 10:10:18 -07:00
Jan Wassenberg 4f9155d8c6 Add bf16 matmul support, update naming+test
Avoid int32, which can easily overflow for large matrices.
Also fix IDE warning in sfp-inl.

PiperOrigin-RevId: 640149845
2024-06-04 07:41:46 -07:00
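
A concrete instance of the overflow risk mentioned above: flattened 2-D indices should use size_t, because row * cols can exceed INT32_MAX at moderate matrix sizes.

```
#include <cstddef>

// 60000 * 60000 = 3.6e9 > INT32_MAX (2147483647); int32 indexing would wrap.
inline size_t Index(size_t row, size_t col, size_t cols) {
  return row * cols + col;
}
```
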
Zoltan Szabadka 8567978541 Address review comments 2024-06-04 08:37:54 +00:00
Zoltan Szabadka 36e4d8bbfe Add first version of backpropagation support.
This is still in progress / experimental; currently it is only
implemented for normal Gemma MQA attention layers, and no
parallelism is added yet for the backward pass.

Since we need to remember all activations from all layers, the
forward pass was also reimplemented with a new activation data
structure.
2024-06-04 08:37:49 +00:00
Paul Chang 5feacf120c static_assert shape constraints in MatMul 4x4
PiperOrigin-RevId: 639069345
2024-05-31 10:02:45 -07:00
Phil Culliton c616abe628 Unrolled / tiled 4x4 MatMul
PiperOrigin-RevId: 638384686
2024-05-29 13:02:35 -07:00
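
The idea behind the 4x4 tiling: compute a 4x4 block of C per pass over k, so every loaded element of A and B is reused four times and the 16 partial sums stay in registers. A scalar sketch of one tile (the real kernel vectorizes across columns):

```
#include <cstddef>

// One 4x4 output tile of C = A * B. A is M x K, B is K x N, row-major.
void Tile4x4(const float* a, const float* b, float* c, size_t K, size_t N,
             size_t row, size_t col) {
  float acc[4][4] = {};  // 16 accumulators, kept in registers
  for (size_t k = 0; k < K; ++k) {
    for (size_t i = 0; i < 4; ++i) {
      const float av = a[(row + i) * K + k];  // reused for 4 columns
      for (size_t j = 0; j < 4; ++j) {
        acc[i][j] += av * b[k * N + col + j];
      }
    }
  }
  for (size_t i = 0; i < 4; ++i) {
    for (size_t j = 0; j < 4; ++j) c[(row + i) * N + col + j] = acc[i][j];
  }
}
```
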
Zoltan Szabadka 542ad0973a Fix normalization in Softmax function. 2024-05-24 08:58:31 +00:00
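
For reference, the standard form being fixed here: softmax must normalize by the sum of the exponentials. A numerically stable scalar version:

```
#include <algorithm>
#include <cmath>
#include <cstddef>

void Softmax(float* x, size_t n) {
  const float max = *std::max_element(x, x + n);
  float sum = 0.0f;
  for (size_t i = 0; i < n; ++i) sum += (x[i] = std::exp(x[i] - max));
  const float inv = 1.0f / sum;  // the normalization factor
  for (size_t i = 0; i < n; ++i) x[i] *= inv;
}
```
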
Apoorv Reddy 1aaf3b3aae Documenting the RoPE implementation.
PiperOrigin-RevId: 636175297
2024-05-22 08:26:29 -07:00
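
The implementation being documented is rotary position embeddings (RoPE). A scalar sketch of the usual Gemma-style formulation, in which feature i is paired with feature i + dim/2 and the pair is rotated by a position-dependent angle (the pairing convention and base 10000 are assumed here):

```
#include <cmath>
#include <cstddef>

void Rope(float* x, size_t dim, size_t pos) {
  const size_t half = dim / 2;
  for (size_t i = 0; i < half; ++i) {
    const float theta = pos * std::pow(10000.0f, -2.0f * i / dim);
    const float c = std::cos(theta), s = std::sin(theta);
    const float x0 = x[i], x1 = x[i + half];
    x[i] = x0 * c - x1 * s;  // rotate the (x0, x1) pair by theta
    x[i + half] = x0 * s + x1 * c;
  }
}
```
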
Jan Wassenberg 22fe9809ac Fix SVE build: add missing hn::
PiperOrigin-RevId: 632481097
2024-05-10 06:49:26 -07:00
Jan Wassenberg c5c9fc300c Enable even/odd for SFP. Refs #166
Disable it for float32 because there is not enough benefit.

PiperOrigin-RevId: 631788326
2024-05-08 07:09:06 -07:00
Jan Wassenberg f6d02b2870 Fix RecurrentGemma (refs #166) - one Dot was ignoring scale.
Remove extra Dot() overload
MatVecAdd always adds; use MatVecT<kAdd> if the add is conditional.
Remove unused MatVecAddLoop and MatVecLoop
No longer tsan-verify even_odd

PiperOrigin-RevId: 631377279
2024-05-07 04:40:42 -07:00
Phil Culliton 28ca001d5e Matmul and test functions
PiperOrigin-RevId: 630373984
2024-05-03 06:39:36 -07:00
Copybara-Service 6eeef2e2d9 Merge pull request #166 from samkaufman:deinterleave-vecs
PiperOrigin-RevId: 630360778
2024-05-03 05:23:31 -07:00
Zoltan Szabadka 9a2682d544 Use more parallelism in the QKV projections of the MHA block.
We compute all three projections with one MatVec and then copy
the kv part to the cache.

Benchmark results for the 7b-it model, which uses MHA blocks (summarization with
1600 tokens of prefill and essay writing with 500 tokens of generation):

```
                   Prefill speed                Generation speed
Num threads      BEFORE       AFTER            BEFORE       AFTER
32               13.75 t/s    14.80 t/s       9.22 t/s     9.77 t/s
64               19.89 t/s    24.83 t/s      12.46 t/s    13.66 t/s
```
2024-05-02 13:46:45 +00:00
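
A sketch of the change described above, with illustrative names: the Q, K and V weight matrices are stacked so that a single matrix-vector product yields all three projections, after which the K/V portion is copied into the cache. The scalar MatVec loop stands in for the project's parallelized implementation.

```
#include <algorithm>
#include <cstddef>

// w_qkv stacks W_q (q_dim rows) plus W_k and W_v (kv_dim rows each);
// one matrix-vector product replaces three.
void QKVProjection(const float* w_qkv, const float* x, size_t model_dim,
                   size_t q_dim, size_t kv_dim, float* qkv_out,
                   float* kv_cache_pos) {
  const size_t out_dim = q_dim + 2 * kv_dim;
  for (size_t r = 0; r < out_dim; ++r) {  // MatVec(w_qkv, x) -> qkv_out
    float sum = 0.0f;
    for (size_t c = 0; c < model_dim; ++c) {
      sum += w_qkv[r * model_dim + c] * x[c];
    }
    qkv_out[r] = sum;
  }
  // Copy the K and V parts into the KV cache for this position.
  std::copy(qkv_out + q_dim, qkv_out + out_dim, kv_cache_pos);
}
```
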
Sam Kaufman 4a6173d929 Remove unused vars. 2024-05-02 00:41:44 -07:00
Sam Kaufman 564937ede6 Merge branch 'dev' into deinterleave-vecs 2024-04-30 16:23:04 -07:00
Sam Kaufman 2829ef17ad Check for HWY_NATIVE_DOT_BF16. 2024-04-30 15:19:28 -07:00
Sam Kaufman 59ebecce22 Fix: specialized MatVecAdd was never called. 2024-04-30 15:17:27 -07:00
Jan Wassenberg 12fb2f05cf Add per-thread even_odd storage for #166.
Also inline ProjQ and ProjKV lambdas,
add missing includes/deps for ops_test.

PiperOrigin-RevId: 629460608
2024-04-30 10:42:23 -07:00
Sam Kaufman 6a78a23f4c Abstracted some MatVecAdd spec. dupes. 2024-04-29 16:23:38 -07:00
Sam Kaufman f608337fef Remove Bf16ToF32EO and use PromoteEvenTo and PromoteOddTo. 2024-04-29 14:13:07 -07:00
Sam Kaufman aa0b113214 (VecT*) to static_cast<VecT*>. 2024-04-29 12:53:47 -07:00
Sam Kaufman 5cb63346aa supports_eo -> kSupportsEvenOdd 2024-04-29 12:51:35 -07:00
Sam Kaufman 0816a1070d Even-odd layout MatVecs for bf16 weights. 2024-04-28 20:09:25 -07:00
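
The motivation for the even/odd ("deinterleaved") layout: PromoteEvenTo and PromoteOddTo widen the even- and odd-indexed bf16 lanes to f32 without any shuffle, so if the other operand is stored pre-deinterleaved to match, both halves feed straight into MulAdd. A sketch under assumed Highway usage, not the exact gemma.cpp kernel:

```
#include <cstddef>
#include "hwy/highway.h"
namespace hn = hwy::HWY_NAMESPACE;

// Dot product of bf16 weights with f32 activations stored in a matching
// even/odd layout (first N floats pair with the even lanes, next N with
// the odd lanes). Layout details here are illustrative.
float DotBF16EvenOdd(const hwy::bfloat16_t* w, const float* x_eo, size_t n) {
  const hn::ScalableTag<float> df;
  const hn::Repartition<hwy::bfloat16_t, decltype(df)> dbf;  // 2x the lanes
  using VF = hn::Vec<decltype(df)>;
  const size_t N = hn::Lanes(df);
  VF sum0 = hn::Zero(df), sum1 = hn::Zero(df);
  for (size_t i = 0; i < n; i += 2 * N) {
    const auto wv = hn::LoadU(dbf, w + i);    // 2*N bf16 weights
    const VF we = hn::PromoteEvenTo(df, wv);  // even lanes -> f32
    const VF wo = hn::PromoteOddTo(df, wv);   // odd lanes  -> f32
    sum0 = hn::MulAdd(we, hn::LoadU(df, x_eo + i), sum0);
    sum1 = hn::MulAdd(wo, hn::LoadU(df, x_eo + i + N), sum1);
  }
  return hn::ReduceSum(df, hn::Add(sum0, sum1));
}
```
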
Jan Wassenberg a982ec1287 Move code to gemma/ so we can remove error-prone copybara: comments.
Also fix includes and Lint warnings.

PiperOrigin-RevId: 623127487
2024-04-09 04:45:42 -07:00
Renamed from ops.h