Andrey Vlasov
38eb452b94
Support mixed (bf16, sfp) tiled MatMul. Same sfp-decompress strategy as in (f32, sfp) tiled MatMul.
PiperOrigin-RevId: 642901844
2024-06-13 02:07:21 -07:00
Daniel Keysers
6e67a6d8a9
Tiny cleanup: distinguish between "ids" and "pieces" in argument names when encoding.
...
PiperOrigin-RevId: 642614278
2024-06-12 07:52:13 -07:00
Daniel Keysers
1ac9857014
Extends Transformer() to prepare for batched processing.
...
PiperOrigin-RevId: 642603025
2024-06-12 07:01:03 -07:00
The gemma.cpp Authors
2a0e6ee976
Fix numerical issue in Softcap by subtracting max.
...
Also update test threshold.
PiperOrigin-RevId: 642587468
2024-06-12 05:42:16 -07:00
The gemma.cpp Authors
f467670de7
Implement float * SfpStream matmul by decompressing 4 * kColsA_RowsB -sized chunks of the second matrix.
...
PiperOrigin-RevId: 642533996
2024-06-12 01:11:59 -07:00
Ray Smith
bdf33c7008
Updated benchmarks.cc to recent changes to Gemma API.
...
PiperOrigin-RevId: 642285902
2024-06-11 08:55:40 -07:00
Phil Culliton
b6565e3bf6
Update AssertClose for large matrices and add large matrix test
...
PiperOrigin-RevId: 642277221
2024-06-11 08:22:47 -07:00
Jan Wassenberg
3e2396f98c
Use Loader/AppArgs to construct gemma_test model, simplify AcceptFunc
...
accept_token: allow default, check if empty when using
allow mixing sample_func and stream_func, call the latter after the former
Also fix missing includes/deps.
PiperOrigin-RevId: 642240012
2024-06-11 05:53:10 -07:00
Daniel Keysers
c557ad23a8
Adds simple-loop versions of missing batched functions.
...
PiperOrigin-RevId: 642189741
2024-06-11 02:14:02 -07:00
Jan Wassenberg
c7f5e93136
Update benchmark with internal init
...
PiperOrigin-RevId: 641929308
2024-06-10 09:35:16 -07:00
Copybara-Service
49d814b519
Merge pull request #224 from szabadka:cleanup
...
PiperOrigin-RevId: 641922102
2024-06-10 09:11:13 -07:00
Jan Wassenberg
c1c6714ad4
Internal experiment
...
PiperOrigin-RevId: 641915024
2024-06-10 08:46:10 -07:00
Zoltan Szabadka
a3a75b77f9
Use CompressedWeights<TConfig<float>> in backpropagation.
...
kWeightsAreCompressed is removed and LoadRawWeights is moved
to compress_weights.cc
2024-06-10 14:34:24 +00:00
Phil Culliton
c5bcb5438c
Fix for transpose matrix creation and additional tests
...
PiperOrigin-RevId: 641868053
2024-06-10 05:24:04 -07:00
Jan Wassenberg
36e6915e18
Add CPU output, error if not C++17, simplify tokenizer ctor
...
PiperOrigin-RevId: 641850879
2024-06-10 04:01:11 -07:00
Phil Culliton
d985d8b867
Shifting large matrix init to heap in ops_test.cc
...
PiperOrigin-RevId: 641311100
2024-06-07 11:38:42 -07:00
Jan Wassenberg
f9b390b134
Support all weight types in a single binary.
...
This changes the command line flags, but the default value retains the previous behavior.
Also add a CreateGemma helper to enable extra args without interface changes.
PiperOrigin-RevId: 641266411
2024-06-07 09:04:45 -07:00
Copybara-Service
24db2ff725
Merge pull request #217 from szabadka:cross-entropy
...
PiperOrigin-RevId: 641241133
2024-06-07 07:17:35 -07:00
Daniel Keysers
06f814fc8b
Small code cleanup suggestions while reading the code.
...
PiperOrigin-RevId: 641220788
2024-06-07 05:33:17 -07:00
Zoltan Szabadka
465998d25a
Add support for custom sampling function to runtime config.
...
With this addition the ComputeCrossEntropy function can be moved
to its own library, because now we can compute it using only the
public API functions from gemma.h
2024-06-07 11:45:07 +00:00
Copybara-Service
f7ac7092d6
Merge pull request #212 from szabadka:adam2
...
PiperOrigin-RevId: 641182573
2024-06-07 02:25:18 -07:00
Zoltan Szabadka
c004799cdc
Add Adam optimizer.
...
Drive-by: Fix compilation errors and tests for backprop functions.
2024-06-06 18:41:36 +00:00
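For readers unfamiliar with the optimizer named in the commit above, the following is a minimal sketch of one textbook Adam update step. It is not the code from the commit: `AdamState`, `AdamUpdate`, and the default hyperparameters are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical optimizer state; names are illustrative, not gemma.cpp's.
struct AdamState {
  std::vector<float> m;  // running first-moment (mean) estimate
  std::vector<float> v;  // running second-moment estimate
  size_t step = 0;       // number of updates applied so far
};

// Applies one Adam update to `weights` in place, given the gradient `grad`.
inline void AdamUpdate(std::vector<float>& weights,
                       const std::vector<float>& grad, AdamState& state,
                       float lr = 1e-3f, float beta1 = 0.9f,
                       float beta2 = 0.999f, float eps = 1e-8f) {
  assert(weights.size() == grad.size());
  if (state.m.empty()) {
    state.m.assign(weights.size(), 0.0f);
    state.v.assign(weights.size(), 0.0f);
  }
  ++state.step;
  const float t = static_cast<float>(state.step);
  // Bias corrections compensate for the zero-initialized moments.
  const float correction1 = 1.0f - std::pow(beta1, t);
  const float correction2 = 1.0f - std::pow(beta2, t);
  for (size_t i = 0; i < weights.size(); ++i) {
    state.m[i] = beta1 * state.m[i] + (1.0f - beta1) * grad[i];
    state.v[i] = beta2 * state.v[i] + (1.0f - beta2) * grad[i] * grad[i];
    const float m_hat = state.m[i] / correction1;
    const float v_hat = state.v[i] / correction2;
    weights[i] -= lr * m_hat / (std::sqrt(v_hat) + eps);
  }
}
```

On the first step the bias-corrected ratio is close to the sign of the gradient, so each weight moves by roughly `lr`.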
Jan Wassenberg
12707ade80
Toward only using compressed weights:
...
CompressedLayer should all be f32 when weights are f32.
PiperOrigin-RevId: 640954519
2024-06-06 11:00:23 -07:00
Paul Chang
6c0be20fa6
Fix Softmax on SVE
...
PiperOrigin-RevId: 640947138
2024-06-06 10:39:30 -07:00
The gemma.cpp Authors
39d4115717
Implement mixed mode matmul: f32 * bf16
...
PiperOrigin-RevId: 640940962
2024-06-06 10:21:46 -07:00
Jan Wassenberg
57c2cd8b52
Simplifications: remove GemmaInterface and GemmaImpl
...
Split common and weights into separate lib
Remove common-inl (does not have to be SIMD code), activations.cc
Centralize switch(Model) to avoid duplication
Move CompressWeightsT to compress_weights.cc
Move LoadWeights to weights.cc
PiperOrigin-RevId: 640869202
2024-06-06 05:54:21 -07:00
Jan Wassenberg
5c3e5f7038
Remove no longer required stats.h - use Highway version instead
...
PiperOrigin-RevId: 640440379
2024-06-05 01:37:48 -07:00
Paul Chang
175e389c3c
revert back to HWY_ASSERT for lane constraints, qualify hn::Add
...
PiperOrigin-RevId: 640193239
2024-06-04 10:10:18 -07:00
Phil Culliton
e71d82ead9
Fix for GenerateZeroMat call in TestTiledMatMul
...
PiperOrigin-RevId: 640180868
2024-06-04 09:32:23 -07:00
Zelalem Aweke
9e213b3d96
Use system topology to pin threads across clusters.
...
PiperOrigin-RevId: 640151974
2024-06-04 07:50:32 -07:00
Jan Wassenberg
4f9155d8c6
Add bf16 matmul support, update naming+test
...
Avoid int32, which can easily overflow for large matrices.
Also fix IDE warning in sfp-inl.
PiperOrigin-RevId: 640149845
2024-06-04 07:41:46 -07:00
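The overflow concern mentioned in the commit above can be illustrated with a small sketch. It assumes row-major storage and uses `uint64_t` for offsets; the helper names are hypothetical, and the commit does not say which wider type was actually adopted.

```cpp
#include <cassert>
#include <cstdint>

// Offset of element (row, col) in a row-major matrix with `cols` columns.
// The arithmetic is done in 64 bits so that large matrices do not wrap.
inline uint64_t RowMajorOffset(uint64_t row, uint64_t cols, uint64_t col) {
  assert(col < cols);
  return row * cols + col;
}

// Returns true if the same offset would not fit in a signed 32-bit integer.
inline bool OverflowsInt32(uint64_t row, uint64_t cols, uint64_t col) {
  return RowMajorOffset(row, cols, col) > static_cast<uint64_t>(INT32_MAX);
}
```

For example, a 50000 x 50000 matrix has 2.5 billion elements, which already exceeds the largest `int32_t` value of 2147483647.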
Zoltan Szabadka
df01700b54
Move the backpropagation code to its own directory
2024-06-04 10:20:16 +00:00
Zoltan Szabadka
3b4fa4a0e3
Use HWY_EXPORT_AND_DYNAMIC_DISPATCH_T where possible.
2024-06-04 09:18:56 +00:00
Zoltan Szabadka
8567978541
Address review comments
2024-06-04 08:37:54 +00:00
Zoltan Szabadka
7e639856da
Fix compilation and tests for gcc
2024-06-04 08:37:54 +00:00
Zoltan Szabadka
36e4d8bbfe
Add first version of backpropagation support.
...
This is still in progress / experimental, currently it is only
implemented for normal gemma MQA attention layers, and no
parallelism is added yet for backward pass.
Since we need to remember all activations from all layers, the
forward pass was also reimplemented with a new activation data
structure.
2024-06-04 08:37:49 +00:00
Paul Chang
ed8f39c058
Refactor GemmaImpl dispatch to use Highway 1.2's HWY_DYNAMIC_DISPATCH_T
...
PiperOrigin-RevId: 639793810
2024-06-03 08:32:29 -07:00
Jan Wassenberg
a44cbdadc2
Update to Highway 1.2 for topology/VQSelect
...
Also fix unused-warning in compress-inl.
PiperOrigin-RevId: 639116915
2024-05-31 12:29:10 -07:00
Paul Chang
5feacf120c
static_assert shape constraints in MatMul 4x4
...
PiperOrigin-RevId: 639069345
2024-05-31 10:02:45 -07:00
Phil Culliton
c616abe628
Unrolled / tiled 4x4 MatMul
...
PiperOrigin-RevId: 638384686
2024-05-29 13:02:35 -07:00
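As a rough illustration of the tiling idea named in the commit above, here is a scalar sketch that computes the output in 4x4 tiles, keeping each tile's accumulators local while looping over the shared dimension. This is not the SIMD implementation from the commit; the function name, the `std::vector` interface, and the row-major layout are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Computes C = A * B for row-major A (rows_a x cols_a) and B (cols_a x cols_b).
// rows_a and cols_b must be multiples of the 4x4 output tile size.
inline std::vector<float> TiledMatMul4x4(const std::vector<float>& a,
                                         const std::vector<float>& b,
                                         size_t rows_a, size_t cols_a,
                                         size_t cols_b) {
  constexpr size_t kTile = 4;
  assert(rows_a % kTile == 0 && cols_b % kTile == 0);
  assert(a.size() == rows_a * cols_a && b.size() == cols_a * cols_b);
  std::vector<float> c(rows_a * cols_b, 0.0f);
  for (size_t r0 = 0; r0 < rows_a; r0 += kTile) {
    for (size_t c0 = 0; c0 < cols_b; c0 += kTile) {
      float acc[kTile][kTile] = {};  // 16 accumulators for one output tile
      for (size_t k = 0; k < cols_a; ++k) {
        for (size_t i = 0; i < kTile; ++i) {
          const float a_ik = a[(r0 + i) * cols_a + k];
          for (size_t j = 0; j < kTile; ++j) {
            acc[i][j] += a_ik * b[k * cols_b + c0 + j];
          }
        }
      }
      // Write the finished tile back to C.
      for (size_t i = 0; i < kTile; ++i) {
        for (size_t j = 0; j < kTile; ++j) {
          c[(r0 + i) * cols_b + c0 + j] = acc[i][j];
        }
      }
    }
  }
  return c;
}
```

Each loaded element of A is reused four times and each element of B four times per tile, which is the main reason tiling reduces memory traffic compared with computing one output element at a time.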
Paul Chang
419dc34ed5
Generic MHA/MQA/GQA implementation
...
PiperOrigin-RevId: 636937885
2024-05-24 09:05:53 -07:00
Zoltan Szabadka
542ad0973a
Fix normalization in Softmax function.
2024-05-24 08:58:31 +00:00
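The commit above does not say what the normalization bug was. For reference, a minimal sketch of a numerically stable softmax, with a hypothetical `Softmax` helper operating on a `std::vector<float>`, might look like this:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Replaces x with softmax(x) in place. Subtracting the maximum before
// exponentiating keeps exp() from overflowing for large inputs.
inline void Softmax(std::vector<float>& x) {
  assert(!x.empty());
  const float max_val = *std::max_element(x.begin(), x.end());
  float sum = 0.0f;
  for (float& v : x) {
    v = std::exp(v - max_val);
    sum += v;
  }
  // Normalize so that the outputs sum to 1.
  const float inv_sum = 1.0f / sum;
  for (float& v : x) v *= inv_sum;
}
```

The sum is always at least 1 because the maximum element contributes `exp(0)`, so the division is safe.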
Apoorv Reddy
1aaf3b3aae
Documenting the RoPE implementation.
...
PiperOrigin-RevId: 636175297
2024-05-22 08:26:29 -07:00
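As background for the RoPE commit above, here is a minimal sketch of rotary positional embeddings applied to a single vector. The pairing of element `d` with element `d + half`, the base of 10000, and the name `ApplyRope` are assumptions for illustration; other implementations pair adjacent elements instead.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Rotates each pair (x[d], x[d + half]) by an angle that depends on the
// token position `pos` and the pair index d.
inline void ApplyRope(std::vector<float>& x, size_t pos,
                      float base = 10000.0f) {
  assert(x.size() % 2 == 0);
  const size_t half = x.size() / 2;
  for (size_t d = 0; d < half; ++d) {
    const float freq_exponent =
        static_cast<float>(2 * d) / static_cast<float>(x.size());
    const float theta =
        static_cast<float>(pos) / std::pow(base, freq_exponent);
    const float cos_t = std::cos(theta);
    const float sin_t = std::sin(theta);
    const float x0 = x[d];
    const float x1 = x[d + half];
    x[d] = x0 * cos_t - x1 * sin_t;
    x[d + half] = x0 * sin_t + x1 * cos_t;
  }
}
```

Because each pair undergoes a pure rotation, the vector's length is unchanged, and position 0 leaves the vector as it was.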
Apoorv Reddy
7f4b85d00b
Add MMLU eval to github
...
PiperOrigin-RevId: 635495178
2024-05-20 10:20:53 -07:00
Paul Chang
82623bdc7f
Refer to --weights rather than --compressed_weights to simplify CLI docs
...
PiperOrigin-RevId: 634391135
2024-05-16 07:51:49 -07:00
Apoorv Reddy
8e641eb4cd
Add TTFT to TimingInfo
...
PiperOrigin-RevId: 634378994
2024-05-16 07:16:53 -07:00
Apoorv Reddy
eb0b96e0a8
Pass most runtime parameters using const RuntimeConfig&
...
PiperOrigin-RevId: 633572507
2024-05-14 07:04:53 -07:00
Apoorv Reddy
f1eab987d8
Store tokens/sec in auxiliary struct TimingInfo.
...
PiperOrigin-RevId: 633108908
2024-05-13 00:04:19 -07:00
Jan Wassenberg
22fe9809ac
Fix SVE build: add missing hn::
...
PiperOrigin-RevId: 632481097
2024-05-10 06:49:26 -07:00
Jan Wassenberg
c5c9fc300c
Enable even/odd for SFP. Refs #166
...
Disable it for float32 because there is not enough benefit.
PiperOrigin-RevId: 631788326
2024-05-08 07:09:06 -07:00
Paul Chang
bacba351d4
Support additional scaling
...
PiperOrigin-RevId: 631429113
2024-05-07 08:16:25 -07:00
Jan Wassenberg
f6d02b2870
Fix RecurrentGemma (refs #166) - one Dot was ignoring scale.
...
Remove extra Dot() overload
MatVecAdd always adds, use MatVecT<kAdd> if conditional.
Remove unused MatVecAddLoop and MatVecLoop
No longer tsan-verify even_odd
PiperOrigin-RevId: 631377279
2024-05-07 04:40:42 -07:00
Copybara-Service
8ed22e52bf
Merge pull request #177 from szabadka:gemma2
...
PiperOrigin-RevId: 630388843
2024-05-03 07:52:27 -07:00
Zoltan Szabadka
19017fdb6d
Fix expression in DASSERT()
2024-05-03 13:54:20 +00:00
Phil Culliton
28ca001d5e
Matmul and test functions
...
PiperOrigin-RevId: 630373984
2024-05-03 06:39:36 -07:00
Zoltan Szabadka
429eb78512
Remove unused vars.
2024-05-03 13:37:17 +00:00
Zoltan Szabadka
3d72f17261
Use more parallelism in attention block in prefill mode.
...
Move the loop over the tokens inside the attention block and
then create kHeads * num_tokens threads.
This helps the multi-threaded speed only in case of the 2b gemma
model, but to be consistent we move the loop over the tokens inside
the griffin recurrent layer and the FFW layer as well. This is
also a preparation for using the MatMul operation later.
Benchmark results (summarization with 1600 tokens for prefill
and essay writing with 500 tokens for generation):
```
                Prefill speed
Num threads     BEFORE       AFTER
32              61.76 t/s    65.08 t/s
64              89.46 t/s    98.62 t/s
```
2024-05-03 13:23:07 +00:00
Copybara-Service
6eeef2e2d9
Merge pull request #166 from samkaufman:deinterleave-vecs
...
PiperOrigin-RevId: 630360778
2024-05-03 05:23:31 -07:00
Zoltan Szabadka
9a2682d544
Use more parallelism in the QKV projections of the MHA block.
...
We compute all three projections with one MatVec and then copy
the kv part to the cache.
Benchmark results for 7b-it model that uses MHA blocks (summarization with
1600 tokens for prefill and essay writing with 500 tokens for generation):
```
                Prefill speed             Generation speed
Num threads     BEFORE       AFTER        BEFORE       AFTER
32              13.75 t/s    14.80 t/s     9.22 t/s     9.77 t/s
64              19.89 t/s    24.83 t/s    12.46 t/s    13.66 t/s
```
2024-05-02 13:46:45 +00:00
Zoltan Szabadka
0afa480d90
Use more parallelism in the final output of the attention block.
...
We use MatVec instead of MatVecLoop for the per-head dense layers,
because we can parallelize more on the rows of the matrix than
on the number of heads. This will be even more efficient after
we rearrange the weights and can have a single MatVec operation.
Benchmark results (summarization with 1600 tokens for prefill
and essay writing with 500 tokens for generation):
```
                Prefill speed             Generation speed
Num threads     BEFORE       AFTER        BEFORE       AFTER
32              58.24 t/s    61.79 t/s    32.11 t/s    32.62 t/s
64              83.62 t/s    92.00 t/s    41.10 t/s    41.80 t/s
```
2024-05-02 09:30:07 +00:00
Sam Kaufman
4a6173d929
Remove unused vars.
2024-05-02 00:41:44 -07:00
Sam Kaufman
564937ede6
Merge branch 'dev' into deinterleave-vecs
2024-04-30 16:23:04 -07:00
Sam Kaufman
2829ef17ad
Check for HWY_NATIVE_DOT_BF16.
2024-04-30 15:19:28 -07:00
Sam Kaufman
59ebecce22
Fix: specialized MatVecAdd was never called.
2024-04-30 15:17:27 -07:00
Jan Wassenberg
12fb2f05cf
Add per-thread even_odd storage for #166.
...
Also inline ProjQ and ProjKV lambdas,
add missing includes/deps for ops_test.
PiperOrigin-RevId: 629460608
2024-04-30 10:42:23 -07:00
Zoltan Szabadka
f8ccb8e37c
Fix kv offset computation for MHA config.
2024-04-30 16:19:14 +00:00
Zoltan Szabadka
afaca4efa8
Use more parallelism in the QKV projections in MQA mode.
...
Instead of MatVecLoop, we use MatVec and we combine k and v
into one 2 * kQKVDim long vector so that K and V projections
can be combined into one MatVec operation.
Benchmark results (summarization with 1600 tokens for prefill
and essay writing with 500 tokens for generation):
```
                Prefill speed             Generation speed
Num threads     BEFORE       AFTER        BEFORE       AFTER
4                9.81 t/s     9.96 t/s     8.39 t/s     8.46 t/s
18              31.50 t/s    36.67 t/s    23.10 t/s    25.83 t/s
32              45.36 t/s    58.91 t/s    27.60 t/s    31.25 t/s
64              57.72 t/s    80.64 t/s    35.40 t/s    39.76 t/s
```
2024-04-30 13:10:14 +00:00
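The idea in the commit above, producing K and V with a single matrix-vector product, can be sketched as follows. The K and V weight matrices are stacked row-wise so that one product yields a vector of length 2 * kQKVDim, whose first half is k and second half is v. `StackRows` and this scalar `MatVec` are hypothetical helpers, not the gemma.cpp functions of the same name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// out = W * x for a row-major W with `rows` rows and x.size() columns.
inline std::vector<float> MatVec(const std::vector<float>& w,
                                 const std::vector<float>& x, size_t rows) {
  const size_t cols = x.size();
  assert(w.size() == rows * cols);
  std::vector<float> out(rows, 0.0f);
  for (size_t r = 0; r < rows; ++r) {
    for (size_t c = 0; c < cols; ++c) {
      out[r] += w[r * cols + c] * x[c];
    }
  }
  return out;
}

// Stacks the rows of w_k on top of the rows of w_v (both row-major with the
// same number of columns), giving one matrix with twice as many rows.
inline std::vector<float> StackRows(const std::vector<float>& w_k,
                                    const std::vector<float>& w_v) {
  assert(w_k.size() == w_v.size());
  std::vector<float> stacked(w_k);
  stacked.insert(stacked.end(), w_v.begin(), w_v.end());
  return stacked;
}
```

With a single product over twice as many rows, a thread pool that parallelizes over rows has more independent work items than it would with two separate, smaller products.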
Sam Kaufman
6a78a23f4c
Abstracted some MatVecAdd spec. dupes.
2024-04-29 16:23:38 -07:00
Sam Kaufman
f608337fef
Remove Bf16ToF32EO and use PromoteEvenTo and PromoteOddTo.
2024-04-29 14:13:07 -07:00
Sam Kaufman
aa0b113214
(VecT*) to static_cast<VecT*>.
2024-04-29 12:53:47 -07:00
Sam Kaufman
5cb63346aa
supports_eo -> kSupportsEvenOdd
2024-04-29 12:51:35 -07:00
Zoltan Szabadka
27117cc39f
Simplify threading: remove the use of inner_pool.
...
We only used inner_pool in the prefill FFW function, and there we
can achieve sufficient parallelism on the rows of the matrix-vector
multiplications.
Benchmark results on a 1600-token summarization task:
```
                Prefill speed
Num threads     BEFORE       AFTER
4                9.24 t/s     9.76 t/s
18              31.41 t/s    31.16 t/s
32              31.41 t/s    45.13 t/s
64              31.03 t/s    57.85 t/s
```
2024-04-29 16:07:30 +00:00
Paul Chang
1d18c5a129
Improve documentation for compress_weights flags
...
PiperOrigin-RevId: 629053191
2024-04-29 06:49:50 -07:00
Sam Kaufman
0816a1070d
Even-odd layout MatVecs for bf16 weights.
2024-04-28 20:09:25 -07:00
Paul Chang
2d4de6b08b
Support absolute positional embeddings from vanilla transformer
...
PiperOrigin-RevId: 628100831
2024-04-25 09:32:14 -07:00
Paul Chang
75eca87039
Simplify prefill early-exit (originally Merge #156)
...
PiperOrigin-RevId: 627788524
2024-04-24 11:11:42 -07:00
Charles Chan
ea45d7c4d7
Use lambda to split function and allow stream_token to break prefill, too
2024-04-23 22:55:01 +08:00
Paul Chang
e8d29792ac
New token validity assertions, improve prompt truncation warning
...
PiperOrigin-RevId: 627376194
2024-04-23 07:05:59 -07:00
Jan Wassenberg
3bf22abb22
Fix sign comparison warnings
...
PiperOrigin-RevId: 627299902
2024-04-23 01:16:51 -07:00
Jan Wassenberg
e9a0caed87
Further improve IO, enable multiple backends without -D.
...
Move Path into io.h and use for opening files.
Removes dependency of gemma_lib on args.
Separate Windows codepath instead of emulating POSIX functions.
Plus lint fixes.
PiperOrigin-RevId: 626279004
2024-04-19 00:40:29 -07:00
Paul Chang
38f1ea9b80
Eliminate redundant copies of TokenString()
...
Move this function outside of HWY_NAMESPACE since it doesn't need to be
optimized for any particular architecture.
PiperOrigin-RevId: 626098641
2024-04-18 11:31:50 -07:00
Jan Wassenberg
a8ceb75f43
Improved IO abstraction layer
...
Move to unique_ptr-like File class.
Move `if OS_WIN` into wrapper functions.
exists -> Exists.
PiperOrigin-RevId: 625923056
2024-04-17 23:15:07 -07:00
Andrey Mikhaylov
4ef3da733a
Fixed minor things and added comments.
2024-04-12 15:39:16 +00:00
Andrey Mikhaylov
2c5706f159
Add comments regarding layers output usage.
2024-04-12 15:39:16 +00:00
Andrey Mikhaylov
03284d752e
Added layers output functionality to gemma and a binary debug_output to save the outputs to a json file.
2024-04-12 15:39:16 +00:00
RangerUFO
e541707caa
Rename the fields of Griffin weights
2024-04-10 21:04:31 +08:00
RangerUFO
4e960d67f6
Fix typos
2024-04-10 20:38:18 +08:00
RangerUFO
809bd0709d
Refactor data structures to reduce memory usage
2024-04-10 19:35:23 +08:00
Jan Wassenberg
881eeffe0a
Lint fixes: strcat, includes, arg naming
...
PiperOrigin-RevId: 623435210
2024-04-10 03:12:41 -07:00
RangerUFO
2099b37732
Change `NumGemmaLayers` and `NumGriffinLayers` to constants in configs
2024-04-09 20:44:41 +08:00
Jan Wassenberg
a982ec1287
Move code to gemma/ so we can remove error-prone copybara: comments.
...
Also fix includes and Lint warnings.
PiperOrigin-RevId: 623127487
2024-04-09 04:45:42 -07:00