425 Commits

Author SHA1 Message Date
Jan Wassenberg f9b390b134 Support all weight types in a single binary.
This changes the command line flags, but the default value retains the previous behavior.

Also add a CreateGemma helper to enable extra args without interface changes.

PiperOrigin-RevId: 641266411
2024-06-07 09:04:45 -07:00
Copybara-Service 24db2ff725 Merge pull request #217 from szabadka:cross-entropy
PiperOrigin-RevId: 641241133
2024-06-07 07:17:35 -07:00
Daniel Keysers 06f814fc8b Small code cleanup suggestions while reading the code.
PiperOrigin-RevId: 641220788
2024-06-07 05:33:17 -07:00
Zoltan Szabadka 465998d25a Add support for custom sampling function to runtime config.
With this addition the ComputeCrossEntropy function can be moved
to its own library, because now we can compute it using only the
public API functions from gemma.h
2024-06-07 11:45:07 +00:00
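The idea in the commit above (computing cross-entropy through a sampling callback) can be sketched as follows. This is a minimal illustration, not the gemma.h interface: `SampleFunc`, `CrossEntropy`, and the per-step probability vectors are hypothetical names and stand-ins for whatever the real generation loop provides. The callback always "samples" the known target token and accumulates its negative log-probability as a side effect.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical callback type (an assumption, not the gemma.h signature):
// receives the probabilities over the vocabulary, returns the token to emit.
using SampleFunc = std::function<int(const float* probs, size_t vocab_size)>;

// Returns the summed cross-entropy (in nats) of `targets` under the given
// per-step distributions. `step_probs` stands in for the model's output at
// each position; a real generation loop would produce these one at a time.
double CrossEntropy(const std::vector<std::vector<float>>& step_probs,
                    const std::vector<int>& targets) {
  assert(step_probs.size() == targets.size());
  double total = 0.0;
  size_t pos = 0;

  // Teacher forcing: always emit the target and record -log p(target).
  SampleFunc sample = [&](const float* probs, size_t vocab_size) {
    const int target = targets[pos];
    assert(target >= 0 && static_cast<size_t>(target) < vocab_size);
    total -= std::log(static_cast<double>(probs[target]));
    return target;
  };

  // Stand-in for the generation loop, which would call the sampler per token.
  for (; pos < targets.size(); ++pos) {
    const int emitted = sample(step_probs[pos].data(), step_probs[pos].size());
    assert(emitted == targets[pos]);
    (void)emitted;  // silences unused-variable warnings when NDEBUG is set
  }
  return total;
}
```

Because the measurement lives entirely inside the callback, the caller needs nothing beyond a way to supply a sampler, which is presumably why the function could move into its own library.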
Copybara-Service f7ac7092d6 Merge pull request #212 from szabadka:adam2
PiperOrigin-RevId: 641182573
2024-06-07 02:25:18 -07:00
Zoltan Szabadka c004799cdc Add Adam optimizer.
Drive-by: Fix compilation errors and tests for backprop functions.
2024-06-06 18:41:36 +00:00
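For readers unfamiliar with Adam, the standard update rule with bias correction can be sketched as below. This is the textbook formulation, not necessarily what the commit implements; `AdamState`, `AdamUpdate`, and the default hyperparameters are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical optimizer state; not the gemma.cpp data structure.
struct AdamState {
  std::vector<float> m;  // running mean of gradients (first moment)
  std::vector<float> v;  // running mean of squared gradients (second moment)
  size_t step = 0;       // number of updates applied so far
};

// Applies one Adam update to `weights` in place.
void AdamUpdate(const std::vector<float>& grad, std::vector<float>& weights,
                AdamState& state, float lr = 1e-3f, float beta1 = 0.9f,
                float beta2 = 0.999f, float eps = 1e-8f) {
  assert(grad.size() == weights.size());
  if (state.m.empty()) {  // lazily allocate the moments on first use
    state.m.assign(weights.size(), 0.0f);
    state.v.assign(weights.size(), 0.0f);
  }
  assert(state.m.size() == weights.size());

  ++state.step;
  // Bias-correction terms compensate for the zero-initialized moments.
  const float c1 = 1.0f - std::pow(beta1, static_cast<float>(state.step));
  const float c2 = 1.0f - std::pow(beta2, static_cast<float>(state.step));

  for (size_t i = 0; i < weights.size(); ++i) {
    state.m[i] = beta1 * state.m[i] + (1.0f - beta1) * grad[i];
    state.v[i] = beta2 * state.v[i] + (1.0f - beta2) * grad[i] * grad[i];
    const float m_hat = state.m[i] / c1;
    const float v_hat = state.v[i] / c2;
    weights[i] -= lr * m_hat / (std::sqrt(v_hat) + eps);
  }
}
```

On the first step the bias correction cancels the decay factors, so each weight moves by roughly `lr` in the direction opposite to its gradient, regardless of the gradient's magnitude.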
Jan Wassenberg 12707ade80 Toward only using compressed weights:
CompressedLayer should all be f32 when weights are f32.

PiperOrigin-RevId: 640954519
2024-06-06 11:00:23 -07:00
Paul Chang 6c0be20fa6 Fix Softmax on SVE
PiperOrigin-RevId: 640947138
2024-06-06 10:39:30 -07:00
The gemma.cpp Authors 39d4115717 Implement mixed mode matmul: f32 * bf16
PiperOrigin-RevId: 640940962
2024-06-06 10:21:46 -07:00
Jan Wassenberg 57c2cd8b52 Simplifications: remove GemmaInterface and GemmaImpl
Split common and weights into separate lib
Remove common-inl (does not have to be SIMD code), activations.cc
Centralize switch(Model) to avoid duplication
Move CompressWeightsT to compress_weights.cc
Move LoadWeights to weights.cc

PiperOrigin-RevId: 640869202
2024-06-06 05:54:21 -07:00
Jan Wassenberg 5c3e5f7038 Remove no longer required stats.h - use Highway version instead
PiperOrigin-RevId: 640440379
2024-06-05 01:37:48 -07:00
Paul Chang 175e389c3c revert back to HWY_ASSERT for lane constraints, qualify hn::Add
PiperOrigin-RevId: 640193239
2024-06-04 10:10:18 -07:00
Phil Culliton e71d82ead9 Fix for GenerateZeroMat call in TestTiledMatMul
PiperOrigin-RevId: 640180868
2024-06-04 09:32:23 -07:00
Zelalem Aweke 9e213b3d96 Use system topology to pin threads across clusters.
PiperOrigin-RevId: 640151974
2024-06-04 07:50:32 -07:00
Jan Wassenberg 4f9155d8c6 Add bf16 matmul support, update naming+test
Avoid int32, which can easily overflow for large matrices.
Also fix IDE warning in sfp-inl.

PiperOrigin-RevId: 640149845
2024-06-04 07:41:46 -07:00
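The overflow concern in the commit above can be illustrated with a small sketch. The helper names `FlatIndex` and `ExceedsInt32` are hypothetical, and the example assumes a platform where `size_t` is 64 bits wide. The commit does not say which type replaced int32.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Row-major offset computed in size_t, so row * num_cols is not limited to
// the range of a 32-bit signed integer (assumes 64-bit size_t).
inline size_t FlatIndex(size_t row, size_t col, size_t num_cols) {
  assert(col < num_cols);
  return row * num_cols + col;
}

// Returns true if a rows x cols matrix has more elements than int32 can index.
inline bool ExceedsInt32(size_t rows, size_t cols) {
  const uint64_t total = static_cast<uint64_t>(rows) * cols;
  return total >
         static_cast<uint64_t>(std::numeric_limits<int32_t>::max());
}
```

For example, a 65536 x 65536 matrix has 2^32 elements, which is more than an `int32_t` can index, while the `size_t` computation handles it without overflow.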
Zoltan Szabadka df01700b54 Move the backpropagation code to its own directory 2024-06-04 10:20:16 +00:00
Zoltan Szabadka 3b4fa4a0e3 Use HWY_EXPORT_AND_DYNAMIC_DISPATCH_T where possible. 2024-06-04 09:18:56 +00:00
Zoltan Szabadka 8567978541 Address review comments 2024-06-04 08:37:54 +00:00
Zoltan Szabadka 7e639856da Fix compilation and tests for gcc 2024-06-04 08:37:54 +00:00
Zoltan Szabadka 36e4d8bbfe Add first version of backpropagation support.
This is still in progress / experimental, currently it is only
implemented for normal gemma MQA attention layers, and no
parallelism is added yet for backward pass.

Since we need to remember all activations from all layers, the
forward pass was also reimplemented with a new activation data
structure.
2024-06-04 08:37:49 +00:00
Paul Chang ed8f39c058 Refactor GemmaImpl dispatch to use Highway 1.2's HWY_DYNAMIC_DISPATCH_T
PiperOrigin-RevId: 639793810
2024-06-03 08:32:29 -07:00
Jan Wassenberg a44cbdadc2 Update to Highway 1.2 for topology/VQSelect
Also fix unused-warning in compress-inl.

PiperOrigin-RevId: 639116915
2024-05-31 12:29:10 -07:00
Paul Chang 5feacf120c static_assert shape constraints in MatMul 4x4
PiperOrigin-RevId: 639069345
2024-05-31 10:02:45 -07:00
Phil Culliton c616abe628 Unrolled / tiled 4x4 MatMul
PiperOrigin-RevId: 638384686
2024-05-29 13:02:35 -07:00
Paul Chang 419dc34ed5 Generic MHA/MQA/GQA implementation
PiperOrigin-RevId: 636937885
2024-05-24 09:05:53 -07:00
Zoltan Szabadka 542ad0973a Fix normalization in Softmax function. 2024-05-24 08:58:31 +00:00
Apoorv Reddy 1aaf3b3aae Documenting the RoPE implementation.
PiperOrigin-RevId: 636175297
2024-05-22 08:26:29 -07:00
Apoorv Reddy 7f4b85d00b Add MMLU eval to github
PiperOrigin-RevId: 635495178
2024-05-20 10:20:53 -07:00
Paul Chang 82623bdc7f Refer to --weights rather than --compressed_weights to simplify CLI docs
PiperOrigin-RevId: 634391135
2024-05-16 07:51:49 -07:00
Apoorv Reddy 8e641eb4cd Add TTFT to TimingInfo
PiperOrigin-RevId: 634378994
2024-05-16 07:16:53 -07:00
Apoorv Reddy eb0b96e0a8 Pass most runtime parameters using const RuntimeConfig&
PiperOrigin-RevId: 633572507
2024-05-14 07:04:53 -07:00
Apoorv Reddy f1eab987d8 Store tokens/sec in auxiliary struct TimingInfo.
PiperOrigin-RevId: 633108908
2024-05-13 00:04:19 -07:00
Jan Wassenberg 22fe9809ac Fix SVE build: add missing hn::
PiperOrigin-RevId: 632481097
2024-05-10 06:49:26 -07:00
Jan Wassenberg c5c9fc300c Enable even/odd for SFP. Refs #166
Disable it for float32 because there is not enough benefit.

PiperOrigin-RevId: 631788326
2024-05-08 07:09:06 -07:00
Paul Chang bacba351d4 Support additional scaling
PiperOrigin-RevId: 631429113
2024-05-07 08:16:25 -07:00
Jan Wassenberg f6d02b2870 Fix RecurrentGemma (refs #166) - one Dot was ignoring scale.
Remove extra Dot() overload
MatVecAdd always adds, use MatVecT<kAdd> if conditional.
Remove unused MatVecAddLoop and MatVecLoop
No longer tsan-verify even_odd

PiperOrigin-RevId: 631377279
2024-05-07 04:40:42 -07:00
Copybara-Service 8ed22e52bf Merge pull request #177 from szabadka:gemma2
PiperOrigin-RevId: 630388843
2024-05-03 07:52:27 -07:00
Zoltan Szabadka 19017fdb6d Fix expression in DASSERT() 2024-05-03 13:54:20 +00:00
Phil Culliton 28ca001d5e Matmul and test functions
PiperOrigin-RevId: 630373984
2024-05-03 06:39:36 -07:00
Zoltan Szabadka 429eb78512 Remove unused vars. 2024-05-03 13:37:17 +00:00
Zoltan Szabadka 3d72f17261 Use more parallelism in attention block in prefill mode.
Move the loop over the tokens inside the attention block and
then create kHeads * num_tokens threads.

This helps the multi-threaded speed only in case of the 2b gemma
model, but to be consistent we move the loop over the tokens inside
the griffin recurrent layer and the FFW layer as well. This is
also a preparation for using the MatMul operation later.

Benchmark results (summarization with 1600 tokens for prefill
and essay writing with 500 tokens for generation):

```
                   Prefill speed
Num threads      BEFORE       AFTER
32               61.76 t/s    65.08 t/s
64               89.46 t/s    98.62 t/s
```
2024-05-03 13:23:07 +00:00
Copybara-Service 6eeef2e2d9 Merge pull request #166 from samkaufman:deinterleave-vecs
PiperOrigin-RevId: 630360778
2024-05-03 05:23:31 -07:00
Zoltan Szabadka 9a2682d544 Use more parallelism in the QKV projections of the MHA block.
We compute all three projections with one MatVec and then copy
the kv part to the cache.

Benchmark results for 7b-it model that uses MHA blocks (summarization with
1600 tokens for prefill and essay writing with 500 tokens for generation):

```
                   Prefill speed                Generation speed
Num threads      BEFORE       AFTER            BEFORE       AFTER
32               13.75 t/s    14.80 t/s       9.22 t/s     9.77 t/s
64               19.89 t/s    24.83 t/s      12.46 t/s    13.66 t/s
```
2024-05-02 13:46:45 +00:00
Zoltan Szabadka 0afa480d90 Use more parallelism in the final output of the attention block.
We use MatVec instead of MatVecLoop for the per-head dense layers,
because we can parallelize more on the rows of the matrix than
on the number of heads. This will be even more efficient after
we rearrange the weights and can have a single MatVec operation.

Benchmark results (summarization with 1600 tokens for prefill
and essay writing with 500 tokens for generation):

```
                   Prefill speed                Generation speed
Num threads      BEFORE       AFTER            BEFORE       AFTER
32               58.24 t/s    61.79 t/s      32.11 t/s    32.62 t/s
64               83.62 t/s    92.00 t/s      41.10 t/s    41.80 t/s
```
2024-05-02 09:30:07 +00:00
Sam Kaufman 4a6173d929 Remove unused vars. 2024-05-02 00:41:44 -07:00
Sam Kaufman 564937ede6 Merge branch 'dev' into deinterleave-vecs 2024-04-30 16:23:04 -07:00
Sam Kaufman 2829ef17ad Check for HWY_NATIVE_DOT_BF16. 2024-04-30 15:19:28 -07:00
Sam Kaufman 59ebecce22 Fix: specialized MatVecAdd was never called. 2024-04-30 15:17:27 -07:00
Jan Wassenberg 12fb2f05cf Add per-thread even_odd storage for #166.
Also inline ProjQ and ProjKV lambdas,
add missing includes/deps for ops_test.

PiperOrigin-RevId: 629460608
2024-04-30 10:42:23 -07:00
Zoltan Szabadka f8ccb8e37c Fix kv offset computation for MHA config. 2024-04-30 16:19:14 +00:00
Zoltan Szabadka afaca4efa8 Use more parallelism in the QKV projections in MQA mode.
Instead of MatVecLoop, we use MatVec and we combine k and v
into one 2 * kQKVDim long vector so that K and V projections
can be combined into one MatVec operation.

Benchmark results (summarization with 1600 tokens for prefill
and essay writing with 500 tokens for generation):

```
                   Prefill speed                Generation speed
Num threads      BEFORE       AFTER            BEFORE       AFTER
4                 9.81 t/s     9.96 t/s       8.39 t/s     8.46 t/s
18               31.50 t/s    36.67 t/s      23.10 t/s    25.83 t/s
32               45.36 t/s    58.91 t/s      27.60 t/s    31.25 t/s
64               57.72 t/s    80.64 t/s      35.40 t/s    39.76 t/s
```
2024-04-30 13:10:14 +00:00
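The K/V fusion described in the commit above can be sketched in scalar code. This is an illustration under assumptions, not the gemma.cpp implementation: `MatVec` here is a plain reference loop rather than the SIMD, thread-pooled version, and `StackKV` is a hypothetical helper. Stacking the K weights on top of the V weights yields a matrix with 2 * kQKVDim rows, so one matrix-vector product produces both projections.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference y = W * x for a row-major W with `rows` rows and x.size() columns.
std::vector<float> MatVec(const std::vector<float>& w, size_t rows,
                          const std::vector<float>& x) {
  const size_t cols = x.size();
  assert(w.size() == rows * cols);
  std::vector<float> y(rows, 0.0f);
  // Rows are independent, so more rows means more work to spread over threads.
  for (size_t r = 0; r < rows; ++r) {
    float sum = 0.0f;
    for (size_t c = 0; c < cols; ++c) sum += w[r * cols + c] * x[c];
    y[r] = sum;
  }
  return y;
}

// Stacks W_k on top of W_v (both row-major with the same column count), so
// that MatVec on the result returns k followed by v.
std::vector<float> StackKV(const std::vector<float>& w_k,
                           const std::vector<float>& w_v) {
  assert(w_k.size() == w_v.size());
  std::vector<float> w_kv(w_k);
  w_kv.insert(w_kv.end(), w_v.begin(), w_v.end());
  return w_kv;
}
```

The first half of the output is the K projection and the second half is V. The fused call is numerically identical to two separate calls here, because each output row is still computed by the same dot product.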
Sam Kaufman 6a78a23f4c Abstracted some MatVecAdd spec. dupes. 2024-04-29 16:23:38 -07:00
Sam Kaufman f608337fef Remove Bf16ToF32EO and use PromoteEvenTo and PromoteOddTo. 2024-04-29 14:13:07 -07:00
Sam Kaufman aa0b113214 (VecT*) to static_cast<VecT*>. 2024-04-29 12:53:47 -07:00
Sam Kaufman 5cb63346aa supports_eo -> kSupportsEvenOdd 2024-04-29 12:51:35 -07:00
Zoltan Szabadka 27117cc39f Simplify threading: remove the use of inner_pool.
We only used inner_pool in the prefill FFW function, and there we
can achieve sufficient parallelism on the rows of the matrix-vector
multiplications.

Benchmark results on a 1600-token summarization task:

```
               Prefill speed
Num threads    BEFORE         AFTER
4               9.24 t/s       9.76 t/s
18             31.41 t/s      31.16 t/s
32             31.41 t/s      45.13 t/s
64             31.03 t/s      57.85 t/s
```
2024-04-29 16:07:30 +00:00
Paul Chang 1d18c5a129 Improve documentation for compress_weights flags
PiperOrigin-RevId: 629053191
2024-04-29 06:49:50 -07:00
Sam Kaufman 0816a1070d Even-odd layout MatVecs for bf16 weights. 2024-04-28 20:09:25 -07:00
Paul Chang 2d4de6b08b Support absolute positional embeddings from vanilla transformer
PiperOrigin-RevId: 628100831
2024-04-25 09:32:14 -07:00
Paul Chang 75eca87039 Simplify prefill early-exit (originally Merge #156)
PiperOrigin-RevId: 627788524
2024-04-24 11:11:42 -07:00
Charles Chan ea45d7c4d7 Use lambda to split function and allow stream_token to break prefill, too 2024-04-23 22:55:01 +08:00
Paul Chang e8d29792ac New token validity assertions, improve prompt truncation warning
PiperOrigin-RevId: 627376194
2024-04-23 07:05:59 -07:00
Jan Wassenberg 3bf22abb22 Fix sign comparison warnings
PiperOrigin-RevId: 627299902
2024-04-23 01:16:51 -07:00
Jan Wassenberg e9a0caed87 Further improve IO, enable multiple backends without -D.
Move Path into io.h and use for opening files.
Removes dependency of gemma_lib on args.
Separate Windows codepath instead of emulating POSIX functions.

Plus lint fixes.

PiperOrigin-RevId: 626279004
2024-04-19 00:40:29 -07:00
Paul Chang 38f1ea9b80 Eliminate redundant copies of TokenString()
Move this function outside of HWY_NAMESPACE since it doesn't need to be
optimized for any particular architecture.

PiperOrigin-RevId: 626098641
2024-04-18 11:31:50 -07:00
Jan Wassenberg a8ceb75f43 Improved IO abstraction layer
Move to unique_ptr-like File class.
Move `if OS_WIN` into wrapper functions.
exists -> Exists.

PiperOrigin-RevId: 625923056
2024-04-17 23:15:07 -07:00
Andrey Mikhaylov 4ef3da733a Fixed minor things and added comments. 2024-04-12 15:39:16 +00:00
Andrey Mikhaylov 2c5706f159 Add comments regarding layers output usage. 2024-04-12 15:39:16 +00:00
Andrey Mikhaylov 03284d752e Added layers output functionality to gemma and a binary debug_output to save the outputs to a json file. 2024-04-12 15:39:16 +00:00
RangerUFO e541707caa Rename the fields of Griffin weights 2024-04-10 21:04:31 +08:00
RangerUFO 4e960d67f6 Fix typos 2024-04-10 20:38:18 +08:00
RangerUFO 809bd0709d Refactor data structures to reduce memory usage 2024-04-10 19:35:23 +08:00
Jan Wassenberg 881eeffe0a Lint fixes: strcat, includes, arg naming
PiperOrigin-RevId: 623435210
2024-04-10 03:12:41 -07:00
RangerUFO 2099b37732 Change `NumGemmaLayers` and `NumGriffinLayers` to constants in configs 2024-04-09 20:44:41 +08:00
Jan Wassenberg a982ec1287 Move code to gemma/ so we can remove error-prone copybara: comments.
Also fix includes and Lint warnings.

PiperOrigin-RevId: 623127487
2024-04-09 04:45:42 -07:00