Commit Graph

219 Commits

Author SHA1 Message Date
Jan Wassenberg b831fa8482 1.3x prefill, 0.95x decode: matmul replacing last matvec
Before: 38.28 prefill, 9.17 decode (with profiler enabled, prompt = 330 tokens)
```
Gen.FFW                                 :      15414 x         4692352 = 24.166318
Gen.Attention.SumHeads                  :      15414 x         1394804 =  7.183451 !!
Gen.Embedding                           :        361 x        49961894 =  6.026297
Gen.Attention.QKV                       :      15414 x         1005125 =  5.176546
Gen.Attention.DotSoftmax                :      15414 x          885480 =  4.560357
RopeAndMulBy                            :     696528 x           11867 =  2.761818
```

After: 49.80 prefill, 8.68 decode
```
Gen.FFW                                 :      14448 x         5312783 = 25.646868
Gen.Embedding                           :        338 x        63044815 =  7.119845
Gen.Attention.QKV                       :      14448 x         1115003 =  5.382557
Gen.Attention.DotSoftmax                :      14448 x          897577 =  4.332957
RopeAndMulBy                            :     673344 x           11886 =  2.674156
Gen.Attention.SumHeads                  :      14448 x          518291 =  2.501993 !!
```
PiperOrigin-RevId: 662024085
2024-08-12 03:36:01 -07:00
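The ratios in the title of commit b831fa8482 can be checked from the quoted numbers. The sketch below assumes the two figures on each line are prefill and decode throughput, in that order; `Speedup` and `PrintSpeedups` are hypothetical helper names, not part of gemma.cpp.

```cpp
#include <cstdio>

// Ratio of new throughput to old throughput; a value above 1 means faster.
constexpr double Speedup(double before, double after) { return after / before; }

// Prints the ratios for the numbers quoted in the commit message.
void PrintSpeedups() {
  std::printf("prefill  %.2fx\n", Speedup(38.28, 49.80));
  std::printf("decode   %.2fx\n", Speedup(9.17, 8.68));
  // The profiler reports total time, so the ratio is old time / new time.
  std::printf("SumHeads %.2fx\n", 7.183451 / 2.501993);
}
```

The first two ratios round to the 1.3x and 0.95x in the title; the total time spent in `Gen.Attention.SumHeads` (the rows marked `!!`) drops by a factor of roughly 2.87.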
Jan Wassenberg 282f73ec2f Add pin flag to disable pinning. Refs #338
PiperOrigin-RevId: 661389171
2024-08-09 13:47:12 -07:00
Apoorv Reddy fd1b0743a7 Rename Gemma9B and Gemma27B to Gemma2_9B and Gemma2_27B.
This is to make it clear that these models are part of the Gemma2 family of models.

PiperOrigin-RevId: 661181682
2024-08-09 02:09:06 -07:00
Jan Wassenberg 2ebbe4076f 1.03-1.08x decode speedup: precompute Rope theta, fuse
Split attention into functions, move into class.
Fuse Rope and MulBy, allow non-in-place version to avoid copy from q to KV.
Sink if() into MaybeLogitsSoftCap.

PiperOrigin-RevId: 661168418
2024-08-09 01:23:24 -07:00
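A minimal sketch of the idea in commit 2ebbe4076f: the Rope inverse timescales depend only on the head dimension, so they can be computed once, and the rotation can be fused with the scalar multiply and written to a separate output. The function names, the base of 10000 and the half-split pairing of elements are assumptions for illustration, not the actual gemma.cpp code.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Computes the per-pair inverse timescales once. Base 10000 is the common
// Rope default and an assumption here.
std::vector<float> PrecomputeInvTimescale(std::size_t dim) {
  const std::size_t half = dim / 2;
  std::vector<float> inv(half);
  for (std::size_t i = 0; i < half; ++i) {
    const float exponent = static_cast<float>(2 * i) / static_cast<float>(dim);
    inv[i] = 1.0f / std::pow(10000.0f, exponent);
  }
  return inv;
}

// Rotates x (length 2 * inv_timescale.size()) for position `pos` and scales
// by `mul` in the same pass. `out` may alias `x` (in-place) or point to a
// different buffer, which avoids a separate copy afterwards.
void RopeAndMulBySketch(float mul, const float* x,
                        const std::vector<float>& inv_timescale, int pos,
                        float* out) {
  const std::size_t half = inv_timescale.size();
  for (std::size_t i = 0; i < half; ++i) {
    const float theta = static_cast<float>(pos) * inv_timescale[i];
    const float c = std::cos(theta);
    const float s = std::sin(theta);
    // Read both inputs before writing so that aliasing is safe.
    const float x0 = x[i];
    const float x1 = x[i + half];
    out[i] = mul * (x0 * c - x1 * s);
    out[i + half] = mul * (x0 * s + x1 * c);
  }
}
```

The table from `PrecomputeInvTimescale` would be built once per model and reused for every token and head, leaving only one multiply plus `cos`/`sin` per pair in the inner loop.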
The gemma.cpp Authors 27258b03e6 Improve performance logging
PiperOrigin-RevId: 660534330
2024-08-07 14:15:43 -07:00
Jan Wassenberg 5e433e774a 1.1x prefill speedup, revamp threading in preparation for hierarchical parallelism.
Limit thread counts to detected. Add max_clusters arg.
Update detection logic to check for smt0 - previously we pinned to some siblings.

PiperOrigin-RevId: 659755311
2024-08-05 18:50:09 -07:00
Phil Culliton 1982a6ba00 Internal change
PiperOrigin-RevId: 657831926
2024-07-30 20:24:54 -07:00
Jan Wassenberg a24eda8d02 Split matmul into matvec; add large matrix benchmark
Rename var names to row/col for more clarity.
Better estimate error tolerance via max abs col sum.

PiperOrigin-RevId: 657601791
2024-07-30 08:29:11 -07:00
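Commit a24eda8d02 mentions the "max abs col sum", which is the largest sum of absolute values over any column of a matrix (the induced 1-norm). The sketch below only computes that quantity; how the test turns it into a tolerance is not stated in the message, and `MaxAbsColSum` is a hypothetical name.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest sum of absolute values over any column of a row-major matrix.
float MaxAbsColSum(const std::vector<float>& mat, std::size_t rows,
                   std::size_t cols) {
  float max_sum = 0.0f;
  for (std::size_t c = 0; c < cols; ++c) {
    float sum = 0.0f;
    for (std::size_t r = 0; r < rows; ++r) {
      sum += std::fabs(mat[r * cols + c]);
    }
    max_sum = std::max(max_sum, sum);
  }
  return max_sum;
}
```

For the 2x2 matrix `{1, -2, 3, 4}` the column sums are 4 and 6, so the function returns 6.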
Paul Chang d37c088e44 Extend LayersOutputFunc to take query index and auxiliary int
PiperOrigin-RevId: 657574814
2024-07-30 06:53:56 -07:00
Jan Wassenberg 8b4915f321 Fix Windows build - macro conflict with param name
PiperOrigin-RevId: 657518587
2024-07-30 03:22:32 -07:00
Jan Wassenberg 6ea4232b2e MatMul cleanup: Mat struct, simplify args.
Add large benchmark to test, use 4 threads, skip some targets.
Also use Traits::Name instead of typeid.

PiperOrigin-RevId: 657496185
2024-07-30 01:55:50 -07:00
Jan Wassenberg f27683152c 1.05x prefill speedup: matvec -> matmul for !MHA
Also add C_stride and make shape normal non-template arguments.

PiperOrigin-RevId: 657285945
2024-07-29 12:18:06 -07:00
Jan Wassenberg 2721f54446 Add offset arg to MatMul, rename, Matmul for logits = ~1.1x decode speedup
PiperOrigin-RevId: 657167257
2024-07-29 05:34:26 -07:00
Jan Wassenberg aaf51898b6 Major revamp #2 of Prefill: fix token order, parallel for multi-query
- Allocate only the required KV caches and activation batch size
- Add flags for batch sizes
- Const-correct interface: Span of const int.
- Also clean up the KVCache arg to a span.
- Move kPrefillBatchSize into RuntimeConfig and remove related global constants.

PiperOrigin-RevId: 655893197
2024-07-25 03:28:55 -07:00
Daniel Keysers 2346b5a434 Minor polishing: adding comments, renaming variables.
PiperOrigin-RevId: 655235006
2024-07-23 11:17:44 -07:00
Daniel Keysers 33334ad454 Fix msan uninitialized scale in optimize_test
PiperOrigin-RevId: 654817460
2024-07-22 10:50:25 -07:00
Jan Wassenberg 85cac13fb1 Split up ops.h into ops/ops-inl and matmul-inl
PiperOrigin-RevId: 654068303
2024-07-19 11:21:48 -07:00
Jan Wassenberg 5844e6a1e5 Cleanup: add wrapper functions and rename vars to interleaved
Simplifies the TransformerLayer function.
Use interleaved* instead of _and_queries.

PiperOrigin-RevId: 653929449
2024-07-19 02:04:11 -07:00
Jan Wassenberg 12016d31c3 Major Prefill/Generate cleanup, 1.3x Prefill speedup
This fixes TTFT, which was not including prefill.

PiperOrigin-RevId: 653690626
2024-07-18 11:16:46 -07:00
Jan Wassenberg 3fe79b3876 Fix msan uninitialized scale
PiperOrigin-RevId: 653655471
2024-07-18 09:42:31 -07:00
Daniel Keysers e87e65ca45 Add scale parameter to MatMul.
Add accessor to CompressedArray that asserts the scale is 1 and use it.

PiperOrigin-RevId: 653604840
2024-07-18 06:58:56 -07:00
Daniel Keysers 5a751a9a44 Update gemma-27b to the correct query scaling.
PiperOrigin-RevId: 653201646
2024-07-17 05:43:52 -07:00
Jan Wassenberg 992a2cbbc0 De-templatize Activations, add RowVectorBatch class
Also remove most kBatchSize args.

PiperOrigin-RevId: 653185525
2024-07-17 04:38:15 -07:00
Daniel Keysers ff34370aac Simplify FFW by using MatMul_4x4_Batch_Add.
Affects only the griffin model, where prefill TPS improves by about 70%.

PiperOrigin-RevId: 652878176
2024-07-16 09:41:23 -07:00
Jan Wassenberg cd530374b3 Further 1.02x prefill speedup from batch 64->512
Measured on SKX. Larger speedup expected for Zen4/SPR.

PiperOrigin-RevId: 652472928
2024-07-15 07:26:10 -07:00
The gemma.cpp Authors c879133a5a Increase the prefill batch size to 64.
PiperOrigin-RevId: 651754772
2024-07-12 06:28:37 -07:00
The gemma.cpp Authors df3fb70802 Improve readability with RepeatedAttentionWindowSizes
PiperOrigin-RevId: 651431738
2024-07-11 09:11:46 -07:00
Jan Wassenberg edaf61b983 SVE build fix: avoid capturing vectors directly.
Also use more V typedef instead of auto.

PiperOrigin-RevId: 651423685
2024-07-11 08:43:56 -07:00
Jan Wassenberg be765afce2 Simplify matmul: only 2 overloads
Also add StoreHorizontalSumsMaybeAdd wrapper function,
move MatMulSlowBatch into test.

1.02-1.06x speedup.

PiperOrigin-RevId: 651394791
2024-07-11 06:58:42 -07:00
Andrey Vlasov 3e92088595 Remove allocation from GEMM_4x4_Tile when decoding compressed weights by implementing
SfpCodec::Dec2F and CompressTraits<T>::Decompress2 for all supported types. This also allows removing one of the specializations of GEMM_4x4_Tile, so that one function handles compressed MatB. As before, computations use 32-bit registers even when MatA is bf16.

Measurements for a 2b-it sfp-encoded model on an AMD Ryzen Threadripper PRO 3945WX 12-Cores:
baseline:
```
32.6254 prefill tokens / sec
8.91429 tokens / sec
115 milliseconds time to first token
```
this change:
```
54.3045 prefill tokens / sec
16.8191 tokens / sec
56 milliseconds time to first token
```
PiperOrigin-RevId: 651369694
2024-07-11 05:13:39 -07:00
Kan Wu f519ab6693 Refactor configurables.
PiperOrigin-RevId: 651259154
2024-07-10 21:30:58 -07:00
Andrey Vlasov 960ff4b4ec Record time measurements in MatMul tests.
PiperOrigin-RevId: 651060711
2024-07-10 10:04:40 -07:00
Daniel Keysers 063bbaa683 Add more comments to attention computation (and some small restructuring).
PiperOrigin-RevId: 650929097
2024-07-10 02:39:07 -07:00
Jan Wassenberg 6a3f7cf3ea Lint fix - string append, remove stale TODO
PiperOrigin-RevId: 650197468
2024-07-08 04:11:21 -07:00
Jan Wassenberg cbb67b4ee0 Move benchmark_helper to evals/, weights_raw to compression/.
PiperOrigin-RevId: 650155983
2024-07-08 01:13:23 -07:00
Jan Wassenberg 438b1bace2 Fix handling of %c and %q if eot_string. Fixes #283, thanks @ljcucc
PiperOrigin-RevId: 649651535
2024-07-05 07:54:00 -07:00
Jan Wassenberg 118e802b00 Fix gemma_test - moved to evals/.
PiperOrigin-RevId: 649338633
2024-07-04 02:04:05 -07:00
Jan Wassenberg c7c3daa624 7x compile time speedup: shard gemma.cc
Use overloaded functions defined in gemma/instantiations.
Also split out activations.h.

PiperOrigin-RevId: 649053122
2024-07-03 06:35:04 -07:00
Daniel Keysers a40165dea2 Small cleanups. Fixes gemma_test build.
PiperOrigin-RevId: 649008524
2024-07-03 03:13:38 -07:00
Kan Wu 7e4b20455e Add sliding window attention for Gemma 2.
PiperOrigin-RevId: 648778253
2024-07-02 11:08:03 -07:00
Jan Wassenberg 09a7e75ead Prep for sharding gemma.cc: split into kv_cache, tokenizer.
Move activations.h to backprop/ to make space for another activations.h.

PiperOrigin-RevId: 648744500
2024-07-02 09:31:06 -07:00
Jan Wassenberg 85fcd3cd80 Cleanup: add ModelInfo struct, remove gcpp::
PiperOrigin-RevId: 648707763
2024-07-02 07:11:15 -07:00
Jan Wassenberg b1c1ec1d59 Use benchmark_helper in py bindings (adds BOS)
Also remove thread clamp (OK to be zero or large).

PiperOrigin-RevId: 648657155
2024-07-02 03:27:15 -07:00
Jan Wassenberg e527e7662e Remove unused kSystemPrompt
PiperOrigin-RevId: 648429567
2024-07-01 11:18:07 -07:00
Jan Wassenberg af8eb2fde3 Declutter gemma/ directory, move binaries to evals/ and util/.
PiperOrigin-RevId: 648400795
2024-07-01 09:51:04 -07:00
Jan Wassenberg e588a7f45d Add config for att/final cap, skip max-subtract. Fixes #278
Also update includes/deps for backprop/.

PiperOrigin-RevId: 648399222
2024-07-01 09:45:26 -07:00
The gemma.cpp Authors da7507e6f0 Add prompt batching to Gemma.cpp.
This CL adds a new function to Gemma that allows for batching of multiple prompts. The function takes a vector of prompts and returns a vector of responses. The prompts are processed in parallel, and the responses are returned in the same order as the prompts.

PiperOrigin-RevId: 648367559
2024-07-01 07:51:31 -07:00
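The interface described in commit da7507e6f0 (a vector of prompts in, a vector of responses out, same order) can be sketched as follows. This uses one `std::async` task per prompt purely for illustration; it is not how gemma.cpp batches queries, and `GenerateBatchSketch` and the `generate` callback are hypothetical.

```cpp
#include <functional>
#include <future>
#include <string>
#include <vector>

// Runs `generate` on every prompt concurrently and returns the responses in
// the same order as the prompts.
std::vector<std::string> GenerateBatchSketch(
    const std::vector<std::string>& prompts,
    const std::function<std::string(const std::string&)>& generate) {
  std::vector<std::future<std::string>> pending;
  pending.reserve(prompts.size());
  for (const std::string& prompt : prompts) {
    // One task per prompt; `prompt` and `generate` outlive the task because
    // all futures are waited on before this function returns.
    pending.push_back(std::async(std::launch::async, [&generate, &prompt] {
      return generate(prompt);
    }));
  }
  std::vector<std::string> responses;
  responses.reserve(prompts.size());
  // Collecting in submission order keeps responses aligned with prompts.
  for (std::future<std::string>& f : pending) {
    responses.push_back(f.get());
  }
  return responses;
}
```

Ordering comes from collecting the futures in submission order, not from the order in which the tasks happen to finish.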
Paul Chang 8ac5d66575 Introduce new Gemma 9B and 27B configs
PiperOrigin-RevId: 647299080
2024-06-27 06:45:24 -07:00
Paul Chang 78e96fdc70 Refactor model type / training tables, simplify reverse mapping
PiperOrigin-RevId: 647069372
2024-06-26 13:59:14 -07:00
The gemma.cpp Authors 7fc8ddf825 Fix a clang tidy warning
PiperOrigin-RevId: 646498062
2024-06-25 09:02:59 -07:00
The gemma.cpp Authors 12089417b5 Improve logging when running Gemma examples: fix the issue where max_tokens, max_generated_tokens and temperature were logged without any trailing space/newline.
PiperOrigin-RevId: 646014268
2024-06-24 02:00:34 -07:00
The gemma.cpp Authors 80b1347393 Skip the last RMSNormInplaceBatched in the Prefill phase.
That only modifies activations.x, but it is called with prefill_activations which are not used after the Prefill call.

PiperOrigin-RevId: 645391387
2024-06-21 08:04:22 -07:00
Copybara-Service 82f16087ba Merge pull request #266 from ufownl:bugfix/kvcache
PiperOrigin-RevId: 645329504
2024-06-21 03:06:52 -07:00
RangerUFO f7855251ea Fix compilation errors in clang
They occur on `ubuntu-latest` in GitHub Actions.
2024-06-21 13:40:40 +08:00
RangerUFO d7787c8f6c Fix KV cache size calculation error
2024-06-21 13:06:26 +08:00
Daniel Keysers 0570972d43 Fixing two typos.
PiperOrigin-RevId: 645103198
2024-06-20 11:33:12 -07:00
The gemma.cpp Authors a85725614a Refactor kCachePosSize and kCacheLayerSize into separate functors.
PiperOrigin-RevId: 645048519
2024-06-20 08:52:08 -07:00
Jan Wassenberg 48ebba8b7a Code cleanup
- Simplify template arg list, enable deduction
- missing hn:: on " Lanes"
- 1.0f suffix
- move RMSNormBatched into ops.h
- static constexpr -> constexpr
- concrete type instead of LayerT, WeightArrayT
- inline GetWeights
- remove if (runtime_config.verbosity
- merge AllocatePrefill and AllocateDecode
- remove bf_ffw_hidden

PiperOrigin-RevId: 644931277
2024-06-20 01:10:24 -07:00
The gemma.cpp Authors 658fb3e506 Move test placeholder to a later pos.
PiperOrigin-RevId: 644808456
2024-06-19 13:24:10 -07:00
The gemma.cpp Authors 0e612d9a20 Split out common parts (embedder and transformer block) from Prefill() and Transformer() into separate functions.
PiperOrigin-RevId: 644455520
2024-06-18 11:24:56 -07:00
Paul Chang d7d9d14f0e Move kGriffinLayers into ConfigNoSSM, set kGemmaLayers directly
For regular (non-SSM) Gemma models, kGriffinLayers is by definition always zero
and kGemmaLayers is just the number of layers.

PiperOrigin-RevId: 644384531
2024-06-18 07:52:52 -07:00
Jan Wassenberg 70506b0a62 Fix debug_prompt and other binaries (internal init)
PiperOrigin-RevId: 644367683
2024-06-18 06:48:59 -07:00
Jan Wassenberg 15135f5b3d Simplify Attention.
Shared kMHA, reuse from Activations,
inline Attn lambda, use QDim as the stride between successive Q.

PiperOrigin-RevId: 644343854
2024-06-18 05:08:12 -07:00
Jan Wassenberg 2ac47e4a06 Fix Py binding/run_example: use GemmaEnv
PiperOrigin-RevId: 644318962
2024-06-18 03:20:22 -07:00
Jan Wassenberg a07f60c9a1 1.15x 7b sfp prefill speedup: Matmul in attention
2b bf16:
prefill 114.456 -> 115.222
decode  16.8847 -> 16.9987

7b sfp:
prefill 18.8575 -> 21.7325
decode 5.68428 -> 5.79791

PiperOrigin-RevId: 644283676
2024-06-18 01:00:51 -07:00
Jan Wassenberg 704d936764 Further simplification to ForEachTensor, thanks I.K.
PiperOrigin-RevId: 643996210
2024-06-17 07:12:26 -07:00
Jan Wassenberg 7d0720675f Move raw_weights into separate header, used mainly by compress_weights.
Fix warnings in backprop/* (include)

PiperOrigin-RevId: 643983136
2024-06-17 06:17:02 -07:00
Jan Wassenberg ad790d89d1 Fix DASSERT - TiledBatch requires at least 2 vectors.
Also use shorthand for weight types.

PiperOrigin-RevId: 643958371
2024-06-17 04:29:01 -07:00
The gemma.cpp Authors 7dbfa44794 Refactor CompressedWeights.
PiperOrigin-RevId: 643934198
2024-06-17 02:54:54 -07:00
Ray Smith e0afdfa8fb Added bias vector addition to MatMul
PiperOrigin-RevId: 643385381
2024-06-14 10:25:16 -07:00
The gemma.cpp Authors 2228055bb8 Internal change.
PiperOrigin-RevId: 643330703
2024-06-14 06:53:41 -07:00
Jan Wassenberg 29c0c574e6 Integrate matmul into FFW: 4.3x prefill speedup
```
before, bf16:
27.2929 prefill tokens / sec
17.2114 tokens / sec

after, bf16
116.496 prefill tokens / sec
17.5391 tokens / sec
```

PiperOrigin-RevId: 643328437
2024-06-14 06:32:26 -07:00
Ray Smith 198326a682 Removed now redundant non-batch matmul
PiperOrigin-RevId: 643317187
2024-06-14 05:13:36 -07:00
Andrey Vlasov b17631c95f Implement a missing (bf16, f32) tiled MatMul kernel.
PiperOrigin-RevId: 643313676
2024-06-14 04:54:40 -07:00
Jan Wassenberg d3c6a45b59 Major duplicated code reduction in test/benchmarks
Helper functions to tokenize/wrap
Move LayersOutputFunc into RuntimeConfig
AcceptFunc passes the probability
Implement StringFromType using the parser, and verify results match

PiperOrigin-RevId: 643255119
2024-06-14 00:16:25 -07:00
Jan Wassenberg c15ff9529c Reduce duplication in Config* by inheriting no-SSM
PiperOrigin-RevId: 643030629
2024-06-13 09:48:56 -07:00
Ray Smith ea525da967 Added MatMul_4x4_Batch which is MatMul_4x4, but with the first template arg moved to the first function arg, so the batch size (num A rows) can be variable at run-time.
PiperOrigin-RevId: 643017973
2024-06-13 09:05:40 -07:00
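The change in commit ea525da967, moving the batch size from a template argument to a function argument, can be illustrated with a naive reference matmul. This is a sketch only: `MatMulBatchSketch` and `kColsBC` are hypothetical names, and the real kernel is tiled 4x4, not a triple loop.

```cpp
#include <cstddef>

// Naive row-major C = A * B. The inner and output dimensions remain template
// arguments; the number of A rows (the batch size) is a run-time argument,
// so one instantiation serves any batch size.
template <std::size_t kColsA_RowsB, std::size_t kColsBC>
void MatMulBatchSketch(std::size_t batch_size, const float* a, const float* b,
                       float* c) {
  for (std::size_t i = 0; i < batch_size; ++i) {
    for (std::size_t j = 0; j < kColsBC; ++j) {
      float sum = 0.0f;
      for (std::size_t k = 0; k < kColsA_RowsB; ++k) {
        sum += a[i * kColsA_RowsB + k] * b[k * kColsBC + j];
      }
      c[i * kColsBC + j] = sum;
    }
  }
}
```

A caller can then pass the actual number of tokens in the current batch, for example a short final batch during prefill, without instantiating another template.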
The gemma.cpp Authors 1b40619864 Increase parallelism in ops_test
PiperOrigin-RevId: 643013415
2024-06-13 08:50:41 -07:00
Andrey Vlasov 38eb452b94 Support mixed (bf16, sfp) tiled MatMul.
Same sfp-decompress strategy as in (f32, sfp) tiled MatMul.

PiperOrigin-RevId: 642901844
2024-06-13 02:07:21 -07:00
Daniel Keysers 6e67a6d8a9 Tiny cleanup: distinguish between "ids" and "pieces" in argument names when encoding.
PiperOrigin-RevId: 642614278
2024-06-12 07:52:13 -07:00
Daniel Keysers 1ac9857014 Extends Transformer() to prepare for batched processing.
PiperOrigin-RevId: 642603025
2024-06-12 07:01:03 -07:00
The gemma.cpp Authors 2a0e6ee976 Fix numerical issue in Softcap by subtracting max.
Also update test threshold.

PiperOrigin-RevId: 642587468
2024-06-12 05:42:16 -07:00
The gemma.cpp Authors f467670de7 Implement float * SfpStream matmul by decompressing 4 * kColsA_RowsB -sized chunks of the second matrix.
PiperOrigin-RevId: 642533996
2024-06-12 01:11:59 -07:00
Ray Smith bdf33c7008 Updated benchmarks.cc to recent changes to Gemma API.
PiperOrigin-RevId: 642285902
2024-06-11 08:55:40 -07:00
Phil Culliton b6565e3bf6 Update AssertClose for large matrices and add large matrix test
PiperOrigin-RevId: 642277221
2024-06-11 08:22:47 -07:00
Jan Wassenberg 3e2396f98c Use Loader/AppArgs to construct gemma_test model, simplify AcceptFunc
accept_token: allow default, check if empty when using
allow mixing sample_func and stream_func, call the latter after the former
Also fix missing includes/deps.
PiperOrigin-RevId: 642240012
2024-06-11 05:53:10 -07:00
Daniel Keysers c557ad23a8 Adds simple-loop versions of missing batched functions.
PiperOrigin-RevId: 642189741
2024-06-11 02:14:02 -07:00
Jan Wassenberg c7f5e93136 Update benchmark with internal init
PiperOrigin-RevId: 641929308
2024-06-10 09:35:16 -07:00
Copybara-Service 49d814b519 Merge pull request #224 from szabadka:cleanup
PiperOrigin-RevId: 641922102
2024-06-10 09:11:13 -07:00
Jan Wassenberg c1c6714ad4 Internal experiment
PiperOrigin-RevId: 641915024
2024-06-10 08:46:10 -07:00
Zoltan Szabadka a3a75b77f9 Use CompressedWeights<TConfig<float>> in backpropagation.
kWeightsAreCompressed is removed and LoadRawWeights is moved
to compress_weights.cc
2024-06-10 14:34:24 +00:00
Phil Culliton c5bcb5438c Fix for transpose matrix creation and additional tests
PiperOrigin-RevId: 641868053
2024-06-10 05:24:04 -07:00
Jan Wassenberg 36e6915e18 Add CPU output, error if not C++17, simplify tokenizer ctor
PiperOrigin-RevId: 641850879
2024-06-10 04:01:11 -07:00
Phil Culliton d985d8b867 Shifting large matrix init to heap in ops_test.cc
PiperOrigin-RevId: 641311100
2024-06-07 11:38:42 -07:00
Jan Wassenberg f9b390b134 Support all weight types in a single binary.
This changes the command line flags, but the default value retains the previous behavior.

Also add a CreateGemma helper to enable extra args without interface changes.

PiperOrigin-RevId: 641266411
2024-06-07 09:04:45 -07:00
Copybara-Service 24db2ff725 Merge pull request #217 from szabadka:cross-entropy
PiperOrigin-RevId: 641241133
2024-06-07 07:17:35 -07:00
Daniel Keysers 06f814fc8b Small code cleanup suggestions while reading the code.
PiperOrigin-RevId: 641220788
2024-06-07 05:33:17 -07:00
Zoltan Szabadka 465998d25a Add support for custom sampling function to runtime config.
With this addition the ComputeCrossEntropy function can be moved
to its own library, because now we can compute it using only the
public API functions from gemma.h
2024-06-07 11:45:07 +00:00
Copybara-Service f7ac7092d6 Merge pull request #212 from szabadka:adam2
PiperOrigin-RevId: 641182573
2024-06-07 02:25:18 -07:00
Zoltan Szabadka c004799cdc Add Adam optimizer.
Drive-by: Fix compilation errors and tests for backprop functions.
2024-06-06 18:41:36 +00:00