Commit Graph

434 Commits

Author SHA1 Message Date
Jan Wassenberg 3fe79b3876 Fix msan uninitialized scale
PiperOrigin-RevId: 653655471
2024-07-18 09:42:31 -07:00
Daniel Keysers e87e65ca45 Add scale parameter to MatMul.
Add accessor to CompressedArray that asserts the scale is 1 and use it.

PiperOrigin-RevId: 653604840
2024-07-18 06:58:56 -07:00
Daniel Keysers 5a751a9a44 Update gemma-27b to the correct query scaling.
PiperOrigin-RevId: 653201646
2024-07-17 05:43:52 -07:00
Jan Wassenberg 992a2cbbc0 De-templatize Activations, add RowVectorBatch class
Also remove most kBatchSize args.

PiperOrigin-RevId: 653185525
2024-07-17 04:38:15 -07:00
Daniel Keysers ff34370aac Simplify FFW by using MatMul_4x4_Batch_Add.
Affects only the griffin model, where prefill TPS improves by about 70%.

PiperOrigin-RevId: 652878176
2024-07-16 09:41:23 -07:00
Paul Chang 48b900b1b9 Fix examples/hello_world for real.
PiperOrigin-RevId: 652509319
2024-07-15 09:38:52 -07:00
Jan Wassenberg cd530374b3 Further 1.02x prefill speedup from batch 64->512
Measured on SKX. Larger speedup expected for Zen4/SPR.

PiperOrigin-RevId: 652472928
2024-07-15 07:26:10 -07:00
Paul Chang aaee666a1d Fix gemma_cpp/examples/hello_world build.
Include Bazel build rules, too.

PiperOrigin-RevId: 652469406
2024-07-15 07:11:01 -07:00
The gemma.cpp Authors c879133a5a Increase the prefill batch size to 64.
PiperOrigin-RevId: 651754772
2024-07-12 06:28:37 -07:00
The gemma.cpp Authors df3fb70802 Improve readability with RepeatedAttentionWindowSizes
PiperOrigin-RevId: 651431738
2024-07-11 09:11:46 -07:00
Jan Wassenberg edaf61b983 SVE build fix: avoid capturing vectors directly.
Also use more V typedef instead of auto.

PiperOrigin-RevId: 651423685
2024-07-11 08:43:56 -07:00
Jan Wassenberg be765afce2 Simplify matmul: only 2 overloads
Also add StoreHorizontalSumsMaybeAdd wrapper function,
move MatMulSlowBatch into test.

1.02-1.06x speedup.

PiperOrigin-RevId: 651394791
2024-07-11 06:58:42 -07:00
Andrey Vlasov 3e92088595 Remove allocation from GEMM_4x4_Tile when decoding compressed weights by implementing
SfpCodec::Dec2F and ComressTraits<T>::Decompress2 for all supported types. It also allows to remove one of the specializations of GEMM_4x4_Tile, handling compressed MatB with one function. As before even when MatA is bf16 it is using 32-bit registers for computations.

Measurements for a 2b-it sfp-encoded model on a  AMD Ryzen Threadripper PRO 3945WX 12-Cores:
baseline:
```
32.6254 prefill tokens / sec
8.91429 tokens / sec
115 milliseconds time to first token
```
this change:
```
54.3045 prefill tokens / sec
16.8191 tokens / sec
56 milliseconds time to first token
```
PiperOrigin-RevId: 651369694
2024-07-11 05:13:39 -07:00
Kan Wu f519ab6693 Refactor configurables.
PiperOrigin-RevId: 651259154
2024-07-10 21:30:58 -07:00
Andrey Vlasov 960ff4b4ec Record time measurements in MatMul tests.
PiperOrigin-RevId: 651060711
2024-07-10 10:04:40 -07:00
Jan Wassenberg ee6e017a77 Fix windows build: min conflict, unused VF
PiperOrigin-RevId: 650955138
2024-07-10 04:18:25 -07:00
Daniel Keysers 063bbaa683 Add more comments to attention computation (and some small restructuring).
PiperOrigin-RevId: 650929097
2024-07-10 02:39:07 -07:00
Daniel Keysers cf76f0a401 Update gemma_test to also pass for the v1.1. models.
Make it an error if the model cannot be loaded.

PiperOrigin-RevId: 650232602
2024-07-08 06:45:37 -07:00
Jan Wassenberg 6a3f7cf3ea Lint fix - string append, remove stale TODO
PiperOrigin-RevId: 650197468
2024-07-08 04:11:21 -07:00
Jan Wassenberg cbb67b4ee0 Move benchmark_helper to evals/, weights_raw to compression/.
PiperOrigin-RevId: 650155983
2024-07-08 01:13:23 -07:00
Daniel Keysers cdebcc3533 Update gemma_test with the expected entropy values for the IT models of size 2B/7B/9B/27B.
PiperOrigin-RevId: 649662047
2024-07-05 08:58:51 -07:00
Jan Wassenberg 438b1bace2 Fix handling of %c and %q if eot_string. Fixes #283, thanks @ljcucc
PiperOrigin-RevId: 649651535
2024-07-05 07:54:00 -07:00
Jan Wassenberg f823371691 Cleanup: move util/compress and convert_weights to compression/
Also remove unused models/, lint convert_weights

PiperOrigin-RevId: 649613088
2024-07-05 04:16:52 -07:00
Jan Wassenberg 41efec4dba Add Py bindings for weight compression
TODO: this uses clif instead of pybind11, and depends on absl.

PiperOrigin-RevId: 649575815
2024-07-05 01:06:00 -07:00
Jan Wassenberg 118e802b00 Fix gemma_test - moved to evals/.
PiperOrigin-RevId: 649338633
2024-07-04 02:04:05 -07:00
Jan Wassenberg c7c3daa624 7x compile time speedup: shard gemma.cc
Use overloaded functions defined in gemma/instantiations.
Also split out activations.h.

PiperOrigin-RevId: 649053122
2024-07-03 06:35:04 -07:00
Daniel Keysers a40165dea2 Small cleanups. Fixes gemma_test build.
PiperOrigin-RevId: 649008524
2024-07-03 03:13:38 -07:00
Kan Wu 7e4b20455e Add sliding window attention for Gemma 2.
PiperOrigin-RevId: 648778253
2024-07-02 11:08:03 -07:00
Jan Wassenberg 09a7e75ead Prep for sharding gemma.cc: split into kv_cache, tokenizer.
Move activations.h to backprop/ to make space for another activations.h.

PiperOrigin-RevId: 648744500
2024-07-02 09:31:06 -07:00
Jan Wassenberg 85fcd3cd80 Cleanup: add ModelInfo struct, remove gcpp::
PiperOrigin-RevId: 648707763
2024-07-02 07:11:15 -07:00
Jan Wassenberg b1c1ec1d59 Use benchmark_helper in py bindings (adds BOS)
Also remove thread clamp (OK to be zero or large).

PiperOrigin-RevId: 648657155
2024-07-02 03:27:15 -07:00
Jan Wassenberg e527e7662e Remove unused kSystemPrompt
PiperOrigin-RevId: 648429567
2024-07-01 11:18:07 -07:00
Jan Wassenberg af8eb2fde3 Declutter gemma/ directory, move binaries to evals/ and util/.
PiperOrigin-RevId: 648400795
2024-07-01 09:51:04 -07:00
Jan Wassenberg e588a7f45d Add config for att/final cap, skip max-subtract. Fixes #278
Also update includes/deps for backprop/.

PiperOrigin-RevId: 648399222
2024-07-01 09:45:26 -07:00
The gemma.cpp Authors da7507e6f0 Add prompt batching to Gemma.cpp.
This CL adds a new function to Gemma that allows for batching of multiple prompts. The function takes a vector of prompts and returns a vector of responses. The prompts are processed in parallel, and the responses are returned in the same order as the prompts.

PiperOrigin-RevId: 648367559
2024-07-01 07:51:31 -07:00
Paul Chang 8ac5d66575 Introduce new Gemma 9B and 27B configs
PiperOrigin-RevId: 647299080
2024-06-27 06:45:24 -07:00
Paul Chang 78e96fdc70 Refactor model type / training tables, simplify reverse mapping
PiperOrigin-RevId: 647069372
2024-06-26 13:59:14 -07:00
Paul Chang aa57fc3952 Remove unused BUILD dependency
PiperOrigin-RevId: 646519547
2024-06-25 10:12:13 -07:00
The gemma.cpp Authors 7fc8ddf825 Fix a clang tidy warning
PiperOrigin-RevId: 646498062
2024-06-25 09:02:59 -07:00
The gemma.cpp Authors ef786f1bfc Use hwy::ThreadPool::MaxThreads() to determine the number of threads to use.
PiperOrigin-RevId: 646117298
2024-06-24 09:16:04 -07:00
The gemma.cpp Authors 12089417b5 Improve logging when running Gemma examples: fix the issue when max_tokens, max_generated_tokens and temperature were logging without any trailing space/newline.
PiperOrigin-RevId: 646014268
2024-06-24 02:00:34 -07:00
The gemma.cpp Authors 80b1347393 Skip the last RMSNormInplaceBatched in the Prefill phase.
That only modifies activations.x, but it is called with prefill_activations which are not used after the Prefill call.

PiperOrigin-RevId: 645391387
2024-06-21 08:04:22 -07:00
Copybara-Service 82f16087ba Merge pull request #266 from ufownl:bugfix/kvcache
PiperOrigin-RevId: 645329504
2024-06-21 03:06:52 -07:00
Copybara-Service c2efcb0da4 Merge pull request #267 from ufownl:bugfix/clang_ce
PiperOrigin-RevId: 645329422
2024-06-21 03:06:04 -07:00
RangerUFO f7855251ea Fix compilation errors in clang
It will occur in `ubuntu-latest` of GitHub Actions.
2024-06-21 13:40:40 +08:00
RangerUFO d7787c8f6c Fix KV cache size calculation error 2024-06-21 13:06:26 +08:00
Daniel Keysers 0570972d43 Fixing two typos.
PiperOrigin-RevId: 645103198
2024-06-20 11:33:12 -07:00
The gemma.cpp Authors a85725614a Refactor kCachePosSize and kCacheLayerSize into separate functors.
PiperOrigin-RevId: 645048519
2024-06-20 08:52:08 -07:00
Jan Wassenberg 48ebba8b7a Code cleanup
- Simplify template arg list, enable deduction
- missing hn:: on " Lanes"
- 1.0f suffix
- move RMSNormBatched into ops.h
- static constexpr -> constexpr
- concrete type instead of LayerT, WeightArrayT
- inline GetWeights
- remove if (runtime_config.verbosity
- merge AllocatePrefill and AllocateDecode
- remove bf_ffw_hidden

PiperOrigin-RevId: 644931277
2024-06-20 01:10:24 -07:00
The gemma.cpp Authors 658fb3e506 Move test placeholder to a later pos.
PiperOrigin-RevId: 644808456
2024-06-19 13:24:10 -07:00