Daniel Keysers
a8e08778d4
Add an additional QueryModel() overload to GemmaEnv.
...
Use args only in GemmaEnv constructor, store everything else in RuntimeConfig.
Add runtime option to turn off thread spinning.
PiperOrigin-RevId: 670467320
2024-09-03 02:25:19 -07:00
Zoltan Szabadka
f6abbab3a4
Fix asan failure in local attention computation.
...
PiperOrigin-RevId: 670207380
2024-09-02 07:06:10 -07:00
Jan Wassenberg
4033ed9e78
Avoid duplication of RMSNorm, support all activation/weight types
...
Add test for RMSNorm
Rename VectorizedRopeAndMulBy -> RopeAndMulBy
Move test_util to util/
PiperOrigin-RevId: 668332927
2024-08-28 01:26:55 -07:00
Daniel Keysers
18e6012872
Fix prefill for batched queries.
...
This lets gemma_test/GeographyBatched pass for gemma2-27B as well.
PiperOrigin-RevId: 664827485
2024-08-19 08:50:42 -07:00
Apoorv Reddy
c6eb3b6f0d
VectorizedRopeAndMulBy.
...
~8x reduction (tested on a few prompts) in Rope.
~3.8% prefill latency improvement.
~2.6% decode latency improvement.
PiperOrigin-RevId: 664650108
2024-08-18 23:17:01 -07:00
Paul Chang
773333e5be
Expose underlying model configuration: number of layers, heads, etc.
...
PiperOrigin-RevId: 663747853
2024-08-16 09:03:24 -07:00
Jan Wassenberg
301dc8067a
Major MatMul update, 1.9-2.3x speedup on Zen4 via bf16 mul
...
Supports converting all weight/activation formats to native MulT (bf16/f32)
Also:
- ConstMat/MutableMat for const correctness
- Move RowVectorBatch to allocator.h so it can be used from Matmul
- Add matmul.h so MatMulEnv can be used from Activations
- Remove kMaxThreads, detect from PerClusterPools
- Build fix: -inl.h files must be textual_hdrs, and highway.h should precede -inl.h
```
zen4 new
64, 24576, 3072, add=0, MatTA=bf16, MatTB=sfp: 616.6 GFLOPS.
64, 3072, 24576, add=0, MatTA=bf16, MatTB=sfp: 460.7 GFLOPS.
64, 24576, 3072, add=0, MatTA=f32, MatTB=sfp: 598.6 GFLOPS.
64, 3072, 24576, add=0, MatTA=f32, MatTB=sfp: 435.6 GFLOPS.
zen4 old
64, 24576, 3072, add=0, MatTA=f32, MatTB=sfp: 257.5 GFLOPS.
64, 3072, 24576, add=0, MatTA=f32, MatTB=sfp: 231.9 GFLOPS.
```
PiperOrigin-RevId: 663729812
2024-08-16 07:52:20 -07:00
Paul Chang
b9ed12a325
Support directly observing activations, partially replacing LayersOutputFunc
...
LayersOutputFunc is no longer invoked for "blocks" and "final_norm" outputs.
Instead, we directly expose the Activations structure.
PiperOrigin-RevId: 663409316
2024-08-15 12:39:07 -07:00
Jan Wassenberg
22995c699d
Simplify pos handling, auto-increment output arg
...
- no longer multiply by num_queries
- remove unused interleaved prompts
- Rename to Queries*
- Rename batch_start/interleaved_pos/pos to queries_pos
PiperOrigin-RevId: 663331823
2024-08-15 09:25:26 -07:00
Copybara-Service
6763afcd1c
Merge pull request #348 from ufownl:feature/start_pos_per_query_reopen
...
PiperOrigin-RevId: 662533529
2024-08-13 08:51:06 -07:00
RangerUFO
8c634f6486
Fix the position calculation issue in the generation phase
2024-08-12 18:50:23 +02:00
RangerUFO
730b6bfc94
Implement `start_pos` per query for batch interface
2024-08-12 18:50:23 +02:00
Jan Wassenberg
b831fa8482
1.3x prefill, 0.95x decode: matmul replacing last matvec
...
Before 38.28, 9.17 (with profiler enabled, prompt = 330 tok)
```
Gen.FFW : 15414 x 4692352 = 24.166318
Gen.Attention.SumHeads : 15414 x 1394804 = 7.183451 !!
Gen.Embedding : 361 x 49961894 = 6.026297
Gen.Attention.QKV : 15414 x 1005125 = 5.176546
Gen.Attention.DotSoftmax : 15414 x 885480 = 4.560357
RopeAndMulBy : 696528 x 11867 = 2.761818
```
After 49.80, 8.68
```
Gen.FFW : 14448 x 5312783 = 25.646868
Gen.Embedding : 338 x 63044815 = 7.119845
Gen.Attention.QKV : 14448 x 1115003 = 5.382557
Gen.Attention.DotSoftmax : 14448 x 897577 = 4.332957
RopeAndMulBy : 673344 x 11886 = 2.674156
Gen.Attention.SumHeads : 14448 x 518291 = 2.501993 !!
```
PiperOrigin-RevId: 662024085
2024-08-12 03:36:01 -07:00
Jan Wassenberg
282f73ec2f
Add pin flag to disable pinning. Refs #338
...
PiperOrigin-RevId: 661389171
2024-08-09 13:47:12 -07:00
Apoorv Reddy
fd1b0743a7
Rename Gemma9B and Gemma27B to Gemma2_9B and Gemma2_27B.
...
This is to make it clear that these models are part of the Gemma2 family of models.
PiperOrigin-RevId: 661181682
2024-08-09 02:09:06 -07:00
Jan Wassenberg
2ebbe4076f
1.03-1.08x decode speedup: precompute Rope theta, fuse
...
Split attention into functions, move into class.
Fuse Rope and MulBy, allow non-in-place version to avoid copy from q to KV.
Sink if() into MaybeLogitsSoftCap.
PiperOrigin-RevId: 661168418
2024-08-09 01:23:24 -07:00
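The idea behind commit 2ebbe4076f (precompute the Rope angles' per-dimension factors, then fuse the rotation with the scaling step) can be sketched roughly as below. This is a scalar illustration only, not the gemma.cpp implementation: the names `PrecomputeInvTimescale` and `RopeAndMulBySketch`, the base of 10000, and the pairing of element `i` with element `i + dim/2` are all assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: inverse timescales 1 / base^(2i/dim) for i in
// [0, dim/2). They depend only on the head dimension, not on the position,
// so they can be computed once and reused for every token.
std::vector<float> PrecomputeInvTimescale(size_t dim, float base = 10000.0f) {
  assert(dim % 2 == 0);
  const size_t half = dim / 2;
  std::vector<float> inv(half);
  for (size_t i = 0; i < half; ++i) {
    const float exponent = static_cast<float>(2 * i) / static_cast<float>(dim);
    inv[i] = 1.0f / std::pow(base, exponent);
  }
  return inv;
}

// Hypothetical fused kernel: rotates each pair (x[i], x[i + half]) by the
// angle pos * inv[i] and multiplies by `mul` in the same pass. `out` may
// alias `x` (in-place) or point elsewhere, which avoids a separate copy.
void RopeAndMulBySketch(float mul, const float* x, size_t dim,
                        const std::vector<float>& inv, int pos, float* out) {
  const size_t half = dim / 2;
  assert(inv.size() == half);
  for (size_t i = 0; i < half; ++i) {
    const float theta = static_cast<float>(pos) * inv[i];
    const float c = std::cos(theta);
    const float s = std::sin(theta);
    // Read both inputs before writing so that out == x is safe.
    const float x0 = x[i];
    const float x1 = x[i + half];
    out[i] = mul * (x0 * c - x1 * s);
    out[i + half] = mul * (x0 * s + x1 * c);
  }
}
```

With this split, the per-token work is one sin/cos pair per rotated pair plus the fused multiply. The `std::pow` calls happen once up front.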
The gemma.cpp Authors
27258b03e6
Improve performance logging
...
PiperOrigin-RevId: 660534330
2024-08-07 14:15:43 -07:00
Jan Wassenberg
5e433e774a
1.1x prefill speedup, revamp threading in preparation for hierarchical parallelism.
...
Limit thread counts to detected. Add max_clusters arg.
Update detection logic to check for smt0 - previously we pinned to some siblings.
PiperOrigin-RevId: 659755311
2024-08-05 18:50:09 -07:00
Phil Culliton
1982a6ba00
Internal change
...
PiperOrigin-RevId: 657831926
2024-07-30 20:24:54 -07:00
Jan Wassenberg
a24eda8d02
Split matmul into matvec; add large matrix benchmark
...
Rename var names to row/col for more clarity.
Better estimate error tolerance via max abs col sum.
PiperOrigin-RevId: 657601791
2024-07-30 08:29:11 -07:00
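For readers unfamiliar with the "max abs col sum" mentioned in commit a24eda8d02: it is the largest sum of absolute values over any column of a matrix (the matrix 1-norm). A minimal sketch follows, assuming a row-major `std::vector<float>` layout. How the test turns this value into a tolerance is not stated in the log, so `ToleranceSketch` and its formula are purely hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest sum of absolute values over any column of a row-major
// rows x cols matrix.
float MaxAbsColSum(const std::vector<float>& m, size_t rows, size_t cols) {
  assert(m.size() == rows * cols);
  float max_sum = 0.0f;
  for (size_t c = 0; c < cols; ++c) {
    float sum = 0.0f;
    for (size_t r = 0; r < rows; ++r) {
      sum += std::fabs(m[r * cols + c]);
    }
    max_sum = std::max(max_sum, sum);
  }
  return max_sum;
}

// Hypothetical tolerance (an assumption, not taken from the source): scale a
// per-element rounding unit `eps` by the norms of both operands, so larger
// inputs are allowed proportionally larger absolute error.
float ToleranceSketch(float norm_a, float norm_b, float eps) {
  return eps * norm_a * norm_b;
}
```

Scaling the tolerance by operand magnitude, rather than using a fixed constant, keeps the check meaningful for both small and large matrices.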
Paul Chang
d37c088e44
Extend LayersOutputFunc to take query index and auxiliary int
...
PiperOrigin-RevId: 657574814
2024-07-30 06:53:56 -07:00
Jan Wassenberg
8b4915f321
Fix Windows build - macro conflict with param name
...
PiperOrigin-RevId: 657518587
2024-07-30 03:22:32 -07:00
Jan Wassenberg
6ea4232b2e
MatMul cleanup: Mat struct, simplify args.
...
Add large benchmark to test, use 4 threads, skip some targets.
Also use Traits::Name instead of typeid.
PiperOrigin-RevId: 657496185
2024-07-30 01:55:50 -07:00
Jan Wassenberg
f27683152c
1.05x prefill speedup: matvec -> matmul for !MHA
...
Also add C_stride and make shape normal non-template arguments.
PiperOrigin-RevId: 657285945
2024-07-29 12:18:06 -07:00
Jan Wassenberg
2721f54446
Add offset arg to MatMul, rename, Matmul for logits = ~1.1x decode speedup
...
PiperOrigin-RevId: 657167257
2024-07-29 05:34:26 -07:00
Jan Wassenberg
aaf51898b6
Major revamp #2 of Prefill: fix token order, parallel for multi-query
...
- Allocate only the required KV caches and activation batch size
- Add flags for batch sizes
- Const-correct interface: Span of const int.
- Also clean up the KVCache arg to a span.
- Move kPrefillBatchSize into RuntimeConfig and remove related global constants.
PiperOrigin-RevId: 655893197
2024-07-25 03:28:55 -07:00
Daniel Keysers
2346b5a434
Minor polishing: adding comments, renaming variables.
...
PiperOrigin-RevId: 655235006
2024-07-23 11:17:44 -07:00
Daniel Keysers
33334ad454
Fix msan uninitialized scale in optimize_test
...
PiperOrigin-RevId: 654817460
2024-07-22 10:50:25 -07:00
Jan Wassenberg
85cac13fb1
Split up ops.h into ops/ops-inl and matmul-inl
...
PiperOrigin-RevId: 654068303
2024-07-19 11:21:48 -07:00
Jan Wassenberg
5844e6a1e5
Cleanup: add wrapper functions and rename vars to interleaved
...
Simplifies the TransformerLayer function.
Use interleaved* instead of _and_queries.
PiperOrigin-RevId: 653929449
2024-07-19 02:04:11 -07:00
Jan Wassenberg
12016d31c3
Major Prefill/Generate cleanup, 1.3x Prefill speedup
...
This fixes TTFT, which previously did not include prefill.
PiperOrigin-RevId: 653690626
2024-07-18 11:16:46 -07:00
Jan Wassenberg
3fe79b3876
Fix msan uninitialized scale
...
PiperOrigin-RevId: 653655471
2024-07-18 09:42:31 -07:00
Daniel Keysers
e87e65ca45
Add scale parameter to MatMul.
...
Add accessor to CompressedArray that asserts the scale is 1 and use it.
PiperOrigin-RevId: 653604840
2024-07-18 06:58:56 -07:00
Daniel Keysers
5a751a9a44
Update gemma-27b to the correct query scaling.
...
PiperOrigin-RevId: 653201646
2024-07-17 05:43:52 -07:00
Jan Wassenberg
992a2cbbc0
De-templatize Activations, add RowVectorBatch class
...
Also remove most kBatchSize args.
PiperOrigin-RevId: 653185525
2024-07-17 04:38:15 -07:00
Daniel Keysers
ff34370aac
Simplify FFW by using MatMul_4x4_Batch_Add.
...
Affects only the griffin model, where prefill TPS improves by about 70%.
PiperOrigin-RevId: 652878176
2024-07-16 09:41:23 -07:00
Jan Wassenberg
cd530374b3
Further 1.02x prefill speedup from batch 64->512
...
Measured on SKX. Larger speedup expected for Zen4/SPR.
PiperOrigin-RevId: 652472928
2024-07-15 07:26:10 -07:00
The gemma.cpp Authors
c879133a5a
Increase the prefill batch size to 64.
...
PiperOrigin-RevId: 651754772
2024-07-12 06:28:37 -07:00
The gemma.cpp Authors
df3fb70802
Improve readability with RepeatedAttentionWindowSizes
...
PiperOrigin-RevId: 651431738
2024-07-11 09:11:46 -07:00
Jan Wassenberg
edaf61b983
SVE build fix: avoid capturing vectors directly.
...
Also use more V typedef instead of auto.
PiperOrigin-RevId: 651423685
2024-07-11 08:43:56 -07:00
Jan Wassenberg
be765afce2
Simplify matmul: only 2 overloads
...
Also add StoreHorizontalSumsMaybeAdd wrapper function,
move MatMulSlowBatch into test.
1.02-1.06x speedup.
PiperOrigin-RevId: 651394791
2024-07-11 06:58:42 -07:00
Andrey Vlasov
3e92088595
Remove allocation from GEMM_4x4_Tile when decoding compressed weights by implementing
...
SfpCodec::Dec2F and CompressTraits<T>::Decompress2 for all supported types. This also allows removing one of the specializations of GEMM_4x4_Tile, so that compressed MatB is handled by a single function. As before, computations use 32-bit registers even when MatA is bf16.
Measurements for a 2b-it sfp-encoded model on an AMD Ryzen Threadripper PRO 3945WX 12-Cores:
baseline:
```
32.6254 prefill tokens / sec
8.91429 tokens / sec
115 milliseconds time to first token
```
this change:
```
54.3045 prefill tokens / sec
16.8191 tokens / sec
56 milliseconds time to first token
```
PiperOrigin-RevId: 651369694
2024-07-11 05:13:39 -07:00
Kan Wu
f519ab6693
Refactor configurables.
...
PiperOrigin-RevId: 651259154
2024-07-10 21:30:58 -07:00
Andrey Vlasov
960ff4b4ec
Record time measurements in MatMul tests.
...
PiperOrigin-RevId: 651060711
2024-07-10 10:04:40 -07:00
Daniel Keysers
063bbaa683
Add more comments to attention computation (and some small restructuring).
...
PiperOrigin-RevId: 650929097
2024-07-10 02:39:07 -07:00
Jan Wassenberg
6a3f7cf3ea
Lint fix - string append, remove stale TODO
...
PiperOrigin-RevId: 650197468
2024-07-08 04:11:21 -07:00
Jan Wassenberg
cbb67b4ee0
Move benchmark_helper to evals/, weights_raw to compression/.
...
PiperOrigin-RevId: 650155983
2024-07-08 01:13:23 -07:00
Jan Wassenberg
438b1bace2
Fix handling of %c and %q if eot_string. Fixes #283, thanks @ljcucc
...
PiperOrigin-RevId: 649651535
2024-07-05 07:54:00 -07:00
Jan Wassenberg
118e802b00
Fix gemma_test - moved to evals/.
...
PiperOrigin-RevId: 649338633
2024-07-04 02:04:05 -07:00
Jan Wassenberg
c7c3daa624
7x compile time speedup: shard gemma.cc
...
Use overloaded functions defined in gemma/instantiations.
Also split out activations.h.
PiperOrigin-RevId: 649053122
2024-07-03 06:35:04 -07:00