621 Commits

Author SHA1 Message Date
Daniel Keysers 18e6012872 Fix prefill for batched queries.
This lets gemma_test/GeographyBatched pass now also for gemma2-27B.

PiperOrigin-RevId: 664827485
2024-08-19 08:50:42 -07:00
Apoorv Reddy c6eb3b6f0d VectorizedRopeAndMulBy.
~8x reduction in Rope (tested on a few prompts).
~3.8% prefill latency improvement.
~2.6% decode latency improvement.

PiperOrigin-RevId: 664650108
2024-08-18 23:17:01 -07:00
Paul Chang 773333e5be Expose underlying model configuration: number of layers, heads, etc.
PiperOrigin-RevId: 663747853
2024-08-16 09:03:24 -07:00
Jan Wassenberg 301dc8067a Major MatMul update, 1.9-2.3x speedup on Zen4 via bf16 mul
Supports converting all weight/activation formats to native MulT (bf16/f32)

Also:
- ConstMat/MutableMat for const correctness
- Move RowVectorBatch to allocator.h so it can be used from Matmul
- Add matmul.h so MatMulEnv can be used from Activations
- Remove kMaxThreads, detect from PerClusterPools
- Build fix: -inl.h files must be textual_hdrs, and highway.h should precede -inl.h

```
zen4 new
64, 24576, 3072, add=0, MatTA=bf16, MatTB=sfp:   616.6 GFLOPS.
64, 3072, 24576, add=0, MatTA=bf16, MatTB=sfp:   460.7 GFLOPS.
64, 24576, 3072, add=0, MatTA=f32, MatTB=sfp:    598.6 GFLOPS.
64, 3072, 24576, add=0, MatTA=f32, MatTB=sfp:    435.6 GFLOPS.

zen4 old
64, 24576, 3072, add=0, MatTA=f32, MatTB=sfp:    257.5 GFLOPS.
64, 3072, 24576, add=0, MatTA=f32, MatTB=sfp:    231.9 GFLOPS.
```
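
The range in the subject line can be checked against the f32/sfp rows, the only configuration listed for both the new and old runs. A minimal sketch of that arithmetic; `Speedup` is a hypothetical helper for illustration, not part of the codebase:

```cpp
#include <cassert>

// Hypothetical helper (not from the repository): ratio of new to old
// throughput, both given in GFLOPS as printed by the benchmark.
double Speedup(double new_gflops, double old_gflops) {
  assert(old_gflops > 0.0);
  return new_gflops / old_gflops;
}
```

With the rows above, `Speedup(598.6, 257.5)` is about 2.32 and `Speedup(435.6, 231.9)` is about 1.88, consistent with the 1.9-2.3x quoted in the subject.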

PiperOrigin-RevId: 663729812
2024-08-16 07:52:20 -07:00
The gemma.cpp Authors 6c57feb52f Automated Code Change
PiperOrigin-RevId: 663622838
2024-08-16 00:01:24 -07:00
Paul Chang b9ed12a325 Support directly observing activations, partially replacing LayersOutputFunc
LayersOutputFunc is no longer invoked for "blocks" and "final_norm" outputs.
Instead, we directly expose the Activations structure.

PiperOrigin-RevId: 663409316
2024-08-15 12:39:07 -07:00
Jan Wassenberg 22995c699d Simplify pos handling, auto-increment output arg
- no longer multiply by num_queries
- remove unused interleaved prompts
- Rename to Queries*
- Rename batch_start/interleaved_pos/pos to queries_pos

PiperOrigin-RevId: 663331823
2024-08-15 09:25:26 -07:00
Copybara-Service 6763afcd1c Merge pull request #348 from ufownl:feature/start_pos_per_query_reopen
PiperOrigin-RevId: 662533529
2024-08-13 08:51:06 -07:00
RangerUFO 8c634f6486 Fix the position calculation issue in the generation phase
2024-08-12 18:50:23 +02:00
RangerUFO ea72575e56 Fix build issues when tests are enabled
2024-08-12 18:50:23 +02:00
RangerUFO 730b6bfc94 Implement `start_pos` per query for batch interface
2024-08-12 18:50:23 +02:00
Jan Wassenberg 8e028632f7 0.98x prefill: refactor in prep for cache blocking.
Slower because we now init tiles of C and accumulate into them.

Also remove unused var in optimize_test and use BF16 typedef.

PiperOrigin-RevId: 662115916
2024-08-12 09:26:29 -07:00
Daniel Keysers 7316ee8f96 Fix gemma_test GeographyBatched for 2b-it and add entropy expectations for gemma2-2b-it.
PiperOrigin-RevId: 662072395
2024-08-12 07:12:46 -07:00
Jan Wassenberg b831fa8482 1.3x prefill, 0.95x decode: matmul replacing last matvec
Before 38.28, 9.17 (with profiler enabled, prompt = 330 tok)
```
Gen.FFW                                 :      15414 x         4692352 = 24.166318
Gen.Attention.SumHeads                  :      15414 x         1394804 =  7.183451 !!
Gen.Embedding                           :        361 x        49961894 =  6.026297
Gen.Attention.QKV                       :      15414 x         1005125 =  5.176546
Gen.Attention.DotSoftmax                :      15414 x          885480 =  4.560357
RopeAndMulBy                            :     696528 x           11867 =  2.761818
```

After 49.80, 8.68
```
Gen.FFW                                 :      14448 x         5312783 = 25.646868
Gen.Embedding                           :        338 x        63044815 =  7.119845
Gen.Attention.QKV                       :      14448 x         1115003 =  5.382557
Gen.Attention.DotSoftmax                :      14448 x          897577 =  4.332957
RopeAndMulBy                            :     673344 x           11886 =  2.674156
Gen.Attention.SumHeads                  :      14448 x          518291 =  2.501993 !!
```
PiperOrigin-RevId: 662024085
2024-08-12 03:36:01 -07:00
Jan Wassenberg 282f73ec2f Add pin flag to disable pinning. Refs #338
PiperOrigin-RevId: 661389171
2024-08-09 13:47:12 -07:00
Apoorv Reddy fd1b0743a7 Rename Gemma9B and Gemma27B to Gemma2_9B and Gemma2_27B.
This is to make it clear that these models are part of the Gemma2 family of models.

PiperOrigin-RevId: 661181682
2024-08-09 02:09:06 -07:00
Jan Wassenberg 2ebbe4076f 1.03-1.08x decode speedup: precompute Rope theta, fuse
Split attention into functions, move into class.
Fuse Rope and MulBy, allow non-in-place version to avoid copy from q to KV.
Sink if() into MaybeLogitsSoftCap.
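
The idea of precomputing the Rope angles and fusing the rotation with the scalar multiply can be sketched in scalar code as follows. This is an illustration under assumptions, not the repository's implementation: the names `PrecomputeInvTimescale` and `RopeAndMulBySketch`, the base of 10000, and the pairing of element `d` with element `d + dim/2` are all hypothetical choices made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: the inverse timescales depend only on the dimension
// index, so they can be computed once instead of once per token.
std::vector<float> PrecomputeInvTimescale(size_t dim, float base = 10000.0f) {
  assert(dim % 2 == 0);
  const size_t half = dim / 2;
  std::vector<float> inv_timescale(half);
  for (size_t d = 0; d < half; ++d) {
    const float exponent = static_cast<float>(2 * d) / static_cast<float>(dim);
    inv_timescale[d] = 1.0f / std::pow(base, exponent);
  }
  return inv_timescale;
}

// Hypothetical fused kernel: rotates each pair (x[d], x[d + half]) by the
// angle pos * inv_timescale[d] and scales by `mul` in the same pass.
// `out` may alias `x` (in-place) or point elsewhere (non-in-place), because
// both inputs of a pair are read before either output is written.
void RopeAndMulBySketch(float mul, const float* x, size_t dim,
                        const std::vector<float>& inv_timescale, int pos,
                        float* out) {
  const size_t half = dim / 2;
  assert(inv_timescale.size() == half);
  for (size_t d = 0; d < half; ++d) {
    const float theta = static_cast<float>(pos) * inv_timescale[d];
    const float c = std::cos(theta);
    const float s = std::sin(theta);
    const float x0 = x[d];
    const float x1 = x[d + half];
    out[d] = mul * (x0 * c - x1 * s);
    out[d + half] = mul * (x0 * s + x1 * c);
  }
}
```

Writing to a separate `out` pointer is what lets a caller place the rotated vector directly at its destination rather than rotating in place and copying afterwards.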

PiperOrigin-RevId: 661168418
2024-08-09 01:23:24 -07:00
The gemma.cpp Authors 27258b03e6 Improve performance logging
PiperOrigin-RevId: 660534330
2024-08-07 14:15:43 -07:00
Jan Wassenberg 4154f5a910 Document Gemma 2 model names
PiperOrigin-RevId: 659858832
2024-08-06 01:44:15 -07:00
Jan Wassenberg 5e433e774a 1.1x prefill speedup, revamp threading in preparation for hierarchical parallelism.
Limit thread counts to detected. Add max_clusters arg.
Update detection logic to check for smt0 - previously we pinned to some siblings.

PiperOrigin-RevId: 659755311
2024-08-05 18:50:09 -07:00
Jan Wassenberg 1617e1a33d SFP speedup: 1.14x f32, 1.19x bf16 dot = 1.02x prefill
12->9 ops by recognizing the upper/lower bytes are simply shifted.

PiperOrigin-RevId: 659609241
2024-08-05 10:59:13 -07:00
Phil Culliton 1982a6ba00 Internal change
PiperOrigin-RevId: 657831926
2024-07-30 20:24:54 -07:00
Jan Wassenberg a24eda8d02 Split matmul into matvec; add large matrix benchmark
Rename var names to row/col for more clarity.
Better estimate error tolerance via max abs col sum.
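
The maximum absolute column sum is the matrix 1-norm. A minimal sketch of computing it for a row-major matrix follows; how the test turns this into a tolerance is not stated here, so `ToleranceSketch` and its `eps` parameter are purely hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Largest sum of absolute values over any column of a row-major matrix.
float MaxAbsColSum(const std::vector<float>& mat, size_t rows, size_t cols) {
  assert(mat.size() == rows * cols);
  float max_sum = 0.0f;
  for (size_t c = 0; c < cols; ++c) {
    float sum = 0.0f;
    for (size_t r = 0; r < rows; ++r) {
      sum += std::fabs(mat[r * cols + c]);
    }
    max_sum = std::max(max_sum, sum);
  }
  return max_sum;
}

// Hypothetical: scale an assumed per-element error `eps` by the norm, so that
// matrices with larger entries are allowed a proportionally larger error.
float ToleranceSketch(const std::vector<float>& mat, size_t rows, size_t cols,
                      float eps) {
  return eps * MaxAbsColSum(mat, rows, cols);
}
```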

PiperOrigin-RevId: 657601791
2024-07-30 08:29:11 -07:00
Paul Chang d37c088e44 Extend LayersOutputFunc to take query index and auxiliary int
PiperOrigin-RevId: 657574814
2024-07-30 06:53:56 -07:00
Jan Wassenberg 8b4915f321 Fix Windows build - macro conflict with param name
PiperOrigin-RevId: 657518587
2024-07-30 03:22:32 -07:00
Jan Wassenberg 6ea4232b2e MatMul cleanup: Mat struct, simplify args.
Add large benchmark to test, use 4 threads, skip some targets.
Also use Traits::Name instead of typeid.

PiperOrigin-RevId: 657496185
2024-07-30 01:55:50 -07:00
Thomas Fischbacher d9f86f8e4d Add Python code for converting Griffin Orbax weights. Refs #301
PiperOrigin-RevId: 657296255
2024-07-29 12:53:30 -07:00
Jan Wassenberg f27683152c 1.05x prefill speedup: matvec -> matmul for !MHA
Also add C_stride and make shape normal non-template arguments.

PiperOrigin-RevId: 657285945
2024-07-29 12:18:06 -07:00
Jan Wassenberg 2721f54446 Add offset arg to MatMul, rename, Matmul for logits = ~1.1x decode speedup
PiperOrigin-RevId: 657167257
2024-07-29 05:34:26 -07:00
Jan Wassenberg aaf51898b6 Major revamp #2 of Prefill: fix token order, parallel for multi-query
- Allocate only the required KV caches and activation batch size
- Add flags for batch sizes
- Const-correct interface: Span of const int.
- Also clean up the KVCache arg to a span.
- Move kPrefillBatchSize into RuntimeConfig and remove related global constants.

PiperOrigin-RevId: 655893197
2024-07-25 03:28:55 -07:00
The gemma.cpp Authors c1f243c351 Fix setting scales in Py binding
PiperOrigin-RevId: 655284183
2024-07-23 13:32:50 -07:00
Daniel Keysers 2346b5a434 Minor polishing: adding comments, renaming variables.
PiperOrigin-RevId: 655235006
2024-07-23 11:17:44 -07:00
Daniel Keysers 33334ad454 Fix msan uninitialized scale in optimize_test
PiperOrigin-RevId: 654817460
2024-07-22 10:50:25 -07:00
The gemma.cpp Authors 74a6dc8f33 Use all CPU sockets when pinning threads to cores
PiperOrigin-RevId: 654800375
2024-07-22 10:09:16 -07:00
Jan Wassenberg 85cac13fb1 Split up ops.h into ops/ops-inl and matmul-inl
PiperOrigin-RevId: 654068303
2024-07-19 11:21:48 -07:00
Jan Wassenberg 5844e6a1e5 Cleanup: add wrapper functions and rename vars to interleaved
Simplifies the TransformerLayer function.
Use interleaved* instead of _and_queries.

PiperOrigin-RevId: 653929449
2024-07-19 02:04:11 -07:00
Jan Wassenberg 12016d31c3 Major Prefill/Generate cleanup, 1.3x Prefill speedup
This fixes TTFT, which was not including prefill.

PiperOrigin-RevId: 653690626
2024-07-18 11:16:46 -07:00
Jan Wassenberg 3fe79b3876 Fix msan uninitialized scale
PiperOrigin-RevId: 653655471
2024-07-18 09:42:31 -07:00
Daniel Keysers e87e65ca45 Add scale parameter to MatMul.
Add accessor to CompressedArray that asserts the scale is 1 and use it.

PiperOrigin-RevId: 653604840
2024-07-18 06:58:56 -07:00
Daniel Keysers 5a751a9a44 Update gemma-27b to the correct query scaling.
PiperOrigin-RevId: 653201646
2024-07-17 05:43:52 -07:00
Jan Wassenberg 992a2cbbc0 De-templatize Activations, add RowVectorBatch class
Also remove most kBatchSize args.

PiperOrigin-RevId: 653185525
2024-07-17 04:38:15 -07:00
Daniel Keysers ff34370aac Simplify FFW by using MatMul_4x4_Batch_Add.
Affects only the griffin model, where prefill TPS improves by about 70%.

PiperOrigin-RevId: 652878176
2024-07-16 09:41:23 -07:00
Paul Chang 48b900b1b9 Fix examples/hello_world for real.
PiperOrigin-RevId: 652509319
2024-07-15 09:38:52 -07:00
Jan Wassenberg cd530374b3 Further 1.02x prefill speedup from batch 64->512
Measured on SKX. Larger speedup expected for Zen4/SPR.

PiperOrigin-RevId: 652472928
2024-07-15 07:26:10 -07:00
Paul Chang aaee666a1d Fix gemma_cpp/examples/hello_world build.
Include Bazel build rules, too.

PiperOrigin-RevId: 652469406
2024-07-15 07:11:01 -07:00
The gemma.cpp Authors c879133a5a Increase the prefill batch size to 64.
PiperOrigin-RevId: 651754772
2024-07-12 06:28:37 -07:00
The gemma.cpp Authors df3fb70802 Improve readability with RepeatedAttentionWindowSizes
PiperOrigin-RevId: 651431738
2024-07-11 09:11:46 -07:00
Jan Wassenberg edaf61b983 SVE build fix: avoid capturing vectors directly.
Also use more V typedef instead of auto.

PiperOrigin-RevId: 651423685
2024-07-11 08:43:56 -07:00
Jan Wassenberg be765afce2 Simplify matmul: only 2 overloads
Also add StoreHorizontalSumsMaybeAdd wrapper function,
move MatMulSlowBatch into test.

1.02-1.06x speedup.

PiperOrigin-RevId: 651394791
2024-07-11 06:58:42 -07:00
Andrey Vlasov 3e92088595 Remove allocation from GEMM_4x4_Tile when decoding compressed weights
Implements SfpCodec::Dec2F and CompressTraits<T>::Decompress2 for all supported types. This also allows removing one of the GEMM_4x4_Tile specializations, so a single function handles compressed MatB. As before, computations use 32-bit registers even when MatA is bf16.

Measurements for a 2b-it sfp-encoded model on an AMD Ryzen Threadripper PRO 3945WX 12-Cores:
baseline:
```
32.6254 prefill tokens / sec
8.91429 tokens / sec
115 milliseconds time to first token
```
this change:
```
54.3045 prefill tokens / sec
16.8191 tokens / sec
56 milliseconds time to first token
```
PiperOrigin-RevId: 651369694
2024-07-11 05:13:39 -07:00