Commit Graph

480 Commits

Author SHA1 Message Date
Daniel Keysers a8e08778d4 Add an additional QueryModel() overload to GemmaEnv.
Use args only in GemmaEnv constructor, store everything else in RuntimeConfig.
Add runtime option to turn off thread spinning.

PiperOrigin-RevId: 670467320
2024-09-03 02:25:19 -07:00
Zoltan Szabadka f6abbab3a4 Fix asan failure in local attention computation.
PiperOrigin-RevId: 670207380
2024-09-02 07:06:10 -07:00
Paul Chang 22d9476aad Demonstrate constrained decoding in gemma_cpp's hello world example
PiperOrigin-RevId: 669327521
2024-08-30 08:03:07 -07:00
Jan Wassenberg 4033ed9e78 Avoid duplication of RMSNorm, support all activation/weight types
Add test for RMSNorm
Rename VectorizedRopeAndMulBy -> RopeAndMulBy

Move test_util to util/

PiperOrigin-RevId: 668332927
2024-08-28 01:26:55 -07:00
Daniel Keysers 3c17911875 Make gemma_test slightly more lenient on MultiTurn.
PiperOrigin-RevId: 668097277
2024-08-27 12:40:16 -07:00
Jan Wassenberg 2308514e5a Experiment with compensated dot product.
ULP difference vs exact is 0..1, vs 200-5000 for previous.
Runtime overhead is 2.5-4x for f32 input.

PiperOrigin-RevId: 668084019
2024-08-27 12:05:35 -07:00
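
Note on 2308514e5a: the commit message does not say which compensation scheme is used. As an illustration only, the sketch below shows one well-known variant, the Dot2 algorithm of Ogita, Rump and Oishi, in scalar f32. The names `TwoSum`, `TwoProd`, `CompensatedDot` and `NaiveDot` are hypothetical and not taken from gemma.cpp. It relies on strict IEEE arithmetic, so it must not be built with `-ffast-math`.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Error-free sum (Knuth): s + err equals a + b exactly.
inline void TwoSum(float a, float b, float& s, float& err) {
  s = a + b;
  const float bb = s - a;
  err = (a - (s - bb)) + (b - bb);
}

// Error-free product via FMA: p + err equals a * b exactly.
inline void TwoProd(float a, float b, float& p, float& err) {
  p = a * b;
  err = std::fma(a, b, -p);
}

// Plain accumulation, for comparison.
float NaiveDot(const std::vector<float>& x, const std::vector<float>& y) {
  assert(x.size() == y.size());
  float sum = 0.0f;
  for (size_t i = 0; i < x.size(); ++i) sum += x[i] * y[i];
  return sum;
}

// Dot2-style compensated dot product (assumed variant, see note above):
// the rounding errors of each product and each addition are accumulated
// separately and added back once at the end.
float CompensatedDot(const std::vector<float>& x, const std::vector<float>& y) {
  assert(x.size() == y.size());
  float sum = 0.0f;   // running rounded sum
  float comp = 0.0f;  // accumulated rounding errors
  for (size_t i = 0; i < x.size(); ++i) {
    float p, err_prod, new_sum, err_sum;
    TwoProd(x[i], y[i], p, err_prod);
    TwoSum(sum, p, new_sum, err_sum);
    sum = new_sum;
    comp += err_prod + err_sum;
  }
  return sum + comp;
}
```

With inputs such as `{1e8f, 1.0f, -1e8f}` against a vector of ones, the naive loop loses the `1.0f` to rounding while the compensated version recovers it. The extra operations per element are in line with the 2.5-4x overhead the commit reports, although the actual implementation may differ.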
Jan Wassenberg b6d0ca8a14 Minor followup: remainder handling is a single iteration
Also add profiler annotations.

PiperOrigin-RevId: 667883774
2024-08-27 01:19:44 -07:00
Jan Wassenberg c4303cd89b Fix test for 2b - update prompt
PiperOrigin-RevId: 667878053
2024-08-27 00:56:47 -07:00
Apoorv Reddy 48d0801fb0 Vectorize Rope for qkv dim not evenly divisible by number of lanes.
PiperOrigin-RevId: 665776602
2024-08-21 02:22:22 -07:00
Daniel Keysers 18e6012872 Fix prefill for batched queries.
With this fix, gemma_test/GeographyBatched now also passes for gemma2-27B.

PiperOrigin-RevId: 664827485
2024-08-19 08:50:42 -07:00
Apoorv Reddy c6eb3b6f0d VectorizedRopeAndMulBy.
~8x reduction in Rope (tested on a few prompts).
~3.8% prefill latency improvement.
~2.6% decode latency improvement.

PiperOrigin-RevId: 664650108
2024-08-18 23:17:01 -07:00
Paul Chang 773333e5be Expose underlying model configuration: number of layers, heads, etc.
PiperOrigin-RevId: 663747853
2024-08-16 09:03:24 -07:00
Jan Wassenberg 301dc8067a Major MatMul update, 1.9-2.3x speedup on Zen4 via bf16 mul
Supports converting all weight/activation formats to native MulT (bf16/f32)

Also:
- ConstMat/MutableMat for const correctness
- Move RowVectorBatch to allocator.h so it can be used from Matmul
- Add matmul.h so MatMulEnv can be used from Activations
- Remove kMaxThreads, detect from PerClusterPools
- Build fix: -inl.h files must be textual_hdrs, and highway.h should precede -inl.h

```
zen4 new
64, 24576, 3072, add=0, MatTA=bf16, MatTB=sfp:   616.6 GFLOPS.
64, 3072, 24576, add=0, MatTA=bf16, MatTB=sfp:   460.7 GFLOPS.
64, 24576, 3072, add=0, MatTA=f32, MatTB=sfp:    598.6 GFLOPS.
64, 3072, 24576, add=0, MatTA=f32, MatTB=sfp:    435.6 GFLOPS.

zen4 old
64, 24576, 3072, add=0, MatTA=f32, MatTB=sfp:    257.5 GFLOPS.
64, 3072, 24576, add=0, MatTA=f32, MatTB=sfp:    231.9 GFLOPS.
```

PiperOrigin-RevId: 663729812
2024-08-16 07:52:20 -07:00
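
Note on 301dc8067a: the headline range of 1.9-2.3x appears to correspond to the like-for-like rows (MatTA=f32, MatTB=sfp) of the Zen4 table in the commit message. A quick check of that reading, using only the GFLOPS figures quoted there:

```latex
\[
\frac{598.6}{257.5} \approx 2.32
\qquad
\frac{435.6}{231.9} \approx 1.88
\]
```

The first ratio is for the 64 x 24576 x 3072 shape and the second for 64 x 3072 x 24576. The "old" table has no bf16 rows, so no like-for-like ratio is available for MatTA=bf16.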
The gemma.cpp Authors 6c57feb52f Automated Code Change
PiperOrigin-RevId: 663622838
2024-08-16 00:01:24 -07:00
Paul Chang b9ed12a325 Support directly observing activations, partially replacing LayersOutputFunc
LayersOutputFunc is no longer invoked for "blocks" and "final_norm" outputs.
Instead, we directly expose the Activations structure.

PiperOrigin-RevId: 663409316
2024-08-15 12:39:07 -07:00
Jan Wassenberg 22995c699d Simplify pos handling, auto-increment output arg
- no longer multiply by num_queries
- remove unused interleaved prompts
- Rename to Queries*
- Rename batch_start/interleaved_pos/pos to queries_pos

PiperOrigin-RevId: 663331823
2024-08-15 09:25:26 -07:00
Copybara-Service 6763afcd1c Merge pull request #348 from ufownl:feature/start_pos_per_query_reopen
PiperOrigin-RevId: 662533529
2024-08-13 08:51:06 -07:00
RangerUFO 8c634f6486 Fix the position calculation issue in the generation phase 2024-08-12 18:50:23 +02:00
RangerUFO ea72575e56 Fix build issues when tests are enabled 2024-08-12 18:50:23 +02:00
RangerUFO 730b6bfc94 Implement `start_pos` per query for batch interface 2024-08-12 18:50:23 +02:00
Jan Wassenberg 8e028632f7 0.98x prefill: refactor in prep for cache blocking.
Slower because we now init tiles of C and accumulate into them.

Also remove unused var in optimize_test and use BF16 typedef.

PiperOrigin-RevId: 662115916
2024-08-12 09:26:29 -07:00
Daniel Keysers 7316ee8f96 Fix gemma_test GeographyBatched for 2b-it and add entropy expectations for gemma2-2b-it.
PiperOrigin-RevId: 662072395
2024-08-12 07:12:46 -07:00
Jan Wassenberg b831fa8482 1.3x prefill, 0.95x decode: matmul replacing last matvec
Before 38.28, 9.17 (with profiler enabled, prompt = 330 tok)
```
Gen.FFW                                 :      15414 x         4692352 = 24.166318
Gen.Attention.SumHeads                  :      15414 x         1394804 =  7.183451 !!
Gen.Embedding                           :        361 x        49961894 =  6.026297
Gen.Attention.QKV                       :      15414 x         1005125 =  5.176546
Gen.Attention.DotSoftmax                :      15414 x          885480 =  4.560357
RopeAndMulBy                            :     696528 x           11867 =  2.761818
```

After 49.80, 8.68
```
Gen.FFW                                 :      14448 x         5312783 = 25.646868
Gen.Embedding                           :        338 x        63044815 =  7.119845
Gen.Attention.QKV                       :      14448 x         1115003 =  5.382557
Gen.Attention.DotSoftmax                :      14448 x          897577 =  4.332957
RopeAndMulBy                            :     673344 x           11886 =  2.674156
Gen.Attention.SumHeads                  :      14448 x          518291 =  2.501993 !!
```
PiperOrigin-RevId: 662024085
2024-08-12 03:36:01 -07:00
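
Note on b831fa8482: assuming the "Before" and "After" pairs are prefill and decode throughput in that order (the message does not state the unit), the headline factors follow directly from the quoted numbers:

```latex
\[
\frac{49.80}{38.28} \approx 1.30 \quad \text{(prefill)}
\qquad
\frac{8.68}{9.17} \approx 0.95 \quad \text{(decode)}
\]
```

The rows marked `!!` show where the change lands: the last column for Gen.Attention.SumHeads drops from 7.183451 to 2.501993.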
Jan Wassenberg 282f73ec2f Add pin flag to disable pinning. Refs #338
PiperOrigin-RevId: 661389171
2024-08-09 13:47:12 -07:00
Apoorv Reddy fd1b0743a7 Rename Gemma9B and Gemma27B to Gemma2_9B and Gemma2_27B.
This is to make it clear that these models are part of the Gemma2 family of models.

PiperOrigin-RevId: 661181682
2024-08-09 02:09:06 -07:00
Jan Wassenberg 2ebbe4076f 1.03-1.08x decode speedup: precompute Rope theta, fuse
Split attention into functions, move into class.
Fuse Rope and MulBy, allow non-in-place version to avoid copy from q to KV.
Sink if() into MaybeLogitsSoftCap.

PiperOrigin-RevId: 661168418
2024-08-09 01:23:24 -07:00
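
Note on 2ebbe4076f: as a rough scalar illustration of "precompute Rope theta" and "fuse Rope and MulBy", the sketch below computes per-pair inverse timescales once and then rotates and scales in a single pass. It is not the gemma.cpp implementation. The function names, the base of 10000, and the pairing of element `i` with element `i + dim/2` are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Inverse timescales depend only on the head dimension, so they can be
// computed once and reused for every position (assumed base: 10000).
std::vector<float> InvTimescales(size_t dim_qkv) {
  assert(dim_qkv % 2 == 0);
  const size_t half = dim_qkv / 2;
  std::vector<float> inv(half);
  for (size_t i = 0; i < half; ++i) {
    const float exponent =
        static_cast<float>(2 * i) / static_cast<float>(dim_qkv);
    inv[i] = 1.0f / std::pow(10000.0f, exponent);
  }
  return inv;
}

// Hypothetical fused helper: rotates each pair (x[i], x[i + half]) by the
// angle pos * inv[i] and multiplies the result by `mul` in the same pass.
void RopeAndMulBySketch(float mul, std::vector<float>& x,
                        const std::vector<float>& inv, size_t pos) {
  const size_t half = x.size() / 2;
  assert(x.size() % 2 == 0 && inv.size() == half);
  for (size_t i = 0; i < half; ++i) {
    const float theta = static_cast<float>(pos) * inv[i];
    const float c = std::cos(theta);
    const float s = std::sin(theta);
    const float x0 = x[i];
    const float x1 = x[i + half];
    x[i] = mul * (x0 * c - x1 * s);
    x[i + half] = mul * (x0 * s + x1 * c);
  }
}
```

At position 0 every angle is zero, so the call reduces to a plain multiplication by `mul`. At other positions the rotation preserves the vector norm up to rounding when `mul` is 1.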
The gemma.cpp Authors 27258b03e6 Improve performance logging
PiperOrigin-RevId: 660534330
2024-08-07 14:15:43 -07:00
Jan Wassenberg 4154f5a910 Document Gemma 2 model names
PiperOrigin-RevId: 659858832
2024-08-06 01:44:15 -07:00
Jan Wassenberg 5e433e774a 1.1x prefill speedup, revamp threading in preparation for hierarchical parallelism.
Limit thread counts to detected. Add max_clusters arg.
Update detection logic to check for smt0 - previously we pinned to some siblings.

PiperOrigin-RevId: 659755311
2024-08-05 18:50:09 -07:00
Jan Wassenberg 1617e1a33d SFP speedup: 1.14x f32, 1.19x bf16 dot = 1.02x prefill
12->9 ops by recognizing the upper/lower bytes are simply shifted.

PiperOrigin-RevId: 659609241
2024-08-05 10:59:13 -07:00
Phil Culliton 1982a6ba00 Internal change
PiperOrigin-RevId: 657831926
2024-07-30 20:24:54 -07:00
Jan Wassenberg a24eda8d02 Split matmul into matvec; add large matrix benchmark
Rename var names to row/col for more clarity.
Better estimate error tolerance via max abs col sum.

PiperOrigin-RevId: 657601791
2024-07-30 08:29:11 -07:00
Paul Chang d37c088e44 Extend LayersOutputFunc to take query index and auxiliary int
PiperOrigin-RevId: 657574814
2024-07-30 06:53:56 -07:00
Jan Wassenberg 8b4915f321 Fix Windows build - macro conflict with param name
PiperOrigin-RevId: 657518587
2024-07-30 03:22:32 -07:00
Jan Wassenberg 6ea4232b2e MatMul cleanup: Mat struct, simplify args.
Add large benchmark to test, use 4 threads, skip some targets.
Also use Traits::Name instead of typeid.

PiperOrigin-RevId: 657496185
2024-07-30 01:55:50 -07:00
Thomas Fischbacher d9f86f8e4d Add Python code for converting Griffin Orbax weights. Refs #301
PiperOrigin-RevId: 657296255
2024-07-29 12:53:30 -07:00
Jan Wassenberg f27683152c 1.05x prefill speedup: matvec -> matmul for !MHA
Also add C_stride and make shape normal non-template arguments.

PiperOrigin-RevId: 657285945
2024-07-29 12:18:06 -07:00
Jan Wassenberg 2721f54446 Add offset arg to MatMul, rename, Matmul for logits = ~1.1x decode speedup
PiperOrigin-RevId: 657167257
2024-07-29 05:34:26 -07:00
Jan Wassenberg aaf51898b6 Major revamp #2 of Prefill: fix token order, parallel for multi-query
- Allocate only the required KV caches and activation batch size
- Add flags for batch sizes
- Const-correct interface: Span of const int.
- Also clean up the KVCache arg to a span.
- Move kPrefillBatchSize into RuntimeConfig and remove related global constants.

PiperOrigin-RevId: 655893197
2024-07-25 03:28:55 -07:00
The gemma.cpp Authors c1f243c351 Fix setting scales in Py binding
PiperOrigin-RevId: 655284183
2024-07-23 13:32:50 -07:00
Daniel Keysers 2346b5a434 Minor polishing: adding comments, renaming variables.
PiperOrigin-RevId: 655235006
2024-07-23 11:17:44 -07:00
Daniel Keysers 33334ad454 Fix msan uninitialized scale in optimize_test
PiperOrigin-RevId: 654817460
2024-07-22 10:50:25 -07:00
The gemma.cpp Authors 74a6dc8f33 Use all CPU sockets when pinning threads to cores
PiperOrigin-RevId: 654800375
2024-07-22 10:09:16 -07:00
Jan Wassenberg 85cac13fb1 Split up ops.h into ops/ops-inl and matmul-inl
PiperOrigin-RevId: 654068303
2024-07-19 11:21:48 -07:00
Jan Wassenberg 5844e6a1e5 Cleanup: add wrapper functions and rename vars to interleaved
Simplifies the TransformerLayer function.
Use interleaved* instead of _and_queries.

PiperOrigin-RevId: 653929449
2024-07-19 02:04:11 -07:00
Jan Wassenberg 12016d31c3 Major Prefill/Generate cleanup, 1.3x Prefill speedup
This fixes TTFT, which was not including prefill.

PiperOrigin-RevId: 653690626
2024-07-18 11:16:46 -07:00
Jan Wassenberg 3fe79b3876 Fix msan uninitialized scale
PiperOrigin-RevId: 653655471
2024-07-18 09:42:31 -07:00
Daniel Keysers e87e65ca45 Add scale parameter to MatMul.
Add accessor to CompressedArray that asserts the scale is 1 and use it.

PiperOrigin-RevId: 653604840
2024-07-18 06:58:56 -07:00
Daniel Keysers 5a751a9a44 Update gemma-27b to the correct query scaling.
PiperOrigin-RevId: 653201646
2024-07-17 05:43:52 -07:00
Jan Wassenberg 992a2cbbc0 De-templatize Activations, add RowVectorBatch class
Also remove most kBatchSize args.

PiperOrigin-RevId: 653185525
2024-07-17 04:38:15 -07:00