Commit Graph

241 Commits

Author SHA1 Message Date
Krzysztof Ostrowski b3239bf509 Internal change.
PiperOrigin-RevId: 681530185
2024-10-02 11:33:06 -07:00
Jan Wassenberg 96d2ab7d31 Minor fix to profiler zone and add comment
PiperOrigin-RevId: 681350546
2024-10-02 01:37:50 -07:00
Jan Wassenberg 7d9fcda0d8 -467ms startup: parallel Reshape
Also split an Argmax helper out of Softmax, add comments;
add profiler zones and fix an IDE warning

PiperOrigin-RevId: 680954573
2024-10-01 04:11:35 -07:00
Jan Wassenberg 2d14d796e3 1.09x decode speedup for topk=1/temp0: fuse softmax and sample
PiperOrigin-RevId: 680589099
2024-09-30 08:37:41 -07:00
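For context on the fusion above: with top-k = 1 and temperature 0, softmax is monotonic, so the sampled token is simply the index of the largest logit and the exp/normalize pass can be skipped entirely. A minimal scalar sketch of that idea (hypothetical `SampleGreedy` helper, not the vectorized gemma.cpp code):

```cpp
#include <cstddef>
#include <vector>

// Greedy decoding (top-k = 1, temperature = 0): softmax is monotonic, so the
// most probable token is simply the argmax of the raw logits; exp() and the
// normalizing sum are never needed.
size_t SampleGreedy(const std::vector<float>& logits) {
  size_t best = 0;
  for (size_t i = 1; i < logits.size(); ++i) {
    if (logits[i] > logits[best]) best = i;
  }
  return best;
}
```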
Jan Wassenberg 897f902d28 Fix include order, required to build with profiler enabled
PiperOrigin-RevId: 680574177
2024-09-30 07:52:50 -07:00
Daniel Keysers 606427022c Fix compiler errors when trying to generate (unused) code for the ConfigNoVit struct.
PiperOrigin-RevId: 679049377
2024-09-26 01:55:26 -07:00
RangerUFO d1010337c3 Fix prefix-LM mode assertion 2024-09-25 22:22:28 +08:00
Daniel Keysers f8835fe4a4 Add support for PaliGemma Vision-LM (224x224) to gemma.cpp
See https://arxiv.org/abs/2407.07726 for a description of the model.
Because PaliGemma operates as a prefix-LM on the image+prompt, add support for that.

PiperOrigin-RevId: 677841119
2024-09-23 10:09:38 -07:00
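For context on the prefix-LM support above: PaliGemma treats the image tokens plus the text prompt as a prefix over which attention is bidirectional, while generated tokens remain causal. A simplified sketch of that attention rule (hypothetical `CanAttend` helper; the actual code tracks prefix lengths per query rather than building an explicit mask):

```cpp
#include <cstddef>

// Prefix-LM attention rule: keys inside the prefix (image + prompt tokens,
// positions [0, prefix_end)) are visible to every query, i.e. attention over
// the prefix is bidirectional; beyond the prefix the usual causal rule
// k_pos <= q_pos applies.
bool CanAttend(size_t q_pos, size_t k_pos, size_t prefix_end) {
  if (k_pos < prefix_end) return true;
  return k_pos <= q_pos;
}
```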
Jan Wassenberg 8c0a8834c1 Major compression update, arbitrary-len unpack + new Dot
Compression:
* Implement {any packed} x {bf16, f32} 'Load2' and DecompressAndZeroPad
* New compression test for all packed formats, add to GEMMA_TEST_FILES, remove from sfp/nuq_test
* Decompress->DecompressAndZeroPad, use PackedSpan for args with bounds checking
* NUQ: support arbitrary-length enc/dec
* New compression/shared, remove sfp.h and nuq.h
* Move Store2 into Traits and provide Compress2 wrapper
* Remove unused Decompress()-with-pool overload
* Simplify CompressedArrayLen, rename to CompressedArrayElements
* Remove unused DistortionStats b_l1_

Misc:
* Add compensated and Kahan dot, support any length
* Use same Dot function everywhere
* Move exact arithmetic functions into fp_arith
* use FloatPtr and MatPtr typedefs in tests; less stack usage
* Rename args to packed/raw
* Remove Traits::Name, instead TypeName<T>()
* Move kMaxSFP and kClusters/kGroupSize into Sfp/NuqStream
PiperOrigin-RevId: 672868468
2024-09-10 02:22:19 -07:00
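For context on the "compensated and Kahan dot" above: the accumulator carries a running error term, so accuracy no longer degrades with vector length. A scalar sketch under those assumptions (hypothetical `DotKahan`; the actual code is vectorized with Highway, and such code must be built without `-ffast-math`, which would optimize the compensation away):

```cpp
#include <cstddef>

// Dot product with Kahan (compensated) summation: `comp` tracks the rounding
// error of each addition, so accuracy does not degrade with vector length.
float DotKahan(const float* a, const float* b, size_t n) {
  float sum = 0.0f;
  float comp = 0.0f;  // running compensation for lost low-order bits
  for (size_t i = 0; i < n; ++i) {
    const float y = a[i] * b[i] - comp;
    const float t = sum + y;  // low-order bits of y are lost here...
    comp = (t - sum) - y;     // ...and recovered into comp for the next step
    sum = t;
  }
  return sum;
}
```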
Jan Wassenberg c29e9752c7 Refactor/cleanup, remove even_odd
* New compression/shared.h, remove sfp.h
* Remove unused DistortionStats b_l1_
* Move exact arithmetic functions into fp_arith
* Remove even_odd optimization for MatVec (mostly unused)
* use BF16 typedef more widely
* Add kMaxSFP constant

PiperOrigin-RevId: 670996386
2024-09-04 09:25:13 -07:00
Daniel Keysers a8e08778d4 Add an additional QueryModel() overload to GemmaEnv.
Use args only in GemmaEnv constructor, store everything else in RuntimeConfig.
Add runtime option to turn off thread spinning.

PiperOrigin-RevId: 670467320
2024-09-03 02:25:19 -07:00
Zoltan Szabadka f6abbab3a4 Fix asan failure in local attention computation.
PiperOrigin-RevId: 670207380
2024-09-02 07:06:10 -07:00
Jan Wassenberg 4033ed9e78 Avoid duplication of RMSNorm, support all activation/weight types
Add test for RMSNorm
Rename VectorizedRopeAndMulBy -> RopeAndMulBy

Move test_util to util/

PiperOrigin-RevId: 668332927
2024-08-28 01:26:55 -07:00
Daniel Keysers 18e6012872 Fix prefill for batched queries.
This lets gemma_test/GeographyBatched now also pass for gemma2-27B.

PiperOrigin-RevId: 664827485
2024-08-19 08:50:42 -07:00
Apoorv Reddy c6eb3b6f0d VectorizedRopeAndMulBy.
~8x reduction in Rope time (tested on a few prompts).
~3.8% prefill latency improvement.
~2.6% decode latency improvement.

PiperOrigin-RevId: 664650108
2024-08-18 23:17:01 -07:00
Paul Chang 773333e5be Expose underlying model configuration: number of layers, heads, etc.
PiperOrigin-RevId: 663747853
2024-08-16 09:03:24 -07:00
Jan Wassenberg 301dc8067a Major MatMul update, 1.9-2.3x speedup on Zen4 via bf16 mul
Supports converting all weight/activation formats to native MulT (bf16/f32)

Also:
- ConstMat/MutableMat for const correctness
- Move RowVectorBatch to allocator.h so it can be used from Matmul
- Add matmul.h so MatMulEnv can be used from Activations
- Remove kMaxThreads, detect from PerClusterPools
- Build fix: -inl.h files must be textual_hdrs, and highway.h should precede -inl.h

```
zen4 new
64, 24576, 3072, add=0, MatTA=bf16, MatTB=sfp:   616.6 GFLOPS.
64, 3072, 24576, add=0, MatTA=bf16, MatTB=sfp:   460.7 GFLOPS.
64, 24576, 3072, add=0, MatTA=f32, MatTB=sfp:    598.6 GFLOPS.
64, 3072, 24576, add=0, MatTA=f32, MatTB=sfp:    435.6 GFLOPS.

zen4 old
64, 24576, 3072, add=0, MatTA=f32, MatTB=sfp:    257.5 GFLOPS.
64, 3072, 24576, add=0, MatTA=f32, MatTB=sfp:    231.9 GFLOPS.
```

PiperOrigin-RevId: 663729812
2024-08-16 07:52:20 -07:00
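For context on the "bf16 mul" above: bfloat16 is just the upper 16 bits of an IEEE-754 float32, so conversion amounts to a shift plus round-to-nearest-even. A standalone sketch (hypothetical `F32ToBF16`/`BF16ToF32` helpers; gemma.cpp itself uses Highway's vectorized bf16 support rather than scalar code like this):

```cpp
#include <cstdint>
#include <cstring>

// bf16 keeps the sign, the full 8-bit exponent and the top 7 mantissa bits of
// an IEEE-754 float32. Conversion is therefore a right shift by 16, here with
// round-to-nearest-even. (NaN inputs are not special-cased in this sketch.)
uint16_t F32ToBF16(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  const uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);  // ties to even
  return static_cast<uint16_t>((bits + rounding) >> 16);
}

// Widening back to float32 is exact: just restore the low 16 bits as zero.
float BF16ToF32(uint16_t h) {
  const uint32_t bits = static_cast<uint32_t>(h) << 16;
  float f;
  std::memcpy(&f, &bits, sizeof(f));
  return f;
}
```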
Paul Chang b9ed12a325 Support directly observing activations, partially replacing LayersOutputFunc
LayersOutputFunc is no longer invoked for "blocks" and "final_norm" outputs.
Instead, we directly expose the Activations structure.

PiperOrigin-RevId: 663409316
2024-08-15 12:39:07 -07:00
Jan Wassenberg 22995c699d Simplify pos handling, auto-increment output arg
- no longer multiply by num_queries
- remove unused interleaved prompts
- Rename to Queries*
- Rename batch_start/interleaved_pos/pos to queries_pos

PiperOrigin-RevId: 663331823
2024-08-15 09:25:26 -07:00
Copybara-Service 6763afcd1c Merge pull request #348 from ufownl:feature/start_pos_per_query_reopen
PiperOrigin-RevId: 662533529
2024-08-13 08:51:06 -07:00
RangerUFO 8c634f6486 Fix the position calculation issue in the generation phase 2024-08-12 18:50:23 +02:00
RangerUFO 730b6bfc94 Implement `start_pos` per query for batch interface 2024-08-12 18:50:23 +02:00
Jan Wassenberg b831fa8482 1.3x prefill, 0.95x decode: matmul replacing last matvec
Before 38.28, 9.17 (with profiler enabled, prompt = 330 tok)
```
Gen.FFW                                 :      15414 x         4692352 = 24.166318
Gen.Attention.SumHeads                  :      15414 x         1394804 =  7.183451 !!
Gen.Embedding                           :        361 x        49961894 =  6.026297
Gen.Attention.QKV                       :      15414 x         1005125 =  5.176546
Gen.Attention.DotSoftmax                :      15414 x          885480 =  4.560357
RopeAndMulBy                            :     696528 x           11867 =  2.761818
```

After 49.80, 8.68
```
Gen.FFW                                 :      14448 x         5312783 = 25.646868
Gen.Embedding                           :        338 x        63044815 =  7.119845
Gen.Attention.QKV                       :      14448 x         1115003 =  5.382557
Gen.Attention.DotSoftmax                :      14448 x          897577 =  4.332957
RopeAndMulBy                            :     673344 x           11886 =  2.674156
Gen.Attention.SumHeads                  :      14448 x          518291 =  2.501993 !!
```
PiperOrigin-RevId: 662024085
2024-08-12 03:36:01 -07:00
Jan Wassenberg 282f73ec2f Add pin flag to disable pinning. Refs #338
PiperOrigin-RevId: 661389171
2024-08-09 13:47:12 -07:00
Apoorv Reddy fd1b0743a7 Rename Gemma9B and Gemma27B to Gemma2_9B and Gemma2_27B.
This is to make it clear that these models are part of the Gemma2 family of models.

PiperOrigin-RevId: 661181682
2024-08-09 02:09:06 -07:00
Jan Wassenberg 2ebbe4076f 1.03-1.08x decode speedup: precompute Rope theta, fuse
Split attention into functions, move into class.
Fuse Rope and MulBy, allow non-in-place version to avoid copy from q to KV.
Sink if() into MaybeLogitsSoftCap.

PiperOrigin-RevId: 661168418
2024-08-09 01:23:24 -07:00
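For context on the RoPE precompute above: the per-pair frequencies ("theta") depend only on the dimension, so they can be computed once instead of calling pow() per token, and the MulBy scaling can be folded into the same rotation pass. A scalar sketch assuming the split-half pairing convention (hypothetical `RopeInvFreqs`/`RopeAndScale` helpers; the real RopeAndMulBy is vectorized and its exact layout may differ):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Per-pair RoPE frequencies ("theta") depend only on the dimension, so they
// can be computed once up front instead of calling pow() in the hot loop.
std::vector<float> RopeInvFreqs(size_t dim, float base = 10000.0f) {
  std::vector<float> inv_freq(dim / 2);
  for (size_t i = 0; i < dim / 2; ++i) {
    inv_freq[i] = std::pow(
        base, -2.0f * static_cast<float>(i) / static_cast<float>(dim));
  }
  return inv_freq;
}

// Rotate each (x[i], x[i + dim/2]) pair by pos * theta_i and scale by `mul`
// in the same pass, so no separate MulByConst sweep is needed.
void RopeAndScale(float* x, size_t dim, size_t pos,
                  const std::vector<float>& inv_freq, float mul) {
  const size_t half = dim / 2;
  for (size_t i = 0; i < half; ++i) {
    const float angle = static_cast<float>(pos) * inv_freq[i];
    const float c = std::cos(angle);
    const float s = std::sin(angle);
    const float x0 = x[i];
    const float x1 = x[i + half];
    x[i] = mul * (x0 * c - x1 * s);
    x[i + half] = mul * (x0 * s + x1 * c);
  }
}
```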
The gemma.cpp Authors 27258b03e6 Improve performance logging
PiperOrigin-RevId: 660534330
2024-08-07 14:15:43 -07:00
Jan Wassenberg 5e433e774a 1.1x prefill speedup, revamp threading in preparation for hierarchical parallelism.
Limit thread counts to the detected number. Add max_clusters arg.
Update detection logic to check for smt0; previously we pinned to some SMT siblings.

PiperOrigin-RevId: 659755311
2024-08-05 18:50:09 -07:00
Phil Culliton 1982a6ba00 Internal change
PiperOrigin-RevId: 657831926
2024-07-30 20:24:54 -07:00
Jan Wassenberg a24eda8d02 Split matmul into matvec; add large matrix benchmark
Rename var names to row/col for more clarity.
Better estimate error tolerance via max abs col sum.

PiperOrigin-RevId: 657601791
2024-07-30 08:29:11 -07:00
Paul Chang d37c088e44 Extend LayersOutputFunc to take query index and auxiliary int
PiperOrigin-RevId: 657574814
2024-07-30 06:53:56 -07:00
Jan Wassenberg 8b4915f321 Fix Windows build - macro conflict with param name
PiperOrigin-RevId: 657518587
2024-07-30 03:22:32 -07:00
Jan Wassenberg 6ea4232b2e MatMul cleanup: Mat struct, simplify args.
Add large benchmark to test, use 4 threads, skip some targets.
Also use Traits::Name instead of typeid.

PiperOrigin-RevId: 657496185
2024-07-30 01:55:50 -07:00
Jan Wassenberg f27683152c 1.05x prefill speedup: matvec -> matmul for !MHA
Also add C_stride and make the shape parameters normal (non-template) arguments.

PiperOrigin-RevId: 657285945
2024-07-29 12:18:06 -07:00
Jan Wassenberg 2721f54446 Add offset arg to MatMul, rename, Matmul for logits = ~1.1x decode speedup
PiperOrigin-RevId: 657167257
2024-07-29 05:34:26 -07:00
Jan Wassenberg aaf51898b6 Major revamp #2 of Prefill: fix token order, parallel for multi-query
- Allocate only the required KV caches and activation batch size
- Add flags for batch sizes
- Const-correct interface: Span of const int.
- Also clean up the KVCache arg to a span.
- Move kPrefillBatchSize into RuntimeConfig and remove related global constants.

PiperOrigin-RevId: 655893197
2024-07-25 03:28:55 -07:00
Daniel Keysers 2346b5a434 Minor polishing: adding comments, renaming variables.
PiperOrigin-RevId: 655235006
2024-07-23 11:17:44 -07:00
Daniel Keysers 33334ad454 Fix msan uninitialized scale in optimize_test
PiperOrigin-RevId: 654817460
2024-07-22 10:50:25 -07:00
Jan Wassenberg 85cac13fb1 Split up ops.h into ops/ops-inl and matmul-inl
PiperOrigin-RevId: 654068303
2024-07-19 11:21:48 -07:00
Jan Wassenberg 5844e6a1e5 Cleanup: add wrapper functions and rename vars to interleaved
Simplifies the TransformerLayer function.
Use interleaved* instead of _and_queries.

PiperOrigin-RevId: 653929449
2024-07-19 02:04:11 -07:00
Jan Wassenberg 12016d31c3 Major Prefill/Generate cleanup, 1.3x Prefill speedup
This fixes TTFT, which was not including prefill.

PiperOrigin-RevId: 653690626
2024-07-18 11:16:46 -07:00
Jan Wassenberg 3fe79b3876 Fix msan uninitialized scale
PiperOrigin-RevId: 653655471
2024-07-18 09:42:31 -07:00
Daniel Keysers e87e65ca45 Add scale parameter to MatMul.
Add accessor to CompressedArray that asserts the scale is 1 and use it.

PiperOrigin-RevId: 653604840
2024-07-18 06:58:56 -07:00
Daniel Keysers 5a751a9a44 Update gemma-27b to the correct query scaling.
PiperOrigin-RevId: 653201646
2024-07-17 05:43:52 -07:00
Jan Wassenberg 992a2cbbc0 De-templatize Activations, add RowVectorBatch class
Also remove most kBatchSize args.

PiperOrigin-RevId: 653185525
2024-07-17 04:38:15 -07:00
Daniel Keysers ff34370aac Simplify FFW by using MatMul_4x4_Batch_Add.
Affects only the griffin model, where prefill TPS improves by about 70%.

PiperOrigin-RevId: 652878176
2024-07-16 09:41:23 -07:00
Jan Wassenberg cd530374b3 Further 1.02x prefill speedup from batch 64->512
Measured on SKX. Larger speedup expected for Zen4/SPR.

PiperOrigin-RevId: 652472928
2024-07-15 07:26:10 -07:00
The gemma.cpp Authors c879133a5a Increase the prefill batch size to 64.
PiperOrigin-RevId: 651754772
2024-07-12 06:28:37 -07:00
The gemma.cpp Authors df3fb70802 Improve readability with RepeatedAttentionWindowSizes
PiperOrigin-RevId: 651431738
2024-07-11 09:11:46 -07:00
Jan Wassenberg edaf61b983 SVE build fix: avoid capturing vectors directly.
Also use more V typedef instead of auto.

PiperOrigin-RevId: 651423685
2024-07-11 08:43:56 -07:00
Jan Wassenberg be765afce2 Simplify matmul: only 2 overloads
Also add StoreHorizontalSumsMaybeAdd wrapper function,
move MatMulSlowBatch into test.

1.02-1.06x speedup.

PiperOrigin-RevId: 651394791
2024-07-11 06:58:42 -07:00
Andrey Vlasov 3e92088595 Remove allocation from GEMM_4x4_Tile when decoding compressed weights by implementing
SfpCodec::Dec2F and CompressTraits<T>::Decompress2 for all supported types. This also allows removing one of the specializations of GEMM_4x4_Tile, handling compressed MatB with one function. As before, even when MatA is bf16, 32-bit registers are used for computation.

Measurements for a 2b-it sfp-encoded model on an AMD Ryzen Threadripper PRO 3945WX (12 cores):
baseline:
```
32.6254 prefill tokens / sec
8.91429 tokens / sec
115 milliseconds time to first token
```
this change:
```
54.3045 prefill tokens / sec
16.8191 tokens / sec
56 milliseconds time to first token
```
PiperOrigin-RevId: 651369694
2024-07-11 05:13:39 -07:00
Kan Wu f519ab6693 Refactor configurables.
PiperOrigin-RevId: 651259154
2024-07-10 21:30:58 -07:00
Andrey Vlasov 960ff4b4ec Record time measurements in MatMul tests.
PiperOrigin-RevId: 651060711
2024-07-10 10:04:40 -07:00
Daniel Keysers 063bbaa683 Add more comments to attention computation (and some small restructuring).
PiperOrigin-RevId: 650929097
2024-07-10 02:39:07 -07:00
Jan Wassenberg 6a3f7cf3ea Lint fix - string append, remove stale TODO
PiperOrigin-RevId: 650197468
2024-07-08 04:11:21 -07:00
Jan Wassenberg cbb67b4ee0 Move benchmark_helper to evals/, weights_raw to compression/.
PiperOrigin-RevId: 650155983
2024-07-08 01:13:23 -07:00
Jan Wassenberg 438b1bace2 Fix handling of %c and %q if eot_string. Fixes #283, thanks @ljcucc
PiperOrigin-RevId: 649651535
2024-07-05 07:54:00 -07:00
Jan Wassenberg 118e802b00 Fix gemma_test - moved to evals/.
PiperOrigin-RevId: 649338633
2024-07-04 02:04:05 -07:00
Jan Wassenberg c7c3daa624 7x compile time speedup: shard gemma.cc
Use overloaded functions defined in gemma/instantiations.
Also split out activations.h.

PiperOrigin-RevId: 649053122
2024-07-03 06:35:04 -07:00
Daniel Keysers a40165dea2 Small cleanups. Fixes gemma_test build.
PiperOrigin-RevId: 649008524
2024-07-03 03:13:38 -07:00
Kan Wu 7e4b20455e Add sliding window attention for Gemma 2.
PiperOrigin-RevId: 648778253
2024-07-02 11:08:03 -07:00
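For context on sliding-window (local) attention: a query attends only to the most recent `window` positions, and Gemma 2 alternates such local layers with global ones (see the RepeatedAttentionWindowSizes commit elsewhere in this log). A sketch of the window-start computation (hypothetical `WindowStart` helper, not the gemma.cpp implementation):

```cpp
#include <cstddef>

// Sliding-window (local) attention: a query at position q_pos attends only to
// keys in [WindowStart(q_pos, window), q_pos], i.e. the most recent `window`
// positions including itself.
size_t WindowStart(size_t q_pos, size_t window) {
  return (q_pos + 1 >= window) ? (q_pos + 1 - window) : 0;
}
```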
Jan Wassenberg 09a7e75ead Prep for sharding gemma.cc: split into kv_cache, tokenizer.
Move activations.h to backprop/ to make space for another activations.h.

PiperOrigin-RevId: 648744500
2024-07-02 09:31:06 -07:00
Jan Wassenberg 85fcd3cd80 Cleanup: add ModelInfo struct, remove gcpp::
PiperOrigin-RevId: 648707763
2024-07-02 07:11:15 -07:00
Jan Wassenberg b1c1ec1d59 Use benchmark_helper in py bindings (adds BOS)
Also remove thread clamp (OK to be zero or large).

PiperOrigin-RevId: 648657155
2024-07-02 03:27:15 -07:00
Jan Wassenberg e527e7662e Remove unused kSystemPrompt
PiperOrigin-RevId: 648429567
2024-07-01 11:18:07 -07:00
Jan Wassenberg af8eb2fde3 Declutter gemma/ directory, move binaries to evals/ and util/.
PiperOrigin-RevId: 648400795
2024-07-01 09:51:04 -07:00
Jan Wassenberg e588a7f45d Add config for att/final cap, skip max-subtract. Fixes #278
Also update includes/deps for backprop/.

PiperOrigin-RevId: 648399222
2024-07-01 09:45:26 -07:00
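For context on the att/final cap above: Gemma 2 soft-caps logits as cap * tanh(logit / cap), which bounds them to (-cap, cap); a bounded input is presumably why the usual max-subtraction before softmax can be skipped. A scalar sketch (hypothetical `SoftCap` helper, not the vectorized gemma.cpp code):

```cpp
#include <cmath>
#include <cstddef>

// Soft-capping: squash logits smoothly into (-cap, cap) with tanh instead of
// hard clipping. Because the result is bounded, exp() in the subsequent
// softmax cannot overflow, so the usual max-subtraction can be skipped.
void SoftCap(float* logits, size_t n, float cap) {
  for (size_t i = 0; i < n; ++i) {
    logits[i] = cap * std::tanh(logits[i] / cap);
  }
}
```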
The gemma.cpp Authors da7507e6f0 Add prompt batching to Gemma.cpp.
This CL adds a new function to Gemma that allows for batching of multiple prompts. The function takes a vector of prompts and returns a vector of responses. The prompts are processed in parallel, and the responses are returned in the same order as the prompts.

PiperOrigin-RevId: 648367559
2024-07-01 07:51:31 -07:00
Paul Chang 8ac5d66575 Introduce new Gemma 9B and 27B configs
PiperOrigin-RevId: 647299080
2024-06-27 06:45:24 -07:00
Paul Chang 78e96fdc70 Refactor model type / training tables, simplify reverse mapping
PiperOrigin-RevId: 647069372
2024-06-26 13:59:14 -07:00
The gemma.cpp Authors 7fc8ddf825 Fix a clang tidy warning
PiperOrigin-RevId: 646498062
2024-06-25 09:02:59 -07:00
The gemma.cpp Authors 12089417b5 Improve logging when running Gemma examples: fix the issue where max_tokens, max_generated_tokens and temperature were logged without any trailing space/newline.
PiperOrigin-RevId: 646014268
2024-06-24 02:00:34 -07:00
The gemma.cpp Authors 80b1347393 Skip the last RMSNormInplaceBatched in the Prefill phase.
That only modifies activations.x, but it is called with prefill_activations, which are not used after the Prefill call.

PiperOrigin-RevId: 645391387
2024-06-21 08:04:22 -07:00
Copybara-Service 82f16087ba Merge pull request #266 from ufownl:bugfix/kvcache
PiperOrigin-RevId: 645329504
2024-06-21 03:06:52 -07:00
RangerUFO f7855251ea Fix compilation errors in clang
It will occur in `ubuntu-latest` of GitHub Actions.
2024-06-21 13:40:40 +08:00
RangerUFO d7787c8f6c Fix KV cache size calculation error 2024-06-21 13:06:26 +08:00
Daniel Keysers 0570972d43 Fixing two typos.
PiperOrigin-RevId: 645103198
2024-06-20 11:33:12 -07:00
The gemma.cpp Authors a85725614a Refactor kCachePosSize and kCacheLayerSize into separate functors.
PiperOrigin-RevId: 645048519
2024-06-20 08:52:08 -07:00
Jan Wassenberg 48ebba8b7a Code cleanup
- Simplify template arg list, enable deduction
- add missing hn:: on Lanes
- 1.0f suffix
- move RMSNormBatched into ops.h
- static constexpr -> constexpr
- concrete type instead of LayerT, WeightArrayT
- inline GetWeights
- remove if (runtime_config.verbosity
- merge AllocatePrefill and AllocateDecode
- remove bf_ffw_hidden

PiperOrigin-RevId: 644931277
2024-06-20 01:10:24 -07:00
The gemma.cpp Authors 658fb3e506 Move test placeholder to a later pos.
PiperOrigin-RevId: 644808456
2024-06-19 13:24:10 -07:00
The gemma.cpp Authors 0e612d9a20 Split out common parts (embedder and transformer block) from Prefill() and Transformer() into separate functions.
PiperOrigin-RevId: 644455520
2024-06-18 11:24:56 -07:00
Paul Chang d7d9d14f0e Move kGriffinLayers into ConfigNoSSM, set kGemmaLayers directly
For regular (non-SSM) Gemma models, kGriffinLayers is by definition always zero
and kGemmaLayers is just the number of layers.

PiperOrigin-RevId: 644384531
2024-06-18 07:52:52 -07:00
Jan Wassenberg 70506b0a62 Fix debug_prompt and other binaries (internal init)
PiperOrigin-RevId: 644367683
2024-06-18 06:48:59 -07:00
Jan Wassenberg 15135f5b3d Simplify Attention.
Shared kMHA, reuse from Activations,
inline Attn lambda, use QDim as the stride between successive Q.

PiperOrigin-RevId: 644343854
2024-06-18 05:08:12 -07:00
Jan Wassenberg 2ac47e4a06 Fix Py binding/run_example: use GemmaEnv
PiperOrigin-RevId: 644318962
2024-06-18 03:20:22 -07:00
Jan Wassenberg a07f60c9a1 1.15x 7b sfp prefill speedup: Matmul in attention
2b bf16:
prefill 114.456 -> 115.222
decode  16.8847 -> 16.9987

7b sfp:
prefill 18.8575 -> 21.7325
decode 5.68428 -> 5.79791

PiperOrigin-RevId: 644283676
2024-06-18 01:00:51 -07:00
Jan Wassenberg 704d936764 Further simplification to ForEachTensor, thanks I.K.
PiperOrigin-RevId: 643996210
2024-06-17 07:12:26 -07:00
Jan Wassenberg 7d0720675f Move raw_weights into separate header, used mainly by compress_weights.
Fix warnings in backprop/* (include)

PiperOrigin-RevId: 643983136
2024-06-17 06:17:02 -07:00
Jan Wassenberg ad790d89d1 Fix DASSERT - TiledBatch requires at least 2 vectors.
Also use shorthand for weight types.

PiperOrigin-RevId: 643958371
2024-06-17 04:29:01 -07:00
The gemma.cpp Authors 7dbfa44794 Refactor CompressedWeights.
PiperOrigin-RevId: 643934198
2024-06-17 02:54:54 -07:00
Ray Smith e0afdfa8fb Added bias vector addition to MatMul
PiperOrigin-RevId: 643385381
2024-06-14 10:25:16 -07:00
The gemma.cpp Authors 2228055bb8 Internal change.
PiperOrigin-RevId: 643330703
2024-06-14 06:53:41 -07:00
Jan Wassenberg 29c0c574e6 Integrate matmul into FFW: 4.3x prefill speedup
```
before, bf16:
27.2929 prefill tokens / sec
17.2114 tokens / sec

after, bf16
116.496 prefill tokens / sec
17.5391 tokens / sec
```

PiperOrigin-RevId: 643328437
2024-06-14 06:32:26 -07:00
Ray Smith 198326a682 Removed now redundant non-batch matmul
PiperOrigin-RevId: 643317187
2024-06-14 05:13:36 -07:00
Andrey Vlasov b17631c95f Implement a missing (bf16, f32) tiled MatMul kernel.
PiperOrigin-RevId: 643313676
2024-06-14 04:54:40 -07:00
Jan Wassenberg d3c6a45b59 Major duplicated code reduction in test/benchmarks
Helper functions to tokenize/wrap
Move LayersOutputFunc into RuntimeConfig
AcceptFunc passes the probability
Implement StringFromType using the parser, and verify results match

PiperOrigin-RevId: 643255119
2024-06-14 00:16:25 -07:00
Jan Wassenberg c15ff9529c Reduce duplication in Config* by inheriting no-SSM
PiperOrigin-RevId: 643030629
2024-06-13 09:48:56 -07:00
Ray Smith ea525da967 Added MatMul_4x4_Batch, which is MatMul_4x4 but with the first template arg moved to the first function arg, so the batch size (number of A rows) can vary at run time.
PiperOrigin-RevId: 643017973
2024-06-13 09:05:40 -07:00
The gemma.cpp Authors 1b40619864 Increase parallelism in ops_test
PiperOrigin-RevId: 643013415
2024-06-13 08:50:41 -07:00