gemma.cpp

Commit Graph

Author	SHA1	Message	Date
Daniel Keysers	a8e08778d4	Add an additional QueryModel() overload to GemmaEnv. Use args only in GemmaEnv constructor, store everything else in RuntimeConfig. Add runtime option to turn off thread spinning. PiperOrigin-RevId: 670467320	2024-09-03 02:25:19 -07:00
Zoltan Szabadka	f6abbab3a4	Fix asan failure in local attention computation. PiperOrigin-RevId: 670207380	2024-09-02 07:06:10 -07:00
Jan Wassenberg	4033ed9e78	Avoid duplication of RMSNorm, support all activation/weight types Add test for RMSNorm Rename VectorizedRopeAndMulBy -> RopeAndMulBy Move test_util to util/ PiperOrigin-RevId: 668332927	2024-08-28 01:26:55 -07:00
Daniel Keysers	18e6012872	Fix prefill for batched queries. This lets gemma_test/GeographyBatched pass now also for gemma2-27B. PiperOrigin-RevId: 664827485	2024-08-19 08:50:42 -07:00
Apoorv Reddy	c6eb3b6f0d	VectorizedRopeAndMulBy. ~8x reduction (tested on few prompts) in Rope. ~3.8% prefill latency improvement. ~2.6% decode latency improvement. PiperOrigin-RevId: 664650108	2024-08-18 23:17:01 -07:00
Paul Chang	773333e5be	Expose underlying model configuration: number of layers, heads, etc. PiperOrigin-RevId: 663747853	2024-08-16 09:03:24 -07:00
Jan Wassenberg	301dc8067a	Major MatMul update, 1.9-2.3x speedup on Zen4 via bf16 mul Supports converting all weight/activation formats to native MulT (bf16/f32) Also: - ConstMat/MutableMat for const correctness - Move RowVectorBatch to allocator.h so it can be used from Matmul - Add matmul.h so MatMulEnv can be used from Activations - Remove kMaxThreads, detect from PerClusterPools - Build fix: -inl.h files must be textual_hdrs, and highway.h should precede -inl.h ``` zen4 new 64, 24576, 3072, add=0, MatTA=bf16, MatTB=sfp: 616.6 GFLOPS. 64, 3072, 24576, add=0, MatTA=bf16, MatTB=sfp: 460.7 GFLOPS. 64, 24576, 3072, add=0, MatTA=f32, MatTB=sfp: 598.6 GFLOPS. 64, 3072, 24576, add=0, MatTA=f32, MatTB=sfp: 435.6 GFLOPS. zen4 old 64, 24576, 3072, add=0, MatTA=f32, MatTB=sfp: 257.5 GFLOPS. 64, 3072, 24576, add=0, MatTA=f32, MatTB=sfp: 231.9 GFLOPS. ``` PiperOrigin-RevId: 663729812	2024-08-16 07:52:20 -07:00
Paul Chang	b9ed12a325	Support directly observing activations, partially replacing LayersOutputFunc LayersOutputFunc is no longer invoked for "blocks" and "final_norm" outputs. Instead, we directly expose the Activations structure. PiperOrigin-RevId: 663409316	2024-08-15 12:39:07 -07:00
Jan Wassenberg	22995c699d	Simplify pos handling, auto-increment output arg - no longer multiply by num_queries - remove unused interleaved prompts - Rename to Queries* - Rename batch_start/interleaved_pos/pos to queries_pos PiperOrigin-RevId: 663331823	2024-08-15 09:25:26 -07:00
Copybara-Service	6763afcd1c	Merge pull request #348 from ufownl:feature/start_pos_per_query_reopen PiperOrigin-RevId: 662533529	2024-08-13 08:51:06 -07:00
RangerUFO	8c634f6486	Fix the position calculation issue in the generation phase	2024-08-12 18:50:23 +02:00
RangerUFO	730b6bfc94	Implement `start_pos` per query for batch interface	2024-08-12 18:50:23 +02:00
Jan Wassenberg	b831fa8482	1.3x prefill, 0.95x decode: matmul replacing last matvec Before 38.28, 9.17 (with profiler enabled, prompt = 330 tok) ``` Gen.FFW : 15414 x 4692352 = 24.166318 Gen.Attention.SumHeads : 15414 x 1394804 = 7.183451 !! Gen.Embedding : 361 x 49961894 = 6.026297 Gen.Attention.QKV : 15414 x 1005125 = 5.176546 Gen.Attention.DotSoftmax : 15414 x 885480 = 4.560357 RopeAndMulBy : 696528 x 11867 = 2.761818 ``` After 49.80, 8.68 ``` Gen.FFW : 14448 x 5312783 = 25.646868 Gen.Embedding : 338 x 63044815 = 7.119845 Gen.Attention.QKV : 14448 x 1115003 = 5.382557 Gen.Attention.DotSoftmax : 14448 x 897577 = 4.332957 RopeAndMulBy : 673344 x 11886 = 2.674156 Gen.Attention.SumHeads : 14448 x 518291 = 2.501993 !! ``` PiperOrigin-RevId: 662024085	2024-08-12 03:36:01 -07:00
Jan Wassenberg	282f73ec2f	Add pin flag to disable pinning. Refs #338 PiperOrigin-RevId: 661389171	2024-08-09 13:47:12 -07:00
Apoorv Reddy	fd1b0743a7	Rename Gemma9B and Gemma27B to Gemma2_9B and Gemma2_27B. This is to make it clear that these models are part of the Gemma2 family of models. PiperOrigin-RevId: 661181682	2024-08-09 02:09:06 -07:00
Jan Wassenberg	2ebbe4076f	1.03-1.08x decode speedup: precompute Rope theta, fuse Split attention into functions, move into class. Fuse Rope and MulBy, allow non-in-place version to avoid copy from q to KV. Sink if() into MaybeLogitsSoftCap. PiperOrigin-RevId: 661168418	2024-08-09 01:23:24 -07:00
The gemma.cpp Authors	27258b03e6	Improve performance logging PiperOrigin-RevId: 660534330	2024-08-07 14:15:43 -07:00
Jan Wassenberg	5e433e774a	1.1x prefill speedup, revamp threading in preparation for hierarchical parallelism. Limit thread counts to detected. Add max_clusters arg. Update detection logic to check for smt0 - previously we pinned to some siblings. PiperOrigin-RevId: 659755311	2024-08-05 18:50:09 -07:00
Phil Culliton	1982a6ba00	Internal change PiperOrigin-RevId: 657831926	2024-07-30 20:24:54 -07:00
Jan Wassenberg	a24eda8d02	Split matmul into matvec; add large matrix benchmark Rename var names to row/col for more clarity. Better estimate error tolerance via max abs col sum. PiperOrigin-RevId: 657601791	2024-07-30 08:29:11 -07:00
Paul Chang	d37c088e44	Extend LayersOutputFunc to take query index and auxiliary int PiperOrigin-RevId: 657574814	2024-07-30 06:53:56 -07:00
Jan Wassenberg	8b4915f321	Fix Windows build - macro conflict with param name PiperOrigin-RevId: 657518587	2024-07-30 03:22:32 -07:00
Jan Wassenberg	6ea4232b2e	MatMul cleanup: Mat struct, simplify args. Add large benchmark to test, use 4 threads, skip some targets. Also use Traits::Name instead of typeid. PiperOrigin-RevId: 657496185	2024-07-30 01:55:50 -07:00
Jan Wassenberg	f27683152c	1.05x prefill speedup: matvec -> matmul for !MHA Also add C_stride and make shape normal non-template arguments. PiperOrigin-RevId: 657285945	2024-07-29 12:18:06 -07:00
Jan Wassenberg	2721f54446	Add offset arg to MatMul, rename, Matmul for logits = ~1.1x decode speedup PiperOrigin-RevId: 657167257	2024-07-29 05:34:26 -07:00
Jan Wassenberg	aaf51898b6	Major revamp #2 of Prefill: fix token order, parallel for multi-query - Allocate only the required KV caches and activation batch size - Add flags for batch sizes - Const-correct interface: Span of const int. - Also clean up the KVCache arg to a span. - Move kPrefillBatchSize into RuntimeConfig and remove related global constants. PiperOrigin-RevId: 655893197	2024-07-25 03:28:55 -07:00
Daniel Keysers	2346b5a434	Minor polishing: adding comments, renaming variables. PiperOrigin-RevId: 655235006	2024-07-23 11:17:44 -07:00
Daniel Keysers	33334ad454	Fix msan uninitialized scale in optimize_test PiperOrigin-RevId: 654817460	2024-07-22 10:50:25 -07:00
Jan Wassenberg	85cac13fb1	Split up ops.h into ops/ops-inl and matmul-inl PiperOrigin-RevId: 654068303	2024-07-19 11:21:48 -07:00
Jan Wassenberg	5844e6a1e5	Cleanup: add wrapper functions and rename vars to interleaved Simplifies the TransformerLayer function. Use interleaved* instead of _and_queries. PiperOrigin-RevId: 653929449	2024-07-19 02:04:11 -07:00
Jan Wassenberg	12016d31c3	Major Prefill/Generate cleanup, 1.3x Prefill speedup This fixes TTFT, which was not including prefill. PiperOrigin-RevId: 653690626	2024-07-18 11:16:46 -07:00
Jan Wassenberg	3fe79b3876	Fix msan uninitialized scale PiperOrigin-RevId: 653655471	2024-07-18 09:42:31 -07:00
Daniel Keysers	e87e65ca45	Add scale parameter to MatMul. Add accessor to CompressedArray that asserts the scale is 1 and use it. PiperOrigin-RevId: 653604840	2024-07-18 06:58:56 -07:00
Daniel Keysers	5a751a9a44	Update gemma-27b to the correct query scaling. PiperOrigin-RevId: 653201646	2024-07-17 05:43:52 -07:00
Jan Wassenberg	992a2cbbc0	De-templatize Activations, add RowVectorBatch class Also remove most kBatchSize args. PiperOrigin-RevId: 653185525	2024-07-17 04:38:15 -07:00
Daniel Keysers	ff34370aac	Simplify FFW by using MatMul_4x4_Batch_Add. Affects only the griffin model, where prefill TPS improves by about 70%. PiperOrigin-RevId: 652878176	2024-07-16 09:41:23 -07:00
Jan Wassenberg	cd530374b3	Further 1.02x prefill speedup from batch 64->512 Measured on SKX. Larger speedup expected for Zen4/SPR. PiperOrigin-RevId: 652472928	2024-07-15 07:26:10 -07:00
The gemma.cpp Authors	c879133a5a	Increase the prefill batch size to 64. PiperOrigin-RevId: 651754772	2024-07-12 06:28:37 -07:00
The gemma.cpp Authors	df3fb70802	Improve readability with RepeatedAttentionWindowSizes PiperOrigin-RevId: 651431738	2024-07-11 09:11:46 -07:00
Jan Wassenberg	edaf61b983	SVE build fix: avoid capturing vectors directly. Also use more V typedef instead of auto. PiperOrigin-RevId: 651423685	2024-07-11 08:43:56 -07:00
Jan Wassenberg	be765afce2	Simplify matmul: only 2 overloads Also add StoreHorizontalSumsMaybeAdd wrapper function, move MatMulSlowBatch into test. 1.02-1.06x speedup. PiperOrigin-RevId: 651394791	2024-07-11 06:58:42 -07:00
Andrey Vlasov	3e92088595	Remove allocation from GEMM_4x4_Tile when decoding compressed weights by implementing SfpCodec::Dec2F and CompressTraits<T>::Decompress2 for all supported types. It also allows removing one of the specializations of GEMM_4x4_Tile, handling compressed MatB with one function. As before, even when MatA is bf16 it is using 32-bit registers for computations. Measurements for a 2b-it sfp-encoded model on an AMD Ryzen Threadripper PRO 3945WX 12-Cores: baseline: ``` 32.6254 prefill tokens / sec 8.91429 tokens / sec 115 milliseconds time to first token ``` this change: ``` 54.3045 prefill tokens / sec 16.8191 tokens / sec 56 milliseconds time to first token ``` PiperOrigin-RevId: 651369694	2024-07-11 05:13:39 -07:00
Kan Wu	f519ab6693	Refactor configurables. PiperOrigin-RevId: 651259154	2024-07-10 21:30:58 -07:00
Andrey Vlasov	960ff4b4ec	Record time measurements in MatMul tests. PiperOrigin-RevId: 651060711	2024-07-10 10:04:40 -07:00
Daniel Keysers	063bbaa683	Add more comments to attention computation (and some small restructuring). PiperOrigin-RevId: 650929097	2024-07-10 02:39:07 -07:00
Jan Wassenberg	6a3f7cf3ea	Lint fix - string append, remove stale TODO PiperOrigin-RevId: 650197468	2024-07-08 04:11:21 -07:00
Jan Wassenberg	cbb67b4ee0	Move benchmark_helper to evals/, weights_raw to compression/. PiperOrigin-RevId: 650155983	2024-07-08 01:13:23 -07:00
Jan Wassenberg	438b1bace2	Fix handling of %c and %q if eot_string. Fixes #283 , thanks @ljcucc PiperOrigin-RevId: 649651535	2024-07-05 07:54:00 -07:00
Jan Wassenberg	118e802b00	Fix gemma_test - moved to evals/. PiperOrigin-RevId: 649338633	2024-07-04 02:04:05 -07:00
Jan Wassenberg	c7c3daa624	7x compile time speedup: shard gemma.cc Use overloaded functions defined in gemma/instantiations. Also split out activations.h. PiperOrigin-RevId: 649053122	2024-07-03 06:35:04 -07:00

181 Commits