gemma.cpp

Commit Graph

Author	SHA1	Message	Date
Daniel Keysers	18e6012872	Fix prefill for batched queries. This lets gemma_test/GeographyBatched pass now also for gemma2-27B. PiperOrigin-RevId: 664827485	2024-08-19 08:50:42 -07:00
Apoorv Reddy	c6eb3b6f0d	VectorizedRopeAndMulBy. ~8x reduction (tested on few prompts) in Rope. ~3.8% prefill latency improvement. ~2.6% decode latency improvement. PiperOrigin-RevId: 664650108	2024-08-18 23:17:01 -07:00
Paul Chang	773333e5be	Expose underlying model configuration: number of layers, heads, etc. PiperOrigin-RevId: 663747853	2024-08-16 09:03:24 -07:00
Jan Wassenberg	301dc8067a	Major MatMul update, 1.9-2.3x speedup on Zen4 via bf16 mul Supports converting all weight/activation formats to native MulT (bf16/f32) Also: - ConstMat/MutableMat for const correctness - Move RowVectorBatch to allocator.h so it can be used from Matmul - Add matmul.h so MatMulEnv can be used from Activations - Remove kMaxThreads, detect from PerClusterPools - Build fix: -inl.h files must be textual_hdrs, and highway.h should precede -inl.h ``` zen4 new 64, 24576, 3072, add=0, MatTA=bf16, MatTB=sfp: 616.6 GFLOPS. 64, 3072, 24576, add=0, MatTA=bf16, MatTB=sfp: 460.7 GFLOPS. 64, 24576, 3072, add=0, MatTA=f32, MatTB=sfp: 598.6 GFLOPS. 64, 3072, 24576, add=0, MatTA=f32, MatTB=sfp: 435.6 GFLOPS. zen4 old 64, 24576, 3072, add=0, MatTA=f32, MatTB=sfp: 257.5 GFLOPS. 64, 3072, 24576, add=0, MatTA=f32, MatTB=sfp: 231.9 GFLOPS. ``` PiperOrigin-RevId: 663729812	2024-08-16 07:52:20 -07:00
The gemma.cpp Authors	6c57feb52f	Automated Code Change PiperOrigin-RevId: 663622838	2024-08-16 00:01:24 -07:00
Paul Chang	b9ed12a325	Support directly observing activations, partially replacing LayersOutputFunc LayersOutputFunc is no longer invoked for "blocks" and "final_norm" outputs. Instead, we directly expose the Activations structure. PiperOrigin-RevId: 663409316	2024-08-15 12:39:07 -07:00
Jan Wassenberg	22995c699d	Simplify pos handling, auto-increment output arg - no longer multiply by num_queries - remove unused interleaved prompts - Rename to Queries* - Rename batch_start/interleaved_pos/pos to queries_pos PiperOrigin-RevId: 663331823	2024-08-15 09:25:26 -07:00
Copybara-Service	6763afcd1c	Merge pull request #348 from ufownl:feature/start_pos_per_query_reopen PiperOrigin-RevId: 662533529	2024-08-13 08:51:06 -07:00
RangerUFO	8c634f6486	Fix the position calculation issue in the generation phase	2024-08-12 18:50:23 +02:00
RangerUFO	ea72575e56	Fix build issues when tests are enabled	2024-08-12 18:50:23 +02:00
RangerUFO	730b6bfc94	Implement `start_pos` per query for batch interface	2024-08-12 18:50:23 +02:00
Jan Wassenberg	8e028632f7	0.98x prefill: refactor in prep for cache blocking. Slower because we now init tiles of C and accumulate into them. Also remove unused var in optimize_test and use BF16 typedef. PiperOrigin-RevId: 662115916	2024-08-12 09:26:29 -07:00
Daniel Keysers	7316ee8f96	Fix gemma_test GeographyBatched for 2b-it and add entropy expectations for gemma2-2b-it. PiperOrigin-RevId: 662072395	2024-08-12 07:12:46 -07:00
Jan Wassenberg	b831fa8482	1.3x prefill, 0.95x decode: matmul replacing last matvec Before 38.28, 9.17 (with profiler enabled, prompt = 330 tok) ``` Gen.FFW : 15414 x 4692352 = 24.166318 Gen.Attention.SumHeads : 15414 x 1394804 = 7.183451 !! Gen.Embedding : 361 x 49961894 = 6.026297 Gen.Attention.QKV : 15414 x 1005125 = 5.176546 Gen.Attention.DotSoftmax : 15414 x 885480 = 4.560357 RopeAndMulBy : 696528 x 11867 = 2.761818 ``` After 49.80, 8.68 ``` Gen.FFW : 14448 x 5312783 = 25.646868 Gen.Embedding : 338 x 63044815 = 7.119845 Gen.Attention.QKV : 14448 x 1115003 = 5.382557 Gen.Attention.DotSoftmax : 14448 x 897577 = 4.332957 RopeAndMulBy : 673344 x 11886 = 2.674156 Gen.Attention.SumHeads : 14448 x 518291 = 2.501993 !! ``` PiperOrigin-RevId: 662024085	2024-08-12 03:36:01 -07:00
Jan Wassenberg	282f73ec2f	Add pin flag to disable pinning. Refs #338 PiperOrigin-RevId: 661389171	2024-08-09 13:47:12 -07:00
Apoorv Reddy	fd1b0743a7	Rename Gemma9B and Gemma27B to Gemma2_9B and Gemma2_27B. This is to make it clear that these models are part of the Gemma2 family of models. PiperOrigin-RevId: 661181682	2024-08-09 02:09:06 -07:00
Jan Wassenberg	2ebbe4076f	1.03-1.08x decode speedup: precompute Rope theta, fuse Split attention into functions, move into class. Fuse Rope and MulBy, allow non-in-place version to avoid copy from q to KV. Sink if() into MaybeLogitsSoftCap. PiperOrigin-RevId: 661168418	2024-08-09 01:23:24 -07:00
The gemma.cpp Authors	27258b03e6	Improve performance logging PiperOrigin-RevId: 660534330	2024-08-07 14:15:43 -07:00
Jan Wassenberg	4154f5a910	Document Gemma 2 model names PiperOrigin-RevId: 659858832	2024-08-06 01:44:15 -07:00
Jan Wassenberg	5e433e774a	1.1x prefill speedup, revamp threading in preparation for hierarchical parallelism. Limit thread counts to detected. Add max_clusters arg. Update detection logic to check for smt0 - previously we pinned to some siblings. PiperOrigin-RevId: 659755311	2024-08-05 18:50:09 -07:00
Jan Wassenberg	1617e1a33d	SFP speedup: 1.14x f32, 1.19x bf16 dot = 1.02x prefill 12->9 ops by recognizing the upper/lower bytes are simply shifted. PiperOrigin-RevId: 659609241	2024-08-05 10:59:13 -07:00
Phil Culliton	1982a6ba00	Internal change PiperOrigin-RevId: 657831926	2024-07-30 20:24:54 -07:00
Jan Wassenberg	a24eda8d02	Split matmul into matvec; add large matrix benchmark Rename var names to row/col for more clarity. Better estimate error tolerance via max abs col sum. PiperOrigin-RevId: 657601791	2024-07-30 08:29:11 -07:00
Paul Chang	d37c088e44	Extend LayersOutputFunc to take query index and auxiliary int PiperOrigin-RevId: 657574814	2024-07-30 06:53:56 -07:00
Jan Wassenberg	8b4915f321	Fix Windows build - macro conflict with param name PiperOrigin-RevId: 657518587	2024-07-30 03:22:32 -07:00
Jan Wassenberg	6ea4232b2e	MatMul cleanup: Mat struct, simplify args. Add large benchmark to test, use 4 threads, skip some targets. Also use Traits::Name instead of typeid. PiperOrigin-RevId: 657496185	2024-07-30 01:55:50 -07:00
Thomas Fischbacher	d9f86f8e4d	Add Python code for converting Griffin Orbax weights. Refs #301 PiperOrigin-RevId: 657296255	2024-07-29 12:53:30 -07:00
Jan Wassenberg	f27683152c	1.05x prefill speedup: matvec -> matmul for !MHA Also add C_stride and make shape normal non-template arguments. PiperOrigin-RevId: 657285945	2024-07-29 12:18:06 -07:00
Jan Wassenberg	2721f54446	Add offset arg to MatMul, rename, Matmul for logits = ~1.1x decode speedup PiperOrigin-RevId: 657167257	2024-07-29 05:34:26 -07:00
Jan Wassenberg	aaf51898b6	Major revamp #2 of Prefill: fix token order, parallel for multi-query - Allocate only the required KV caches and activation batch size - Add flags for batch sizes - Const-correct interface: Span of const int. - Also clean up the KVCache arg to a span. - Move kPrefillBatchSize into RuntimeConfig and remove related global constants. PiperOrigin-RevId: 655893197	2024-07-25 03:28:55 -07:00
The gemma.cpp Authors	c1f243c351	Fix setting scales in Py binding PiperOrigin-RevId: 655284183	2024-07-23 13:32:50 -07:00
Daniel Keysers	2346b5a434	Minor polishing: adding comments, renaming variables. PiperOrigin-RevId: 655235006	2024-07-23 11:17:44 -07:00
Daniel Keysers	33334ad454	Fix msan uninitialized scale in optimize_test PiperOrigin-RevId: 654817460	2024-07-22 10:50:25 -07:00
The gemma.cpp Authors	74a6dc8f33	Use all CPU sockets when pinning threads to cores PiperOrigin-RevId: 654800375	2024-07-22 10:09:16 -07:00
Jan Wassenberg	85cac13fb1	Split up ops.h into ops/ops-inl and matmul-inl PiperOrigin-RevId: 654068303	2024-07-19 11:21:48 -07:00
Jan Wassenberg	5844e6a1e5	Cleanup: add wrapper functions and rename vars to interleaved Simplifies the TransformerLayer function. Use interleaved* instead of _and_queries. PiperOrigin-RevId: 653929449	2024-07-19 02:04:11 -07:00
Jan Wassenberg	12016d31c3	Major Prefill/Generate cleanup, 1.3x Prefill speedup This fixes TTFT, which was not including prefill. PiperOrigin-RevId: 653690626	2024-07-18 11:16:46 -07:00
Jan Wassenberg	3fe79b3876	Fix msan uninitialized scale PiperOrigin-RevId: 653655471	2024-07-18 09:42:31 -07:00
Daniel Keysers	e87e65ca45	Add scale parameter to MatMul. Add accessor to CompressedArray that asserts the scale is 1 and use it. PiperOrigin-RevId: 653604840	2024-07-18 06:58:56 -07:00
Daniel Keysers	5a751a9a44	Update gemma-27b to the correct query scaling. PiperOrigin-RevId: 653201646	2024-07-17 05:43:52 -07:00
Jan Wassenberg	992a2cbbc0	De-templatize Activations, add RowVectorBatch class Also remove most kBatchSize args. PiperOrigin-RevId: 653185525	2024-07-17 04:38:15 -07:00
Daniel Keysers	ff34370aac	Simplify FFW by using MatMul_4x4_Batch_Add. Affects only the griffin model, where prefill TPS improves by about 70%. PiperOrigin-RevId: 652878176	2024-07-16 09:41:23 -07:00
Paul Chang	48b900b1b9	Fix examples/hello_world for real. PiperOrigin-RevId: 652509319	2024-07-15 09:38:52 -07:00
Jan Wassenberg	cd530374b3	Further 1.02x prefill speedup from batch 64->512 Measured on SKX. Larger speedup expected for Zen4/SPR. PiperOrigin-RevId: 652472928	2024-07-15 07:26:10 -07:00
Paul Chang	aaee666a1d	Fix gemma_cpp/examples/hello_world build. Include Bazel build rules, too. PiperOrigin-RevId: 652469406	2024-07-15 07:11:01 -07:00
The gemma.cpp Authors	c879133a5a	Increase the prefill batch size to 64. PiperOrigin-RevId: 651754772	2024-07-12 06:28:37 -07:00
The gemma.cpp Authors	df3fb70802	Improve readability with RepeatedAttentionWindowSizes PiperOrigin-RevId: 651431738	2024-07-11 09:11:46 -07:00
Jan Wassenberg	edaf61b983	SVE build fix: avoid capturing vectors directly. Also use more V typedef instead of auto. PiperOrigin-RevId: 651423685	2024-07-11 08:43:56 -07:00
Jan Wassenberg	be765afce2	Simplify matmul: only 2 overloads Also add StoreHorizontalSumsMaybeAdd wrapper function, move MatMulSlowBatch into test. 1.02-1.06x speedup. PiperOrigin-RevId: 651394791	2024-07-11 06:58:42 -07:00
Andrey Vlasov	3e92088595	Remove allocation from GEMM_4x4_Tile when decoding compressed weights by implementing SfpCodec::Dec2F and CompressTraits<T>::Decompress2 for all supported types. This also allows removing one of the specializations of GEMM_4x4_Tile, handling compressed MatB with one function. As before, even when MatA is bf16, it uses 32-bit registers for computations. Measurements for a 2b-it sfp-encoded model on an AMD Ryzen Threadripper PRO 3945WX 12-Cores: baseline: ``` 32.6254 prefill tokens / sec 8.91429 tokens / sec 115 milliseconds time to first token ``` this change: ``` 54.3045 prefill tokens / sec 16.8191 tokens / sec 56 milliseconds time to first token ``` PiperOrigin-RevId: 651369694	2024-07-11 05:13:39 -07:00

621 Commits

All Branches