gemma.cpp

Commit Graph

Author	SHA1	Message	Date
Andrey Vlasov	38eb452b94	Support mixed (bf16, sfp) tiled MatMul. Same sfp-decompress strategy as in (f32, sfp) tiled MatMul. PiperOrigin-RevId: 642901844	2024-06-13 02:07:21 -07:00
Daniel Keysers	6e67a6d8a9	Tiny cleanup: distinguish between "ids" and "pieces" in argument names when encoding. PiperOrigin-RevId: 642614278	2024-06-12 07:52:13 -07:00
Daniel Keysers	1ac9857014	Extends Transformer() to prepare for batched processing. PiperOrigin-RevId: 642603025	2024-06-12 07:01:03 -07:00
The gemma.cpp Authors	2a0e6ee976	Fix numerical issue in Softcap by subtracting max. Also update test threshold. PiperOrigin-RevId: 642587468	2024-06-12 05:42:16 -07:00
The gemma.cpp Authors	f467670de7	Implement float * SfpStream matmul by decompressing 4 * kColsA_RowsB -sized chunks of the second matrix. PiperOrigin-RevId: 642533996	2024-06-12 01:11:59 -07:00
Ray Smith	bdf33c7008	Updated benchmarks.cc to recent changes to Gemma API. PiperOrigin-RevId: 642285902	2024-06-11 08:55:40 -07:00
Phil Culliton	b6565e3bf6	Update AssertClose for large matrices and add large matrix test PiperOrigin-RevId: 642277221	2024-06-11 08:22:47 -07:00
Jan Wassenberg	3e2396f98c	Use Loader/AppArgs to construct gemma_test model, simplify AcceptFunc accept_token: allow default, check if empty when using allow mixing sample_func and stream_func, call the latter after the former Also fix missing includes/deps. PiperOrigin-RevId: 642240012	2024-06-11 05:53:10 -07:00
Daniel Keysers	c557ad23a8	Adds simple-loop versions of missing batched functions. PiperOrigin-RevId: 642189741	2024-06-11 02:14:02 -07:00
Jan Wassenberg	c7f5e93136	Update benchmark with internal init PiperOrigin-RevId: 641929308	2024-06-10 09:35:16 -07:00
Copybara-Service	49d814b519	Merge pull request #224 from szabadka:cleanup PiperOrigin-RevId: 641922102	2024-06-10 09:11:13 -07:00
Jan Wassenberg	c1c6714ad4	Internal experiment PiperOrigin-RevId: 641915024	2024-06-10 08:46:10 -07:00
Zoltan Szabadka	a3a75b77f9	Use CompressedWeights<TConfig<float>> in backpropagation. kWeightsAreCompressed are removed and LoadRawWeights is moved to compress_weights.cc	2024-06-10 14:34:24 +00:00
Phil Culliton	c5bcb5438c	Fix for transpose matrix creation and additional tests PiperOrigin-RevId: 641868053	2024-06-10 05:24:04 -07:00
Jan Wassenberg	36e6915e18	Add CPU output, error if not C++17, simplify tokenizer ctor PiperOrigin-RevId: 641850879	2024-06-10 04:01:11 -07:00
Phil Culliton	d985d8b867	Shifting large matrix init to heap in ops_test.cc PiperOrigin-RevId: 641311100	2024-06-07 11:38:42 -07:00
Jan Wassenberg	f9b390b134	Support all weight types in a single binary. This changes the command line flags, but the default value retains the previous behavior. Also add a CreateGemma helper to enable extra args without interface changes. PiperOrigin-RevId: 641266411	2024-06-07 09:04:45 -07:00
Copybara-Service	24db2ff725	Merge pull request #217 from szabadka:cross-entropy PiperOrigin-RevId: 641241133	2024-06-07 07:17:35 -07:00
Daniel Keysers	06f814fc8b	Small code cleanup suggestions while reading the code. PiperOrigin-RevId: 641220788	2024-06-07 05:33:17 -07:00
Zoltan Szabadka	465998d25a	Add support for custom sampling function to runtime config. With this addition the ComputeCrossEntropy function can be moved to its own library, because now we can compute it using only the public API functions from gemma.h	2024-06-07 11:45:07 +00:00
Copybara-Service	f7ac7092d6	Merge pull request #212 from szabadka:adam2 PiperOrigin-RevId: 641182573	2024-06-07 02:25:18 -07:00
Zoltan Szabadka	c004799cdc	Add Adam optimizer. Drive-by: Fix compilation errors and tests for backprop functions.	2024-06-06 18:41:36 +00:00
Jan Wassenberg	12707ade80	Toward only using compressed weights: CompressedLayer should all be f32 when weights are f32. PiperOrigin-RevId: 640954519	2024-06-06 11:00:23 -07:00
Paul Chang	6c0be20fa6	Fix Softmax on SVE PiperOrigin-RevId: 640947138	2024-06-06 10:39:30 -07:00
The gemma.cpp Authors	39d4115717	Implement mixed mode matmul: f32 * bf16 PiperOrigin-RevId: 640940962	2024-06-06 10:21:46 -07:00
Jan Wassenberg	57c2cd8b52	Simplifications: remove GemmaInterface and GemmaImpl Split common and weights into separate lib Remove common-inl (does not have to be SIMD code), activations.cc Centralize switch(Model) to avoid duplication Move CompressWeightsT to compress_weights.cc Move LoadWeights to weights.cc PiperOrigin-RevId: 640869202	2024-06-06 05:54:21 -07:00
Jan Wassenberg	5c3e5f7038	Remove no longer required stats.h - use Highway version instead PiperOrigin-RevId: 640440379	2024-06-05 01:37:48 -07:00
Paul Chang	175e389c3c	revert back to HWY_ASSERT for lane constraints, qualify hn::Add PiperOrigin-RevId: 640193239	2024-06-04 10:10:18 -07:00
Phil Culliton	e71d82ead9	Fix for GenerateZeroMat call in TestTiledMatMul PiperOrigin-RevId: 640180868	2024-06-04 09:32:23 -07:00
Zelalem Aweke	9e213b3d96	Use system topology to pin threads across clusters. PiperOrigin-RevId: 640151974	2024-06-04 07:50:32 -07:00
Jan Wassenberg	4f9155d8c6	Add bf16 matmul support, update naming+test Avoid int32, which can easily overflow for large matrices. Also fix IDE warning in sfp-inl. PiperOrigin-RevId: 640149845	2024-06-04 07:41:46 -07:00
Zoltan Szabadka	df01700b54	Move the backpropagation code to its own directory	2024-06-04 10:20:16 +00:00
Zoltan Szabadka	3b4fa4a0e3	Use HWY_EXPORT_AND_DYNAMIC_DISPATCH_T where possible.	2024-06-04 09:18:56 +00:00
Zoltan Szabadka	8567978541	Address review comments	2024-06-04 08:37:54 +00:00
Zoltan Szabadka	7e639856da	Fix compilation and tests for gcc	2024-06-04 08:37:54 +00:00
Zoltan Szabadka	36e4d8bbfe	Add first version of backpropagation support. This is still in progress / experimental, currently it is only implemented for normal gemma MQA attention layers, and no parallelism is added yet for backward pass. Since we need to remember all activations from all layers, the forward pass was also reimplemented with a new activation data structure.	2024-06-04 08:37:49 +00:00
Paul Chang	ed8f39c058	Refactor GemmaImpl dispatch to use Highway 1.2's HWY_DYNAMIC_DISPATCH_T PiperOrigin-RevId: 639793810	2024-06-03 08:32:29 -07:00
Jan Wassenberg	a44cbdadc2	Update to Highway 1.2 for topology/VQSelect Also fix unused-warning in compress-inl. PiperOrigin-RevId: 639116915	2024-05-31 12:29:10 -07:00
Paul Chang	5feacf120c	static_assert shape constraints in MatMul 4x4 PiperOrigin-RevId: 639069345	2024-05-31 10:02:45 -07:00
Phil Culliton	c616abe628	Unrolled / tiled 4x4 MatMul PiperOrigin-RevId: 638384686	2024-05-29 13:02:35 -07:00
Paul Chang	419dc34ed5	Generic MHA/MQA/GQA implementation PiperOrigin-RevId: 636937885	2024-05-24 09:05:53 -07:00
Zoltan Szabadka	542ad0973a	Fix normalization in Softmax function.	2024-05-24 08:58:31 +00:00
Apoorv Reddy	1aaf3b3aae	Documenting the RoPE implementation. PiperOrigin-RevId: 636175297	2024-05-22 08:26:29 -07:00
Apoorv Reddy	7f4b85d00b	Add MMLU eval to github PiperOrigin-RevId: 635495178	2024-05-20 10:20:53 -07:00
Paul Chang	82623bdc7f	Refer to --weights rather than --compressed_weights to simplify CLI docs PiperOrigin-RevId: 634391135	2024-05-16 07:51:49 -07:00
Apoorv Reddy	8e641eb4cd	Add TTFT to TimingInfo PiperOrigin-RevId: 634378994	2024-05-16 07:16:53 -07:00
Apoorv Reddy	eb0b96e0a8	Pass most runtime parameters using const RuntimeConfig& PiperOrigin-RevId: 633572507	2024-05-14 07:04:53 -07:00
Apoorv Reddy	f1eab987d8	Store tokens/sec in auxiliary struct TimingInfo. PiperOrigin-RevId: 633108908	2024-05-13 00:04:19 -07:00
Jan Wassenberg	22fe9809ac	Fix SVE build: add missing hn:: PiperOrigin-RevId: 632481097	2024-05-10 06:49:26 -07:00
Jan Wassenberg	c5c9fc300c	Enable even/odd for SFP. Refs #166 Disable it for float32 because there is not enough benefit. PiperOrigin-RevId: 631788326	2024-05-08 07:09:06 -07:00
Paul Chang	bacba351d4	Support additional scaling PiperOrigin-RevId: 631429113	2024-05-07 08:16:25 -07:00
Jan Wassenberg	f6d02b2870	Fix RecurrentGemma (refs #166) - one Dot was ignoring scale. Remove extra Dot() overload MatVecAdd always adds, use MatVecT<kAdd> if conditional. Remove unused MatVecAddLoop and MatVecLoop No longer tsan-verify even_odd PiperOrigin-RevId: 631377279	2024-05-07 04:40:42 -07:00
Copybara-Service	8ed22e52bf	Merge pull request #177 from szabadka:gemma2 PiperOrigin-RevId: 630388843	2024-05-03 07:52:27 -07:00
Zoltan Szabadka	19017fdb6d	Fix expression in DASSERT()	2024-05-03 13:54:20 +00:00
Phil Culliton	28ca001d5e	Matmul and test functions PiperOrigin-RevId: 630373984	2024-05-03 06:39:36 -07:00
Zoltan Szabadka	429eb78512	Remove unused vars.	2024-05-03 13:37:17 +00:00
Zoltan Szabadka	3d72f17261	Use more parallelism in attention block in prefill mode. Move the loop over the tokens inside the attention block and then create kHeads * num_tokens threads. This helps the multi-threaded speed only in case of the 2b gemma model, but to be consistent we move the loop over the tokens inside the griffin recurrent layer and the FFW layer as well. This is also a preparation for using the MatMul operation later. Benchmark results (summarization with 1600 tokens for prefill and essay writing with 500 tokens for generation): ``` Prefill speed Num threads BEFORE AFTER 32 61.76 t/s 65.08 t/s 64 89.46 t/s 98.62 t/s ```	2024-05-03 13:23:07 +00:00
Copybara-Service	6eeef2e2d9	Merge pull request #166 from samkaufman:deinterleave-vecs PiperOrigin-RevId: 630360778	2024-05-03 05:23:31 -07:00
Zoltan Szabadka	9a2682d544	Use more parallelism in the QKV projections of the MHA block. We compute all three projections with one MatVec and then copy the kv part to the cache. Benchmark results for 7b-it model that uses MHA blocks (summarization with 1600 tokens for prefill and essay writing with 500 tokens for generation): ``` Prefill speed Generation speed Num threads BEFORE AFTER BEFORE AFTER 32 13.75 t/s 14.80 t/s 9.22 t/s 9.77 t/s 64 19.89 t/s 24.83 t/s 12.46 t/s 13.66 t/s ```	2024-05-02 13:46:45 +00:00
Zoltan Szabadka	0afa480d90	Use more parallelism in the final output of the attention block. We use MatVec instead of MatVecLoop for the per-head dense layers, because we can parallelize more on the rows of the matrix than on the number of heads. This will be even more efficient after we rearrange the weights and can have a single MatVec operation. Benchmark results (summarization with 1600 tokens for prefill and essay writing with 500 tokens for generation): ``` Prefill speed Generation speed Num threads BEFORE AFTER BEFORE AFTER 32 58.24 t/s 61.79 t/s 32.11 t/s 32.62 t/s 64 83.62 t/s 92.00 t/s 41.10 t/s 41.80 t/s ```	2024-05-02 09:30:07 +00:00
Sam Kaufman	4a6173d929	Remove unused vars.	2024-05-02 00:41:44 -07:00
Sam Kaufman	564937ede6	Merge branch 'dev' into deinterleave-vecs	2024-04-30 16:23:04 -07:00
Sam Kaufman	2829ef17ad	Check for HWY_NATIVE_DOT_BF16.	2024-04-30 15:19:28 -07:00
Sam Kaufman	59ebecce22	Fix: specialized MatVecAdd was never called.	2024-04-30 15:17:27 -07:00
Jan Wassenberg	12fb2f05cf	Add per-thread even_odd storage for #166 . Also inline ProjQ and ProjKV lambdas, add missing includes/deps for ops_test. PiperOrigin-RevId: 629460608	2024-04-30 10:42:23 -07:00
Zoltan Szabadka	f8ccb8e37c	Fix kv offset computation for MHA config.	2024-04-30 16:19:14 +00:00
Zoltan Szabadka	afaca4efa8	Use more parallelism in the QKV projections in MQA mode. Instead of MatVecLoop, we use MatVec and we combine k and v into one 2 * kQKVDim long vector so that K and V projections can be combined into one MatVec operation. Benchmark results (summarization with 1600 tokens for prefill and essay writing with 500 tokens for generation): ``` Prefill speed Generation speed Num threads BEFORE AFTER BEFORE AFTER 4 9.81 t/s 9.96 t/s 8.39 t/s 8.46 t/s 18 31.50 t/s 36.67 t/s 23.10 t/s 25.83 t/s 32 45.36 t/s 58.91 t/s 27.60 t/s 31.25 t/s 64 57.72 t/s 80.64 t/s 35.40 t/s 39.76 t/s ```	2024-04-30 13:10:14 +00:00
Sam Kaufman	6a78a23f4c	Abstracted some MatVecAdd spec. dupes.	2024-04-29 16:23:38 -07:00
Sam Kaufman	f608337fef	Remove Bf16ToF32EO and use PromoteEvenTo and PromoteOddTo.	2024-04-29 14:13:07 -07:00
Sam Kaufman	aa0b113214	(VecT) to static_cast<VecT>.	2024-04-29 12:53:47 -07:00
Sam Kaufman	5cb63346aa	supports_eo -> kSupportsEvenOdd	2024-04-29 12:51:35 -07:00
Zoltan Szabadka	27117cc39f	Simplify threading: remove the use of inner_pool. We only used inner_pool in the prefill FFW function, and there we can achieve sufficient parallelism on the rows of the matrix-vector multiplications. Benchmark results on a 1600-token summarization task: ``` Prefill speed Num threads BEFORE AFTER 4 9.24 t/s 9.76 t/s 18 31.41 t/s 31.16 t/s 32 31.41 t/s 45.13 t/s 64 31.03 t/s 57.85 t/s ```	2024-04-29 16:07:30 +00:00
Paul Chang	1d18c5a129	Improve documentation for compress_weights flags PiperOrigin-RevId: 629053191	2024-04-29 06:49:50 -07:00
Sam Kaufman	0816a1070d	Even-odd layout MatVecs for bf16 weights.	2024-04-28 20:09:25 -07:00
Paul Chang	2d4de6b08b	Support absolute positional embeddings from vanilla transformer PiperOrigin-RevId: 628100831	2024-04-25 09:32:14 -07:00
Paul Chang	75eca87039	Simplify prefill early-exit (originally Merge #156 ) PiperOrigin-RevId: 627788524	2024-04-24 11:11:42 -07:00
Charles Chan	ea45d7c4d7	Use lambda to split function and Make stream_token can break prefill, too	2024-04-23 22:55:01 +08:00
Paul Chang	e8d29792ac	New token validity assertions, improve prompt truncation warning PiperOrigin-RevId: 627376194	2024-04-23 07:05:59 -07:00
Jan Wassenberg	3bf22abb22	Fix sign comparison warnings PiperOrigin-RevId: 627299902	2024-04-23 01:16:51 -07:00
Jan Wassenberg	e9a0caed87	Further improve IO, enable multiple backends without -D. Move Path into io.h and use for opening files. Removes dependency of gemma_lib on args. Separate Windows codepath instead of emulating POSIX functions. Plus lint fixes. PiperOrigin-RevId: 626279004	2024-04-19 00:40:29 -07:00
Paul Chang	38f1ea9b80	Eliminate redundant copies of TokenString() Move this function outside of HWY_NAMESPACE since it doesn't need to be optimized for any particular architecture. PiperOrigin-RevId: 626098641	2024-04-18 11:31:50 -07:00
Jan Wassenberg	a8ceb75f43	Improved IO abstraction layer Move to unique_ptr-like File class. Move `if OS_WIN` into wrapper functions. exists -> Exists. PiperOrigin-RevId: 625923056	2024-04-17 23:15:07 -07:00
Andrey Mikhaylov	4ef3da733a	Fixed minor things and added comments.	2024-04-12 15:39:16 +00:00
Andrey Mikhaylov	2c5706f159	Add comments regarding layers output usage.	2024-04-12 15:39:16 +00:00
Andrey Mikhaylov	03284d752e	Added layers output functionality to gemma and a binary debug_output to save the outputs to a json file.	2024-04-12 15:39:16 +00:00
RangerUFO	e541707caa	Rename the fields of Griffin weights	2024-04-10 21:04:31 +08:00
RangerUFO	4e960d67f6	Fix typos	2024-04-10 20:38:18 +08:00
RangerUFO	809bd0709d	Refactor data structures to reduce memory usage	2024-04-10 19:35:23 +08:00
Jan Wassenberg	881eeffe0a	Lint fixes: strcat, includes, arg naming PiperOrigin-RevId: 623435210	2024-04-10 03:12:41 -07:00
RangerUFO	2099b37732	Change `NumGemmaLayers` and `NumGriffinLayers` to constants in configs	2024-04-09 20:44:41 +08:00
Jan Wassenberg	a982ec1287	Move code to gemma/ so we can remove error-prone copybara: comments. Also fix includes and Lint warnings. PiperOrigin-RevId: 623127487	2024-04-09 04:45:42 -07:00

241 Commits