This is still in progress / experimental: it is currently implemented
only for the standard Gemma MQA attention layers, and the backward
pass is not yet parallelized.
Because the backward pass needs the activations of every layer, the
forward pass was also reimplemented with a new activation data
structure.
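As a rough sketch of the idea (all type and field names below are hypothetical, not the actual gemma.cpp types), the new structure keeps one activation record per layer alive until the backward pass:
```
#include <cstddef>
#include <vector>

// One layer's saved activations for a single forward pass.
struct LayerActivations {
  std::vector<float> attention_out;  // output of the attention block
  std::vector<float> ffw_hidden;     // hidden state of the FFW block
};

// Activations for all layers and all tokens, kept until the backward pass.
struct ForwardPassActivations {
  ForwardPassActivations(size_t num_layers, size_t model_dim, size_t ffw_dim,
                         size_t num_tokens)
      : layers(num_layers) {
    for (LayerActivations& layer : layers) {
      layer.attention_out.resize(num_tokens * model_dim);
      layer.ffw_hidden.resize(num_tokens * ffw_dim);
    }
  }
  std::vector<LayerActivations> layers;
};
```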
Using a restricted Kaggle account, this code:
- Adds an Ubuntu 20.04 build (required for glibc compat with Kaggle infra)
- Uploads the ubuntu-20.04 build and supporting library to a Kaggle dataset using a fork of `push-kaggle-dataset`
- Creates a new version of a Kaggle notebook that loads artifacts from the Kaggle Model Hub, along with the newly updated dataset, and validates a 2b-it-sfp model.
- Runs the notebook and fails with an error if the process does not complete, raises an exception, or produces an invalid response.
TODO: add tests / capabilities to the smoke tests used by the notebook.
Remove extra Dot() overload
MatVecAdd now always adds; use MatVecT<kAdd> when the add is conditional (sketched below).
Remove unused MatVecAddLoop and MatVecLoop
No longer tsan-verify even_odd
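A sketch of the kAdd pattern (the signature here is an assumption for illustration, not the actual gemma.cpp declaration): a compile-time template flag chooses between overwriting and accumulating, so one implementation covers both behaviors:
```
#include <cstddef>

template <bool kAdd>
void MatVecT(const float* mat, const float* vec, size_t rows, size_t cols,
             float* out) {
  for (size_t r = 0; r < rows; ++r) {
    float dot = 0.0f;
    for (size_t c = 0; c < cols; ++c) {
      dot += mat[r * cols + c] * vec[c];
    }
    if constexpr (kAdd) {
      out[r] += dot;  // accumulate into the existing output
    } else {
      out[r] = dot;   // overwrite the output
    }
  }
}
```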
Move the loop over the tokens inside the attention block and
create kHeads * num_tokens parallel tasks (see the sketch below).
This improves multi-threaded speed only for the 2b Gemma model,
but for consistency we also move the token loop inside the
griffin recurrent layer and the FFW layer. This is also a
preparation for using the MatMul operation later.
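A hedged sketch of the flattened parallelism (AttendOneHead is a hypothetical stand-in for the per-head attention work; hwy::ThreadPool is the Highway thread pool): each (head, token) pair becomes one task, so the pool sees kHeads * num_tokens units of work instead of kHeads per token:
```
#include <cstddef>
#include <cstdint>
#include "hwy/contrib/thread_pool/thread_pool.h"

constexpr size_t kHeads = 8;  // example value; the real count is model-dependent

// Hypothetical per-(head, token) unit of attention work; stubbed out here.
void AttendOneHead(size_t head, size_t token) { (void)head; (void)token; }

void Attention(size_t num_tokens, hwy::ThreadPool& pool) {
  // One task per (head, token) pair instead of a serial outer token loop.
  pool.Run(0, kHeads * num_tokens, [](uint64_t task, size_t /*thread*/) {
    const size_t head = static_cast<size_t>(task % kHeads);
    const size_t token = static_cast<size_t>(task / kHeads);
    AttendOneHead(head, token);
  });
}
```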
Benchmark results (summarization with 1600 tokens for prefill
and essay writing with 500 tokens for generation):
```
              Prefill speed
Num threads   BEFORE      AFTER
32            61.76 t/s   65.08 t/s
64            89.46 t/s   98.62 t/s
```
We compute all three projections (q, k and v) with one MatVec and
then copy the kv part to the cache, as sketched below.
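A sketch under assumptions (the layout and names are illustrative, not the exact gemma.cpp code): the q, k and v weight matrices are stacked row-wise so a single matrix-vector product yields all three projections, after which the k/v rows are copied into the cache slot for the current token:
```
#include <cstddef>
#include <cstring>

// qkv_weights has (q_dim + 2 * kv_dim) rows of model_dim columns:
// the q, k and v projection matrices stacked on top of each other.
void QKVProjection(const float* qkv_weights, const float* x, size_t model_dim,
                   size_t q_dim, size_t kv_dim, float* qkv_out,
                   float* kv_cache_pos) {
  const size_t out_dim = q_dim + 2 * kv_dim;
  // One matrix-vector product covers all three projections.
  for (size_t r = 0; r < out_dim; ++r) {
    float dot = 0.0f;
    for (size_t c = 0; c < model_dim; ++c) {
      dot += qkv_weights[r * model_dim + c] * x[c];
    }
    qkv_out[r] = dot;
  }
  // Copy the k and v parts into the KV cache slot for this token.
  std::memcpy(kv_cache_pos, qkv_out + q_dim, 2 * kv_dim * sizeof(float));
}
```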
Benchmark results for the 7b-it model, which uses MHA blocks (summarization with
1600 tokens for prefill and essay writing with 500 tokens for generation):
```
              Prefill speed           Generation speed
Num threads   BEFORE      AFTER       BEFORE      AFTER
32            13.75 t/s   14.80 t/s    9.22 t/s    9.77 t/s
64            19.89 t/s   24.83 t/s   12.46 t/s   13.66 t/s
```
We use MatVec instead of MatVecLoop for the per-head dense layers
because the rows of the matrix offer more parallelism than the
number of heads. This will become even more efficient once we
rearrange the weights so that a single MatVec operation covers all
heads; see the sketch below.
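A sketch of the difference (names and signatures are assumptions for illustration): a row-parallel MatVec exposes all rows to the thread pool, far more parallelism than one task per head, and once the per-head weights are stacked into one matrix a single call covers every head:
```
#include <cstddef>
#include <cstdint>
#include "hwy/contrib/thread_pool/thread_pool.h"

// Row-parallel matrix-vector product: one pool task per output row.
void MatVecRows(const float* mat, const float* vec, size_t rows, size_t cols,
                float* out, hwy::ThreadPool& pool) {
  pool.Run(0, rows, [&](uint64_t r, size_t /*thread*/) {
    float dot = 0.0f;
    for (size_t c = 0; c < cols; ++c) {
      dot += mat[r * cols + c] * vec[c];
    }
    out[r] = dot;
  });
}

// All heads' projection weights stacked row-wise: (num_heads * out_dim) rows.
// One call parallelizes over every row of every head at once, instead of
// offering the pool only num_heads units of work.
void PerHeadDense(const float* stacked_weights, const float* x,
                  size_t num_heads, size_t out_dim, size_t in_dim, float* out,
                  hwy::ThreadPool& pool) {
  MatVecRows(stacked_weights, x, num_heads * out_dim, in_dim, out, pool);
}
```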
Benchmark results (summarization with 1600 tokens for prefill
and essay writing with 500 tokens for generation):
```
              Prefill speed           Generation speed
Num threads   BEFORE      AFTER       BEFORE      AFTER
32            58.24 t/s   61.79 t/s   32.11 t/s   32.62 t/s
64            83.62 t/s   92.00 t/s   41.10 t/s   41.80 t/s
```