gemma.cpp

Commit Graph

Author	SHA1	Message	Date
Zoltan Szabadka	3d72f17261	Use more parallelism in attention block in prefill mode. Move the loop over the tokens inside the attention block and then create kHeads * num_tokens threads. This helps the multi-threaded speed only in case of the 2b gemma model, but to be consistent we move the loop over the tokens inside the griffin recurrent layer and the FFW layer as well. This is also a preparation for using the MatMul operation later. Benchmark results (summarization with 1600 tokens for prefill and essay writing with 500 tokens for generation), prefill speed BEFORE -> AFTER: 32 threads: 61.76 -> 65.08 t/s; 64 threads: 89.46 -> 98.62 t/s.	2024-05-03 13:23:07 +00:00
Copybara-Service	6eeef2e2d9	Merge pull request #166 from samkaufman:deinterleave-vecs PiperOrigin-RevId: 630360778	2024-05-03 05:23:31 -07:00
Copybara-Service	2a71333c8a	Merge pull request #176 from szabadka:gemma3 PiperOrigin-RevId: 630131001	2024-05-02 11:41:05 -07:00
Zoltan Szabadka	9a2682d544	Use more parallelism in the QKV projections of the MHA block. We compute all three projections with one MatVec and then copy the kv part to the cache. Benchmark results for 7b-it model that uses MHA blocks (summarization with 1600 tokens for prefill and essay writing with 500 tokens for generation), prefill speed BEFORE -> AFTER: 32 threads: 13.75 -> 14.80 t/s; 64 threads: 19.89 -> 24.83 t/s. Generation speed BEFORE -> AFTER: 32 threads: 9.22 -> 9.77 t/s; 64 threads: 12.46 -> 13.66 t/s.	2024-05-02 13:46:45 +00:00
Copybara-Service	bafb8382f8	Merge pull request #175 from szabadka:gemma2 PiperOrigin-RevId: 630044058	2024-05-02 06:27:15 -07:00
Zoltan Szabadka	0afa480d90	Use more parallelism in the final output of the attention block. We use MatVec instead of MatVecLoop for the per-head dense layers, because we can parallelize more on the rows of the matrix than on the number of heads. This will be even more efficient after we rearrange the weights and can have a single MatVec operation. Benchmark results (summarization with 1600 tokens for prefill and essay writing with 500 tokens for generation), prefill speed BEFORE -> AFTER: 32 threads: 58.24 -> 61.79 t/s; 64 threads: 83.62 -> 92.00 t/s. Generation speed BEFORE -> AFTER: 32 threads: 32.11 -> 32.62 t/s; 64 threads: 41.10 -> 41.80 t/s.	2024-05-02 09:30:07 +00:00
Sam Kaufman	4a6173d929	Remove unused vars.	2024-05-02 00:41:44 -07:00
Sam Kaufman	564937ede6	Merge branch 'dev' into deinterleave-vecs	2024-04-30 16:23:04 -07:00
Sam Kaufman	2829ef17ad	Check for HWY_NATIVE_DOT_BF16.	2024-04-30 15:19:28 -07:00
Sam Kaufman	59ebecce22	Fix: specialized MatVecAdd was never called.	2024-04-30 15:17:27 -07:00
Jan Wassenberg	12fb2f05cf	Add per-thread even_odd storage for #166. Also inline ProjQ and ProjKV lambdas, add missing includes/deps for ops_test. PiperOrigin-RevId: 629460608	2024-04-30 10:42:23 -07:00
Copybara-Service	8f04a8346d	Merge pull request #172 from szabadka:gemma2 PiperOrigin-RevId: 629438917	2024-04-30 09:33:38 -07:00
Zoltan Szabadka	f8ccb8e37c	Fix kv offset computation for MHA config.	2024-04-30 16:19:14 +00:00
Copybara-Service	374fd7478a	Merge pull request #170 from szabadka:gemma2 PiperOrigin-RevId: 629408279	2024-04-30 07:40:30 -07:00
Zoltan Szabadka	afaca4efa8	Use more parallelism in the QKV projections in MQA mode. Instead of MatVecLoop, we use MatVec and we combine k and v into one 2 * kQKVDim long vector so that K and V projections can be combined into one MatVec operation. Benchmark results (summarization with 1600 tokens for prefill and essay writing with 500 tokens for generation), prefill speed BEFORE -> AFTER: 4 threads: 9.81 -> 9.96 t/s; 18 threads: 31.50 -> 36.67 t/s; 32 threads: 45.36 -> 58.91 t/s; 64 threads: 57.72 -> 80.64 t/s. Generation speed BEFORE -> AFTER: 4 threads: 8.39 -> 8.46 t/s; 18 threads: 23.10 -> 25.83 t/s; 32 threads: 27.60 -> 31.25 t/s; 64 threads: 35.40 -> 39.76 t/s.	2024-04-30 13:10:14 +00:00
Copybara-Service	befe9fb07e	Merge pull request #167 from szabadka:gemma2 PiperOrigin-RevId: 629325219	2024-04-30 01:00:37 -07:00
Sam Kaufman	6a78a23f4c	Abstracted some MatVecAdd spec. dupes.	2024-04-29 16:23:38 -07:00
Sam Kaufman	f608337fef	Remove Bf16ToF32EO and use PromoteEvenTo and PromoteOddTo.	2024-04-29 14:13:07 -07:00
Sam Kaufman	aa0b113214	(VecT) to static_cast<VecT>.	2024-04-29 12:53:47 -07:00
Sam Kaufman	5cb63346aa	supports_eo -> kSupportsEvenOdd	2024-04-29 12:51:35 -07:00
Zoltan Szabadka	27117cc39f	Simplify threading: remove the use of inner_pool. We only used inner_pool in the prefill FFW function, and there we can achieve sufficient parallelism on the rows of the matrix-vector multiplications. Benchmark results on a 1600-token summarization task, prefill speed BEFORE -> AFTER: 4 threads: 9.24 -> 9.76 t/s; 18 threads: 31.41 -> 31.16 t/s; 32 threads: 31.41 -> 45.13 t/s; 64 threads: 31.03 -> 57.85 t/s.	2024-04-29 16:07:30 +00:00
Paul Chang	1d18c5a129	Improve documentation for compress_weights flags PiperOrigin-RevId: 629053191	2024-04-29 06:49:50 -07:00
Sam Kaufman	0816a1070d	Even-odd layout MatVecs for bf16 weights.	2024-04-28 20:09:25 -07:00
Jan Wassenberg	7a12e29027	Add error-checking for py binding, add missing include+hwasan check PiperOrigin-RevId: 628453112	2024-04-26 10:59:41 -07:00
Paul Chang	e8f59bb411	Fix underflow in NUQ ClusterCost() PiperOrigin-RevId: 628137904	2024-04-25 11:28:51 -07:00
Phil Culliton	9e0ac5de34	Update Clif wrapper to work with latest gemma.cpp and add simple example PiperOrigin-RevId: 628134201	2024-04-25 11:17:16 -07:00
Paul Chang	2d4de6b08b	Support absolute positional embeddings from vanilla transformer PiperOrigin-RevId: 628100831	2024-04-25 09:32:14 -07:00
Paul Chang	75eca87039	Simplify prefill early-exit (originally Merge #156) PiperOrigin-RevId: 627788524	2024-04-24 11:11:42 -07:00
Copybara-Service	b27d8d6b92	Merge pull request #156 from zeerd:dev PiperOrigin-RevId: 627706909	2024-04-24 06:19:14 -07:00
Charles Chan	ea45d7c4d7	Use lambda to split function and Make stream_token can break prefill, too	2024-04-23 22:55:01 +08:00
Paul Chang	e8d29792ac	New token validity assertions, improve prompt truncation warning PiperOrigin-RevId: 627376194	2024-04-23 07:05:59 -07:00
Jan Wassenberg	3bf22abb22	Fix sign comparison warnings PiperOrigin-RevId: 627299902	2024-04-23 01:16:51 -07:00
Jan Wassenberg	ca971ef50f	Document weight conversion PiperOrigin-RevId: 626957718	2024-04-22 01:58:30 -07:00
Jan Wassenberg	e9a0caed87	Further improve IO, enable multiple backends without -D. Move Path into io.h and use for opening files. Removes dependency of gemma_lib on args. Separate Windows codepath instead of emulating POSIX functions. Plus lint fixes. PiperOrigin-RevId: 626279004	2024-04-19 00:40:29 -07:00
Paul Chang	38f1ea9b80	Eliminate redundant copies of TokenString() Move this function outside of HWY_NAMESPACE since it doesn't need to be optimized for any particular architecture. PiperOrigin-RevId: 626098641	2024-04-18 11:31:50 -07:00
Jan Wassenberg	a8ceb75f43	Improved IO abstraction layer Move to unique_ptr-like File class. Move `if OS_WIN` into wrapper functions. exists -> Exists. PiperOrigin-RevId: 625923056	2024-04-17 23:15:07 -07:00
Jan Wassenberg	a939b5fc9f	Update distortion.h to weighted average, add distortion_test. More thorough checks in sfp_test and nuq_test. nuq_test: use deterministic input generator. PiperOrigin-RevId: 625602019	2024-04-17 01:44:19 -07:00
Copybara-Service	05e7e2b2bb	Merge pull request #145 from atorero:dev PiperOrigin-RevId: 624221085	2024-04-12 10:27:18 -07:00
Andrey Mikhaylov	4ef3da733a	Fixed minor things and added comments.	2024-04-12 15:39:16 +00:00
Andrey Mikhaylov	2c5706f159	Add comments regarding layers output usage.	2024-04-12 15:39:16 +00:00
Andrey Mikhaylov	03284d752e	Added layers output functionality to gemma and a binary debug_output to save the outputs to a json file.	2024-04-12 15:39:16 +00:00
Copybara-Service	342e998cb6	Merge pull request #142 from ufownl:refactor/data_structures PiperOrigin-RevId: 623503486	2024-04-10 08:35:18 -07:00
RangerUFO	e541707caa	Rename the fields of Griffin weights	2024-04-10 21:04:31 +08:00
RangerUFO	4e960d67f6	Fix typos	2024-04-10 20:38:18 +08:00
RangerUFO	809bd0709d	Refactor data structures to reduce memory usage	2024-04-10 19:35:23 +08:00
Jan Wassenberg	54120a5571	Mention Makefile contributed by @jart PiperOrigin-RevId: 623436818	2024-04-10 03:21:10 -07:00
Jan Wassenberg	881eeffe0a	Lint fixes: strcat, includes, arg naming PiperOrigin-RevId: 623435210	2024-04-10 03:12:41 -07:00
Copybara-Service	da91f4c4be	Merge pull request #137 from zond:main PiperOrigin-RevId: 623255639	2024-04-09 12:57:57 -07:00
Copybara-Service	827fec1904	Merge pull request #139 from ufownl:feature/public_layers PiperOrigin-RevId: 623254705	2024-04-09 12:54:23 -07:00
RangerUFO	2099b37732	Change `NumGemmaLayers` and `NumGriffinLayers` to constants in configs	2024-04-09 20:44:41 +08:00


434 Commits