gemma.cpp

Commit Graph

Author	SHA1	Message	Date
Zoltan Szabadka	9a2682d544	Use more parallelism in the QKV projections of the MHA block. We compute all three projections with one MatVec and then copy the kv part to the cache. Benchmark results for 7b-it model that uses MHA blocks (summarization with 1600 tokens for prefill and essay writing with 500 tokens for generation), before -> after: 32 threads: prefill 13.75 -> 14.80 t/s, generation 9.22 -> 9.77 t/s; 64 threads: prefill 19.89 -> 24.83 t/s, generation 12.46 -> 13.66 t/s.	2024-05-02 13:46:45 +00:00
Copybara-Service	bafb8382f8	Merge pull request #175 from szabadka:gemma2 PiperOrigin-RevId: 630044058	2024-05-02 06:27:15 -07:00
Zoltan Szabadka	0afa480d90	Use more parallelism in the final output of the attention block. We use MatVec instead of MatVecLoop for the per-head dense layers, because we can parallelize more on the rows of the matrix than on the number of heads. This will be even more efficient after we rearrange the weights and can have a single MatVec operation. Benchmark results (summarization with 1600 tokens for prefill and essay writing with 500 tokens for generation), before -> after: 32 threads: prefill 58.24 -> 61.79 t/s, generation 32.11 -> 32.62 t/s; 64 threads: prefill 83.62 -> 92.00 t/s, generation 41.10 -> 41.80 t/s.	2024-05-02 09:30:07 +00:00
Jan Wassenberg	12fb2f05cf	Add per-thread even_odd storage for #166. Also inline ProjQ and ProjKV lambdas, add missing includes/deps for ops_test. PiperOrigin-RevId: 629460608	2024-04-30 10:42:23 -07:00
Copybara-Service	8f04a8346d	Merge pull request #172 from szabadka:gemma2 PiperOrigin-RevId: 629438917	2024-04-30 09:33:38 -07:00
Zoltan Szabadka	f8ccb8e37c	Fix kv offset computation for MHA config.	2024-04-30 16:19:14 +00:00
Copybara-Service	374fd7478a	Merge pull request #170 from szabadka:gemma2 PiperOrigin-RevId: 629408279	2024-04-30 07:40:30 -07:00
Zoltan Szabadka	afaca4efa8	Use more parallelism in the QKV projections in MQA mode. Instead of MatVecLoop, we use MatVec and we combine k and v into one 2 * kQKVDim long vector so that K and V projections can be combined into one MatVec operation. Benchmark results (summarization with 1600 tokens for prefill and essay writing with 500 tokens for generation), before -> after: 4 threads: prefill 9.81 -> 9.96 t/s, generation 8.39 -> 8.46 t/s; 18 threads: prefill 31.50 -> 36.67 t/s, generation 23.10 -> 25.83 t/s; 32 threads: prefill 45.36 -> 58.91 t/s, generation 27.60 -> 31.25 t/s; 64 threads: prefill 57.72 -> 80.64 t/s, generation 35.40 -> 39.76 t/s.	2024-04-30 13:10:14 +00:00
Copybara-Service	befe9fb07e	Merge pull request #167 from szabadka:gemma2 PiperOrigin-RevId: 629325219	2024-04-30 01:00:37 -07:00
Zoltan Szabadka	27117cc39f	Simplify threading: remove the use of inner_pool. We only used inner_pool in the prefill FFW function, and there we can achieve sufficient parallelism on the rows of the matrix-vector multiplications. Benchmark results on a 1600-token summarization task, prefill speed before -> after: 4 threads: 9.24 -> 9.76 t/s; 18 threads: 31.41 -> 31.16 t/s; 32 threads: 31.41 -> 45.13 t/s; 64 threads: 31.03 -> 57.85 t/s.	2024-04-29 16:07:30 +00:00
Paul Chang	1d18c5a129	Improve documentation for compress_weights flags PiperOrigin-RevId: 629053191	2024-04-29 06:49:50 -07:00
Jan Wassenberg	7a12e29027	Add error-checking for py binding, add missing include+hwasan check PiperOrigin-RevId: 628453112	2024-04-26 10:59:41 -07:00
Paul Chang	e8f59bb411	Fix underflow in NUQ ClusterCost() PiperOrigin-RevId: 628137904	2024-04-25 11:28:51 -07:00
Phil Culliton	9e0ac5de34	Update Clif wrapper to work with latest gemma.cpp and add simple example PiperOrigin-RevId: 628134201	2024-04-25 11:17:16 -07:00
Paul Chang	2d4de6b08b	Support absolute positional embeddings from vanilla transformer PiperOrigin-RevId: 628100831	2024-04-25 09:32:14 -07:00
Paul Chang	75eca87039	Simplify prefill early-exit (originally Merge #156) PiperOrigin-RevId: 627788524	2024-04-24 11:11:42 -07:00
Copybara-Service	b27d8d6b92	Merge pull request #156 from zeerd:dev PiperOrigin-RevId: 627706909	2024-04-24 06:19:14 -07:00
Charles Chan	ea45d7c4d7	Use lambda to split function and make stream_token able to break prefill, too	2024-04-23 22:55:01 +08:00
Paul Chang	e8d29792ac	New token validity assertions, improve prompt truncation warning PiperOrigin-RevId: 627376194	2024-04-23 07:05:59 -07:00
Jan Wassenberg	3bf22abb22	Fix sign comparison warnings PiperOrigin-RevId: 627299902	2024-04-23 01:16:51 -07:00
Jan Wassenberg	ca971ef50f	Document weight conversion PiperOrigin-RevId: 626957718	2024-04-22 01:58:30 -07:00
Jan Wassenberg	e9a0caed87	Further improve IO, enable multiple backends without -D. Move Path into io.h and use for opening files. Removes dependency of gemma_lib on args. Separate Windows codepath instead of emulating POSIX functions. Plus lint fixes. PiperOrigin-RevId: 626279004	2024-04-19 00:40:29 -07:00
Paul Chang	38f1ea9b80	Eliminate redundant copies of TokenString(). Move this function outside of HWY_NAMESPACE since it doesn't need to be optimized for any particular architecture. PiperOrigin-RevId: 626098641	2024-04-18 11:31:50 -07:00
Jan Wassenberg	a8ceb75f43	Improved IO abstraction layer. Move to unique_ptr-like File class. Move `if OS_WIN` into wrapper functions. exists -> Exists. PiperOrigin-RevId: 625923056	2024-04-17 23:15:07 -07:00
Jan Wassenberg	a939b5fc9f	Update distortion.h to weighted average, add distortion_test. More thorough checks in sfp_test and nuq_test. nuq_test: use deterministic input generator. PiperOrigin-RevId: 625602019	2024-04-17 01:44:19 -07:00
Copybara-Service	05e7e2b2bb	Merge pull request #145 from atorero:dev PiperOrigin-RevId: 624221085	2024-04-12 10:27:18 -07:00
Andrey Mikhaylov	4ef3da733a	Fixed minor things and added comments.	2024-04-12 15:39:16 +00:00
Andrey Mikhaylov	2c5706f159	Add comments regarding layers output usage.	2024-04-12 15:39:16 +00:00
Andrey Mikhaylov	03284d752e	Added layers output functionality to gemma and a binary debug_output to save the outputs to a json file.	2024-04-12 15:39:16 +00:00
Copybara-Service	342e998cb6	Merge pull request #142 from ufownl:refactor/data_structures PiperOrigin-RevId: 623503486	2024-04-10 08:35:18 -07:00
RangerUFO	e541707caa	Rename the fields of Griffin weights	2024-04-10 21:04:31 +08:00
RangerUFO	4e960d67f6	Fix typos	2024-04-10 20:38:18 +08:00
RangerUFO	809bd0709d	Refactor data structures to reduce memory usage	2024-04-10 19:35:23 +08:00
Jan Wassenberg	54120a5571	Mention Makefile contributed by @jart PiperOrigin-RevId: 623436818	2024-04-10 03:21:10 -07:00
Jan Wassenberg	881eeffe0a	Lint fixes: strcat, includes, arg naming PiperOrigin-RevId: 623435210	2024-04-10 03:12:41 -07:00
Copybara-Service	da91f4c4be	Merge pull request #137 from zond:main PiperOrigin-RevId: 623255639	2024-04-09 12:57:57 -07:00
Copybara-Service	827fec1904	Merge pull request #139 from ufownl:feature/public_layers PiperOrigin-RevId: 623254705	2024-04-09 12:54:23 -07:00
RangerUFO	2099b37732	Change `NumGemmaLayers` and `NumGriffinLayers` to constants in configs	2024-04-09 20:44:41 +08:00
Jan Wassenberg	a982ec1287	Move code to gemma/ so we can remove error-prone copybara: comments. Also fix includes and Lint warnings. PiperOrigin-RevId: 623127487	2024-04-09 04:45:42 -07:00
zond	9ca662dc14	Clarified README. Made it more visible that the recurrent weights are at a different Kaggle page.	2024-04-09 09:58:47 +02:00
Copybara-Service	83dd08ac87	Merge pull request #136 from pculliton:griffin PiperOrigin-RevId: 623054233	2024-04-08 22:29:24 -07:00
Luca Versari	9c3f969405	Implement the Griffin model. Also implement support for some model variations: - Local attention. - Add support for biases. - Use RoPE only on half vectors. - Support different order of QKV weights. Co-authored-by: Andrey Mikhaylov <amik@google.com> Co-authored-by: Martin Bruse <zondolfin@gmail.com> Co-authored-by: Zoltan Szabadka <szabadka@google.com>	2024-04-08 21:45:54 +02:00
Jan Wassenberg	4326249d0a	Fix includes PiperOrigin-RevId: 622456877	2024-04-06 09:27:09 -07:00
Jan Wassenberg	a3a0f78fda	Merge pull request #131 from veluca93:benchmark-and-test PiperOrigin-RevId: 622452794	2024-04-06 18:06:03 +02:00
Jan Wassenberg	9e51a91cfc	Faster bazel builds by only building all local targets. PiperOrigin-RevId: 622442126	2024-04-06 18:05:49 +02:00
Luca Versari	5862d1f995	Add a benchmark and additional tests. Also add a script to help running sanitizer builds, and do some cleanup. Co-authored-by: Andrey Mikhaylov <amik@google.com> Co-authored-by: Eugene Kliuchnikov <eustas@google.com> Co-authored-by: Sami Boukortt <sboukortt@google.com> Co-authored-by: Zoltan Szabadka <szabadka@google.com>	2024-04-06 12:54:52 +02:00
Jan Wassenberg	d852cf5089	Remove unused includes PiperOrigin-RevId: 622412150	2024-04-06 03:13:43 -07:00
Copybara-Service	325ef06cf9	Merge pull request #130 from veluca93:weight-handling PiperOrigin-RevId: 622405491	2024-04-06 02:22:00 -07:00
Luca Versari	4c23932289	Improve weight handling. - Allow scaling of SFP weights - Allow using uncompressed weights - Do not try to compress weights in the main model calls - Reduce code duplication in weight handling with some macros Co-authored-by: Eugene Kliuchnikov <eustas@google.com> Co-authored-by: Thomas Fischbacher <tfish@google.com> Co-authored-by: Zoltan Szabadka <szabadka@google.com>	2024-04-06 11:08:47 +02:00
Copybara-Service	280b8cb8a1	Merge pull request #129 from veluca93:more-ops PiperOrigin-RevId: 622145499	2024-04-05 05:02:00 -07:00

222 Commits, All Branches