Commit Graph

48 Commits

Jan Wassenberg 85cac13fb1 Split up ops.h into ops/ops-inl and matmul-inl
PiperOrigin-RevId: 654068303
2024-07-19 11:21:48 -07:00
Daniel Keysers e87e65ca45 Add scale parameter to MatMul.
Add accessor to CompressedArray that asserts the scale is 1 and use it.

PiperOrigin-RevId: 653604840
2024-07-18 06:58:56 -07:00
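
A minimal sketch of what such a scale-1 accessor could look like; the actual CompressedArray interface in gemma.cpp may differ, and the names here are illustrative:

```
#include <cassert>
#include <cstddef>

// Hypothetical accessor: callers that cannot apply a scale get the raw data,
// but only after asserting that no scaling is actually required.
template <typename T, size_t kCapacity>
class CompressedArray {
 public:
  const T* WithScale1() const {
    assert(scale_ == 1.0f);  // callers must not silently drop the scale
    return data_;
  }
  float scale() const { return scale_; }

 private:
  T data_[kCapacity] = {};
  float scale_ = 1.0f;
};
```
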
Jan Wassenberg 992a2cbbc0 De-templatize Activations, add RowVectorBatch class
Also remove most kBatchSize args.

PiperOrigin-RevId: 653185525
2024-07-17 04:38:15 -07:00
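
The point of a class like RowVectorBatch is to carry the batch size at run time instead of threading a kBatchSize template parameter through every signature. A minimal sketch with illustrative names:

```
#include <cstddef>
#include <vector>

// Owns batch_size rows of len values each; replaces kBatchSize-templated
// storage with a runtime-sized allocation.
template <typename T>
class RowVectorBatch {
 public:
  RowVectorBatch(size_t batch_size, size_t len)
      : batch_size_(batch_size), len_(len), data_(batch_size * len) {}

  T* Batch(size_t i) { return data_.data() + i * len_; }  // row i
  size_t BatchSize() const { return batch_size_; }
  size_t Len() const { return len_; }

 private:
  size_t batch_size_;
  size_t len_;
  std::vector<T> data_;
};
```
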
Jan Wassenberg edaf61b983 SVE build fix: avoid capturing vectors directly.
Also prefer the V typedef over auto.

PiperOrigin-RevId: 651423685
2024-07-11 08:43:56 -07:00
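
For context: SVE vectors are sizeless types, so they cannot be stored in a lambda capture (captures become class members, and members need a size). An illustrative before/after, not the actual gemma.cpp code:

```
#include "hwy/highway.h"
namespace hn = hwy::HWY_NAMESPACE;

void Example() {
  const hn::ScalableTag<float> d;
  using V = hn::Vec<decltype(d)>;  // prefer the V typedef over auto
  const V mul = hn::Set(d, 2.0f);

  // Fails to compile on SVE targets: captures the sizeless vector `mul`.
  // auto scale = [mul](V x) { return hn::Mul(x, mul); };

  // OK: pass the vector as an argument instead of capturing it.
  auto scale = [](V x, V mul) { return hn::Mul(x, mul); };
  (void)scale;
  (void)mul;
}
```
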
Jan Wassenberg be765afce2 Simplify matmul: only 2 overloads
Also add StoreHorizontalSumsMaybeAdd wrapper function,
move MatMulSlowBatch into test.

1.02-1.06x speedup.

PiperOrigin-RevId: 651394791
2024-07-11 06:58:42 -07:00
Andrey Vlasov 3e92088595 Remove allocation from GEMM_4x4_Tile when decoding compressed weights by implementing SfpCodec::Dec2F and CompressTraits<T>::Decompress2 for all supported types (sketched after this entry). This also allows removing one of the specializations of GEMM_4x4_Tile, handling compressed MatB with one function. As before, even when MatA is bf16, computations use 32-bit registers.

Measurements for a 2b-it sfp-encoded model on an AMD Ryzen Threadripper PRO 3945WX 12-Cores:
baseline:
```
32.6254 prefill tokens / sec
8.91429 tokens / sec
115 milliseconds time to first token
```
this change:
```
54.3045 prefill tokens / sec
16.8191 tokens / sec
56 milliseconds time to first token
```
PiperOrigin-RevId: 651369694
2024-07-11 05:13:39 -07:00
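
The inner-loop idea described above: instead of decompressing a whole row of B into a heap-allocated scratch buffer, each step decodes just two register-width chunks and consumes them immediately, so no allocation is needed. The Decompress2 signature below is assumed for illustration, not taken from the actual code:

```
#include <cstddef>
#include "hwy/highway.h"
namespace hn = hwy::HWY_NAMESPACE;

// Dot product of one f32 row of A with one compressed row of B, decoding
// two vectors per iteration directly into registers.
template <typename Traits, typename Packed>
float DotCompressed(const float* HWY_RESTRICT a, const Packed* HWY_RESTRICT b,
                    size_t cols) {
  const hn::ScalableTag<float> df;
  using VF = hn::Vec<decltype(df)>;
  const size_t N = hn::Lanes(df);
  VF sum0 = hn::Zero(df), sum1 = hn::Zero(df);
  for (size_t c = 0; c < cols; c += 2 * N) {
    VF b0, b1;
    Traits::Decompress2(df, b, c, b0, b1);  // assumed API: decode 2 vectors
    sum0 = hn::MulAdd(hn::LoadU(df, a + c), b0, sum0);
    sum1 = hn::MulAdd(hn::LoadU(df, a + c + N), b1, sum1);
  }
  return hn::ReduceSum(df, hn::Add(sum0, sum1));
}
```
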
Jan Wassenberg e588a7f45d Add config for att/final cap, skip max-subtract. Fixes #278
Also update includes/deps for backprop/.

PiperOrigin-RevId: 648399222
2024-07-01 09:45:26 -07:00
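
For reference, attention/final-logit soft-capping squashes values into (-cap, cap). That bound is what makes the max-subtraction skippable: exp() of a capped value cannot overflow in the following softmax. A scalar sketch (cap values come from the per-model config):

```
#include <cmath>
#include <cstddef>

// Soft-cap: each output lies in (-cap, cap). The real code is vectorized.
void SoftCap(float* x, size_t n, float cap) {
  for (size_t i = 0; i < n; ++i) x[i] = cap * std::tanh(x[i] / cap);
}
```
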
Jan Wassenberg 48ebba8b7a Code cleanup
- Simplify template arg list, enable deduction
- missing hn:: on "Lanes"
- 1.0f suffix
- move RMSNormBatched into ops.h
- static constexpr -> constexpr
- concrete type instead of LayerT, WeightArrayT
- inline GetWeights
- remove if (runtime_config.verbosity
- merge AllocatePrefill and AllocateDecode
- remove bf_ffw_hidden

PiperOrigin-RevId: 644931277
2024-06-20 01:10:24 -07:00
Jan Wassenberg a07f60c9a1 1.15x 7b sfp prefill speedup: Matmul in attention
```
2b bf16:
prefill 114.456 -> 115.222
decode  16.8847 -> 16.9987

7b sfp:
prefill 18.8575 -> 21.7325
decode  5.68428 -> 5.79791
```

PiperOrigin-RevId: 644283676
2024-06-18 01:00:51 -07:00
Ray Smith e0afdfa8fb Added bias vector addition to MatMul
PiperOrigin-RevId: 643385381
2024-06-14 10:25:16 -07:00
Jan Wassenberg 29c0c574e6 Integrate matmul into FFW: 4.3x prefill speedup
```
before, bf16:
27.2929 prefill tokens / sec
17.2114 tokens / sec

after, bf16:
116.496 prefill tokens / sec
17.5391 tokens / sec
```

PiperOrigin-RevId: 643328437
2024-06-14 06:32:26 -07:00
Ray Smith 198326a682 Removed the now-redundant non-batch matmul
PiperOrigin-RevId: 643317187
2024-06-14 05:13:36 -07:00
Andrey Vlasov b17631c95f Implement a missing (bf16, f32) tiled MatMul kernel.
PiperOrigin-RevId: 643313676
2024-06-14 04:54:40 -07:00
Jan Wassenberg d3c6a45b59 Major duplicated code reduction in test/benchmarks
Helper functions to tokenize/wrap
Move LayersOutputFunc into RuntimeConfig
AcceptFunc passes the probability
Implement StringFromType using the parser, and verify results match

PiperOrigin-RevId: 643255119
2024-06-14 00:16:25 -07:00
Ray Smith ea525da967 Added MatMul_4x4_Batch, which is MatMul_4x4 with the first template arg moved to the first function arg so that the batch size (the number of A rows) can vary at run time.
PiperOrigin-RevId: 643017973
2024-06-13 09:05:40 -07:00
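
An illustrative before/after of that signature change (argument lists abbreviated; the real declarations carry more parameters):

```
#include <cstddef>

// Before: batch size (number of A rows) fixed at compile time.
template <size_t kBatchSize, typename TA, typename TB>
void MatMul_4x4(const TA* a, const TB* b, float* out);

// After: batch size is the first *function* argument, variable at run time.
template <typename TA, typename TB>
void MatMul_4x4_Batch(size_t batch_size, const TA* a, const TB* b, float* out);
```
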
Andrey Vlasov 38eb452b94 Support mixed (bf16, sfp) tiled MatMul. Same sfp-decompress strategy as in the (f32, sfp) tiled MatMul.

PiperOrigin-RevId: 642901844
2024-06-13 02:07:21 -07:00
The gemma.cpp Authors 2a0e6ee976 Fix numerical issue in Softcap by subtracting max.
Also update test threshold.

PiperOrigin-RevId: 642587468
2024-06-12 05:42:16 -07:00
The gemma.cpp Authors f467670de7 Implement float * SfpStream matmul by decompressing 4 * kColsA_RowsB-sized chunks of the second matrix.
PiperOrigin-RevId: 642533996
2024-06-12 01:11:59 -07:00
Jan Wassenberg 3e2396f98c Use Loader/AppArgs to construct the gemma_test model, simplify AcceptFunc
accept_token: allow a default, check whether it is empty before use
Allow mixing sample_func and stream_func; call the latter after the former.
Also fix missing includes/deps.
PiperOrigin-RevId: 642240012
2024-06-11 05:53:10 -07:00
Daniel Keysers c557ad23a8 Adds simple-loop versions of missing batched functions.
PiperOrigin-RevId: 642189741
2024-06-11 02:14:02 -07:00
Paul Chang 6c0be20fa6 Fix Softmax on SVE
PiperOrigin-RevId: 640947138
2024-06-06 10:39:30 -07:00
The gemma.cpp Authors 39d4115717 Implement mixed mode matmul: f32 * bf16
PiperOrigin-RevId: 640940962
2024-06-06 10:21:46 -07:00
Jan Wassenberg 57c2cd8b52 Simplifications: remove GemmaInterface and GemmaImpl
Split common and weights into separate lib
Remove common-inl (does not have to be SIMD code), activations.cc
Centralize switch(Model) to avoid duplication
Move CompressWeightsT to compress_weights.cc
Move LoadWeights to weights.cc

PiperOrigin-RevId: 640869202
2024-06-06 05:54:21 -07:00
Paul Chang 175e389c3c Revert to HWY_ASSERT for lane constraints, qualify hn::Add
PiperOrigin-RevId: 640193239
2024-06-04 10:10:18 -07:00
Jan Wassenberg 4f9155d8c6 Add bf16 matmul support, update naming+test
Avoid int32, which can easily overflow for large matrices.
Also fix IDE warning in sfp-inl.

PiperOrigin-RevId: 640149845
2024-06-04 07:41:46 -07:00
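
A concrete instance of the overflow risk mentioned above: flattened 2-D indices should use size_t, because row * cols can exceed INT32_MAX at moderate matrix sizes.

```
#include <cstddef>

// 60000 * 60000 = 3.6e9 > INT32_MAX (2147483647); int32 indexing would wrap.
inline size_t Index(size_t row, size_t col, size_t cols) {
  return row * cols + col;
}
```
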
Zoltan Szabadka 8567978541 Address review comments 2024-06-04 08:37:54 +00:00
Zoltan Szabadka 36e4d8bbfe Add first version of backpropagation support.
This is still in progress / experimental; currently it is only
implemented for normal Gemma MQA attention layers, and no
parallelism is added yet for the backward pass.

Since we need to remember all activations from all layers, the
forward pass was also reimplemented with a new activation data
structure.
2024-06-04 08:37:49 +00:00
Paul Chang 5feacf120c static_assert shape constraints in MatMul 4x4
PiperOrigin-RevId: 639069345
2024-05-31 10:02:45 -07:00
Phil Culliton c616abe628 Unrolled / tiled 4x4 MatMul
PiperOrigin-RevId: 638384686
2024-05-29 13:02:35 -07:00
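
The idea behind the 4x4 tiling: compute a 4x4 block of C per pass over k, so every loaded element of A and B is reused four times and the 16 partial sums stay in registers. A scalar sketch of one tile (the real kernel vectorizes across columns):

```
#include <cstddef>

// One 4x4 output tile of C = A * B. A is M x K, B is K x N, row-major.
void Tile4x4(const float* a, const float* b, float* c, size_t K, size_t N,
             size_t row, size_t col) {
  float acc[4][4] = {};  // 16 accumulators, kept in registers
  for (size_t k = 0; k < K; ++k) {
    for (size_t i = 0; i < 4; ++i) {
      const float av = a[(row + i) * K + k];  // reused for 4 columns
      for (size_t j = 0; j < 4; ++j) {
        acc[i][j] += av * b[k * N + col + j];
      }
    }
  }
  for (size_t i = 0; i < 4; ++i) {
    for (size_t j = 0; j < 4; ++j) c[(row + i) * N + col + j] = acc[i][j];
  }
}
```
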
Zoltan Szabadka 542ad0973a Fix normalization in Softmax function. 2024-05-24 08:58:31 +00:00
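
For reference, the standard form being fixed here: softmax must normalize by the sum of the exponentials. A numerically stable scalar version:

```
#include <algorithm>
#include <cmath>
#include <cstddef>

void Softmax(float* x, size_t n) {
  const float max = *std::max_element(x, x + n);
  float sum = 0.0f;
  for (size_t i = 0; i < n; ++i) sum += (x[i] = std::exp(x[i] - max));
  const float inv = 1.0f / sum;  // the normalization factor
  for (size_t i = 0; i < n; ++i) x[i] *= inv;
}
```
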
Apoorv Reddy 1aaf3b3aae Documenting the RoPE implementation.
PiperOrigin-RevId: 636175297
2024-05-22 08:26:29 -07:00
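
The implementation being documented is rotary position embeddings (RoPE). A scalar sketch of the usual Gemma-style formulation, in which feature i is paired with feature i + dim/2 and the pair is rotated by a position-dependent angle (the pairing convention and base 10000 are assumed here):

```
#include <cmath>
#include <cstddef>

void Rope(float* x, size_t dim, size_t pos) {
  const size_t half = dim / 2;
  for (size_t i = 0; i < half; ++i) {
    const float theta = pos * std::pow(10000.0f, -2.0f * i / dim);
    const float c = std::cos(theta), s = std::sin(theta);
    const float x0 = x[i], x1 = x[i + half];
    x[i] = x0 * c - x1 * s;  // rotate the (x0, x1) pair by theta
    x[i + half] = x0 * s + x1 * c;
  }
}
```
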
Jan Wassenberg 22fe9809ac Fix SVE build: add missing hn::
PiperOrigin-RevId: 632481097
2024-05-10 06:49:26 -07:00
Jan Wassenberg c5c9fc300c Enable even/odd for SFP. Refs #166
Disable it for float32 because there is not enough benefit.

PiperOrigin-RevId: 631788326
2024-05-08 07:09:06 -07:00
Jan Wassenberg f6d02b2870 Fix RecurrentGemma (refs #166) - one Dot was ignoring scale.
Remove extra Dot() overload
MatVecAdd always adds; use MatVecT<kAdd> if the add is conditional.
Remove unused MatVecAddLoop and MatVecLoop
No longer tsan-verify even_odd

PiperOrigin-RevId: 631377279
2024-05-07 04:40:42 -07:00
Phil Culliton 28ca001d5e Matmul and test functions
PiperOrigin-RevId: 630373984
2024-05-03 06:39:36 -07:00
Copybara-Service 6eeef2e2d9 Merge pull request #166 from samkaufman:deinterleave-vecs
PiperOrigin-RevId: 630360778
2024-05-03 05:23:31 -07:00
Zoltan Szabadka 9a2682d544 Use more parallelism in the QKV projections of the MHA block.
We compute all three projections with one MatVec and then copy
the kv part to the cache.

Benchmark results for the 7b-it model, which uses MHA blocks (summarization with
1600 tokens of prefill and essay writing with 500 tokens of generation):

```
                   Prefill speed                Generation speed
Num threads      BEFORE       AFTER            BEFORE       AFTER
32               13.75 t/s    14.80 t/s       9.22 t/s     9.77 t/s
64               19.89 t/s    24.83 t/s      12.46 t/s    13.66 t/s
```
2024-05-02 13:46:45 +00:00
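
A sketch of the change described above, with illustrative names: the Q, K and V weight matrices are stacked so that a single matrix-vector product yields all three projections, after which the K/V portion is copied into the cache. The scalar MatVec loop stands in for the project's parallelized implementation.

```
#include <algorithm>
#include <cstddef>

// w_qkv stacks W_q (q_dim rows) plus W_k and W_v (kv_dim rows each);
// one matrix-vector product replaces three.
void QKVProjection(const float* w_qkv, const float* x, size_t model_dim,
                   size_t q_dim, size_t kv_dim, float* qkv_out,
                   float* kv_cache_pos) {
  const size_t out_dim = q_dim + 2 * kv_dim;
  for (size_t r = 0; r < out_dim; ++r) {  // MatVec(w_qkv, x) -> qkv_out
    float sum = 0.0f;
    for (size_t c = 0; c < model_dim; ++c) {
      sum += w_qkv[r * model_dim + c] * x[c];
    }
    qkv_out[r] = sum;
  }
  // Copy the K and V parts into the KV cache for this position.
  std::copy(qkv_out + q_dim, qkv_out + out_dim, kv_cache_pos);
}
```
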
Sam Kaufman 4a6173d929 Remove unused vars. 2024-05-02 00:41:44 -07:00
Sam Kaufman 564937ede6 Merge branch 'dev' into deinterleave-vecs 2024-04-30 16:23:04 -07:00
Sam Kaufman 2829ef17ad Check for HWY_NATIVE_DOT_BF16. 2024-04-30 15:19:28 -07:00
Sam Kaufman 59ebecce22 Fix: specialized MatVecAdd was never called. 2024-04-30 15:17:27 -07:00
Jan Wassenberg 12fb2f05cf Add per-thread even_odd storage for #166.
Also inline ProjQ and ProjKV lambdas,
add missing includes/deps for ops_test.

PiperOrigin-RevId: 629460608
2024-04-30 10:42:23 -07:00
Sam Kaufman 6a78a23f4c Abstracted some MatVecAdd spec. dupes. 2024-04-29 16:23:38 -07:00
Sam Kaufman f608337fef Remove Bf16ToF32EO and use PromoteEvenTo and PromoteOddTo. 2024-04-29 14:13:07 -07:00
Sam Kaufman aa0b113214 (VecT*) to static_cast<VecT*>. 2024-04-29 12:53:47 -07:00
Sam Kaufman 5cb63346aa supports_eo -> kSupportsEvenOdd 2024-04-29 12:51:35 -07:00
Sam Kaufman 0816a1070d Even-odd layout MatVecs for bf16 weights. 2024-04-28 20:09:25 -07:00
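
The motivation for the even/odd ("deinterleaved") layout: PromoteEvenTo and PromoteOddTo widen the even- and odd-indexed bf16 lanes to f32 without any shuffle, so if the other operand is stored pre-deinterleaved to match, both halves feed straight into MulAdd. A sketch under assumed Highway usage, not the exact gemma.cpp kernel:

```
#include <cstddef>
#include "hwy/highway.h"
namespace hn = hwy::HWY_NAMESPACE;

// Dot product of bf16 weights with f32 activations stored in a matching
// even/odd layout (first N floats pair with the even lanes, next N with
// the odd lanes). Layout details here are illustrative.
float DotBF16EvenOdd(const hwy::bfloat16_t* w, const float* x_eo, size_t n) {
  const hn::ScalableTag<float> df;
  const hn::Repartition<hwy::bfloat16_t, decltype(df)> dbf;  // 2x the lanes
  using VF = hn::Vec<decltype(df)>;
  const size_t N = hn::Lanes(df);
  VF sum0 = hn::Zero(df), sum1 = hn::Zero(df);
  for (size_t i = 0; i < n; i += 2 * N) {
    const auto wv = hn::LoadU(dbf, w + i);    // 2*N bf16 weights
    const VF we = hn::PromoteEvenTo(df, wv);  // even lanes -> f32
    const VF wo = hn::PromoteOddTo(df, wv);   // odd lanes  -> f32
    sum0 = hn::MulAdd(we, hn::LoadU(df, x_eo + i), sum0);
    sum1 = hn::MulAdd(wo, hn::LoadU(df, x_eo + i + N), sum1);
  }
  return hn::ReduceSum(df, hn::Add(sum0, sum1));
}
```
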
Jan Wassenberg a982ec1287 Move code to gemma/ so we can remove error-prone copybara: comments.
Also fix includes and Lint warnings.

PiperOrigin-RevId: 623127487
2024-04-09 04:45:42 -07:00
Renamed from ops.h