Commit Graph

83 Commits

Author SHA1 Message Date
RangerUFO ed88115e6a Fix compilation error of the weights compression tool 2024-10-11 18:55:06 +08:00
Jan Wassenberg 6ab3ff5bde Minor cleanup, Windows+Bazel build fixes
add app.h comment
compress-inl: remove unused typedef
gemma-inl: add missing HWY_ATTR and cast
separate sum-inl.h and basics.h headers
replace more hwy::bfloat16_t with BF16
update include pragmas
update dot_test thresholds
update Highway version in Bazel for HWY_RCAST_ALIGNED fix
PiperOrigin-RevId: 684464326
2024-10-10 09:05:06 -07:00
Ray Smith 85958f5fd3 Added MatPtr/MatPtrT/MatStorageT/MatStorage as a dynamically-sized replacement for CompressedArray.
Definition of array size is moved to the constructor.
Allocation is separate and parallelized.
All users of weights_raw.h migrated to CompressedWeights and weights_raw.h deleted.
Replaced all previous ForEachTensor functions with a single unified function.

PiperOrigin-RevId: 684451604
2024-10-10 08:22:30 -07:00
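A minimal sketch of the pattern the commit above describes, with hypothetical names rather than the actual MatPtr/MatStorage API: the shape is fixed in the constructor, while allocation is a separate step that can be parallelized across tensors.

```
#include <cstddef>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a dynamically-sized tensor: the shape is set
// at construction, but the (large) allocation is deferred so that many
// tensors can be allocated in parallel, e.g. one pool task per tensor.
class DynMat {
 public:
  DynMat(std::string name, size_t rows, size_t cols)
      : name_(std::move(name)), rows_(rows), cols_(cols) {}

  void Allocate() { data_ = std::make_unique<float[]>(rows_ * cols_); }

  float* data() { return data_.get(); }
  size_t rows() const { return rows_; }
  size_t cols() const { return cols_; }

 private:
  std::string name_;
  size_t rows_, cols_;
  std::unique_ptr<float[]> data_;
};
```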
Jan Wassenberg bd53b0f7c3 Fix MSAN issue for multiturn. Rewind the prior EOS token.
Also move MaybeCheckInitialized to allocator.h

PiperOrigin-RevId: 683187458
2024-10-07 08:07:54 -07:00
Jan Wassenberg 5a71d819cb Also enable f64 dot/sum for <f32 inputs
Add bf16 support to Dot/SumKernelDouble in the same way as *Compensated.

PiperOrigin-RevId: 682308683
2024-10-04 07:12:10 -07:00
Jan Wassenberg 5e812f07f5 Use f64 Dot and sum in softmax - faster than Cascaded
Also let the kernel specify the Raw and State types,
rename WeightT/VecT -> WT/VT.

PiperOrigin-RevId: 680464427
2024-09-30 01:22:09 -07:00
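A scalar sketch of the numerics in the commit above (the real code uses vectorized f64 Dot/Sum kernels, not this loop): accumulate the sum of exponentials in double, then normalize.

```
#include <algorithm>
#include <cmath>
#include <vector>

// Softmax with an f64 accumulator. Subtracting the max keeps exp() in
// range; the double-precision sum is what replaced cascaded summation.
void Softmax(std::vector<float>& x) {
  const float max = *std::max_element(x.begin(), x.end());
  double sum = 0.0;  // f64 accumulation
  for (float& v : x) {
    v = std::exp(v - max);
    sum += static_cast<double>(v);
  }
  const float inv = static_cast<float>(1.0 / sum);
  for (float& v : x) v *= inv;
}
```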
Jan Wassenberg 47eb80a90e Add double-precision dot variant
PiperOrigin-RevId: 679243590
2024-09-26 12:09:10 -07:00
Daniel Keysers f8835fe4a4 Add support for PaliGemma Vision-LM (224x224) to gemma.cpp
See https://arxiv.org/abs/2407.07726 for a description of the model.
Because PaliGemma operates as a prefix-LM on the image+prompt, add support for prefix-LM attention as well.

PiperOrigin-RevId: 677841119
2024-09-23 10:09:38 -07:00
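For context, a sketch of the usual prefix-LM attention rule (assumed semantics, not gemma.cpp's actual implementation): positions in the image+prompt prefix attend to the whole prefix bidirectionally, while later tokens remain causal.

```
#include <cstddef>

// Hypothetical mask predicate: query position q may attend to key
// position k if both are inside the prefix (bidirectional) or if the
// normal causal rule k <= q holds.
bool CanAttend(size_t q, size_t k, size_t prefix_len) {
  if (q < prefix_len && k < prefix_len) return true;
  return k <= q;
}
```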
Jan Wassenberg cdbfebb10f Fix compress-inl bf16->f32 overrun
Caught by Arm hwasan but not x86 asan.

PiperOrigin-RevId: 677779421
2024-09-23 07:10:25 -07:00
Jan Wassenberg 35fdf848c7 Cascaded summation for Softmax
This can affect generation results after a few hundred tokens.

Also remove profiler from DecompressAndCall, use Add instead of +=,
use PackedSpan for args and remove alignment requirement.
Changing accumulation order in AssimilateCascadedSums updates dot_test thresholds.

PiperOrigin-RevId: 676891797
2024-09-20 10:31:23 -07:00
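A scalar stand-in for the cascaded summation above, using Neumaier's compensated sum (the vectorized version keeps several accumulator/compensation pairs, and its accumulation order is why the dot_test thresholds changed):

```
#include <cmath>
#include <cstddef>

// Compensated sum: 'comp' recovers the rounding error of each addition,
// so long inputs (hundreds of tokens' worth of terms) stay accurate.
float CompensatedSum(const float* x, size_t n) {
  float sum = 0.0f, comp = 0.0f;
  for (size_t i = 0; i < n; ++i) {
    const float t = sum + x[i];
    if (std::fabs(sum) >= std::fabs(x[i])) {
      comp += (sum - t) + x[i];  // low bits of x[i] were lost
    } else {
      comp += (x[i] - t) + sum;  // low bits of sum were lost
    }
    sum = t;
  }
  return sum + comp;
}
```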
Copybara-Service 09bc8d62cc Merge pull request #380 from ufownl:bugfix/threading
PiperOrigin-RevId: 676799495
2024-09-20 04:52:48 -07:00
Daniel Keysers 1c8ddcdffe Add insert_float() to SbsWriter to store a float array directly.
PiperOrigin-RevId: 673982528
2024-09-12 13:27:24 -07:00
Jan Wassenberg 13a9f76f64 Fix mismatch between blob_store and compress interfaces (bytes)
PiperOrigin-RevId: 673027268
2024-09-10 10:59:17 -07:00
Jan Wassenberg 8c0a8834c1 Major compression update, arbitrary-len unpack + new Dot
Compression:
* Implement {any packed} x {bf16, f32} 'Load2' and DecompressAndZeroPad
* New compression test for all packed formats, add to GEMMA_TEST_FILES, remove from sfp/nuq_test
* Decompress->DecompressAndZeroPad, use PackedSpan for args with bounds checking
* NUQ: support arbitrary-length enc/dec
* New compression/shared, remove sfp.h and nuq.h
* Move Store2 into Traits and provide Compress2 wrapper
* Remove unused Decompress()-with-pool overload
* Simplify CompressedArrayLen, rename to CompressedArrayElements
* Remove unused DistortionStats b_l1_

Misc:
* Add compensated and Kahan dot, support any length (see the sketch after this entry)
* Use same Dot function everywhere
* Move exact arithmetic functions into fp_arith
* use FloatPtr and MatPtr typedefs in tests; less stack usage
* Rename args to packed/raw
* Remove Traits::Name, instead TypeName<T>()
* Move kMaxSFP and kClusters/kGroupSize into Sfp/NuqStream
PiperOrigin-RevId: 672868468
2024-09-10 02:22:19 -07:00
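The sketch referenced in the list above: a scalar Kahan-compensated dot product (the actual kernels are vectorized and handle arbitrary lengths via DecompressAndZeroPad).

```
#include <cstddef>

// Kahan dot: 'comp' carries the rounding error of the running sum, so
// the result is far closer to the exact dot than naive accumulation.
float KahanDot(const float* a, const float* b, size_t n) {
  float sum = 0.0f, comp = 0.0f;
  for (size_t i = 0; i < n; ++i) {
    const float y = a[i] * b[i] - comp;
    const float t = sum + y;
    comp = (t - sum) - y;  // what was lost when adding y
    sum = t;
  }
  return sum;
}
```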
Jan Wassenberg 5c0da8c8c3 Minor cleanup/fixes:
- optimize_test simplify prompt check
- Fix SFP arg case
- Fix includes
- Align inputs in test
- IsInside: add DASSERT
- Fix PerClusterPool NumThreads

PiperOrigin-RevId: 672530385
2024-09-09 06:58:09 -07:00
Jan Wassenberg c29e9752c7 Refactor/cleanup, remove even_odd
* New compression/shared.h, remove sfp.h
* Remove unused DistortionStats b_l1_
* Move exact arithmetic functions into fp_arith
* Remove even_odd optimization for MatVec (mostly unused)
* use BF16 typedef more widely
* Add kMaxSFP constant

PiperOrigin-RevId: 670996386
2024-09-04 09:25:13 -07:00
Jan Wassenberg 07c34cb18a Further nuq_test speedups to prevent timeout
PiperOrigin-RevId: 670863385
2024-09-04 00:49:44 -07:00
Jan Wassenberg 9661b81c4b Fix NUQ for SVE - incorrect nibble packing
Also speed up test

PiperOrigin-RevId: 670625545
2024-09-03 10:59:01 -07:00
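For background, a generic illustration of nibble packing (not the SVE-specific layout the commit above fixes): two 4-bit NUQ cluster indices share one byte, so a packing mismatch between encode and decode corrupts every other value.

```
#include <cstddef>
#include <cstdint>

// Pack two 4-bit indices per byte: even index in the low nibble, odd in
// the high nibble. Assumes n is even; illustrative layout only.
void PackNibbles(const uint8_t* idx, size_t n, uint8_t* packed) {
  for (size_t i = 0; i < n; i += 2) {
    packed[i / 2] = static_cast<uint8_t>((idx[i] & 0xF) | (idx[i + 1] << 4));
  }
}

void UnpackNibbles(const uint8_t* packed, size_t n, uint8_t* idx) {
  for (size_t i = 0; i < n; i += 2) {
    idx[i] = packed[i / 2] & 0xF;
    idx[i + 1] = packed[i / 2] >> 4;
  }
}
```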
Jan Wassenberg aa11ddf5fc 1.22x NUQ compress speedup, fix out of bounds access, improve numerics
Also clarify the cost computation and move toward supporting lengths that are not a multiple of the group size.

PiperOrigin-RevId: 670544245
2024-09-03 07:10:56 -07:00
Jan Wassenberg 4033ed9e78 Avoid duplication of RMSNorm, support all activation/weight types
Add test for RMSNorm
Rename VectorizedRopeAndMulBy -> RopeAndMulBy

Move test_util to util/

PiperOrigin-RevId: 668332927
2024-08-28 01:26:55 -07:00
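A scalar reference for the RMSNorm being unified above (the Gemma-style (1 + weight) scaling and the epsilon value are assumptions):

```
#include <cmath>
#include <cstddef>

// RMSNorm: scale x by the reciprocal of its root-mean-square, then by a
// learned per-channel weight, applied Gemma-style as (1 + w).
void RMSNorm(const float* x, const float* w, size_t n, float* out) {
  double ss = 0.0;  // sum of squares in f64
  for (size_t i = 0; i < n; ++i) ss += static_cast<double>(x[i]) * x[i];
  const float inv_rms =
      static_cast<float>(1.0 / std::sqrt(ss / n + 1e-6));  // eps assumed
  for (size_t i = 0; i < n; ++i) out[i] = x[i] * inv_rms * (1.0f + w[i]);
}
```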
Jan Wassenberg 2308514e5a Experiment with compensated dot product.
ULP difference vs exact is 0..1, vs 200-5000 for previous.
Runtime overhead is 2.5-4x for f32 input.

PiperOrigin-RevId: 668084019
2024-08-27 12:05:35 -07:00
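The 0..1 ULP behavior comes from tracking both rounding-error terms exactly; a scalar sketch in the style of Ogita-Rump-Oishi's Dot2 (assumes the compiler does not contract these expressions into FMAs on its own):

```
#include <cmath>
#include <cstddef>

// Compensated dot: std::fma recovers the exact rounding error of each
// product, and Knuth's TwoSum recovers the error of each addition. The
// extra arithmetic is the 2.5-4x overhead mentioned above.
float CompensatedDot(const float* a, const float* b, size_t n) {
  float sum = 0.0f, err = 0.0f;
  for (size_t i = 0; i < n; ++i) {
    const float p = a[i] * b[i];
    const float pe = std::fma(a[i], b[i], -p);   // product rounding error
    const float t = sum + p;
    const float v = t - sum;
    const float se = (sum - (t - v)) + (p - v);  // TwoSum rounding error
    sum = t;
    err += pe + se;
  }
  return sum + err;
}
```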
Jan Wassenberg 301dc8067a Major MatMul update, 1.9-2.3x speedup on Zen4 via bf16 mul
Supports converting all weight/activation formats to native MulT (bf16/f32)

Also:
- ConstMat/MutableMat for const correctness
- Move RowVectorBatch to allocator.h so it can be used from Matmul
- Add matmul.h so MatMulEnv can be used from Activations
- Remove kMaxThreads, detect from PerClusterPools
- Build fix: -inl.h files must be textual_hdrs, and highway.h should precede -inl.h

```
zen4 new
64, 24576, 3072, add=0, MatTA=bf16, MatTB=sfp:   616.6 GFLOPS.
64, 3072, 24576, add=0, MatTA=bf16, MatTB=sfp:   460.7 GFLOPS.
64, 24576, 3072, add=0, MatTA=f32, MatTB=sfp:    598.6 GFLOPS.
64, 3072, 24576, add=0, MatTA=f32, MatTB=sfp:    435.6 GFLOPS.

zen4 old
64, 24576, 3072, add=0, MatTA=f32, MatTB=sfp:    257.5 GFLOPS.
64, 3072, 24576, add=0, MatTA=f32, MatTB=sfp:    231.9 GFLOPS.
```

PiperOrigin-RevId: 663729812
2024-08-16 07:52:20 -07:00
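Why a bf16 multiply path is cheap: bf16 is just the upper half of an IEEE f32, so widening is a shift. A scalar illustration (Highway does this on whole vectors):

```
#include <cstdint>
#include <cstring>

// bf16 -> f32: place the 16 stored bits in the upper half of a 32-bit
// pattern and reinterpret. Truncation in the other direction drops the
// low 16 mantissa bits.
float BF16ToF32(uint16_t bits) {
  const uint32_t u = static_cast<uint32_t>(bits) << 16;
  float f;
  std::memcpy(&f, &u, sizeof(f));
  return f;
}
```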
The gemma.cpp Authors 6c57feb52f Automated Code Change
PiperOrigin-RevId: 663622838
2024-08-16 00:01:24 -07:00
Jan Wassenberg 8e028632f7 0.98x prefill: refactor in prep for cache blocking.
Slower because we now init tiles of C and accumulate into them.

Also remove unused var in optimize_test and use BF16 typedef.

PiperOrigin-RevId: 662115916
2024-08-12 09:26:29 -07:00
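The shape of the refactoring above, as a naive scalar sketch (the real kernel is vectorized): initialize a small tile of C, accumulate partial products into it while traversing K, then add it back, which is what cache blocking over K slices builds on.

```
#include <cstddef>

// Accumulate a 4x4 tile of C = A*B. ld* are row strides ("leading
// dimensions") of the respective matrices.
void Tile4x4(const float* A, const float* B, float* C, size_t K,
             size_t ldA, size_t ldB, size_t ldC) {
  float acc[4][4] = {};  // init the tile
  for (size_t k = 0; k < K; ++k) {
    for (size_t i = 0; i < 4; ++i) {
      for (size_t j = 0; j < 4; ++j) {
        acc[i][j] += A[i * ldA + k] * B[k * ldB + j];
      }
    }
  }
  for (size_t i = 0; i < 4; ++i) {
    for (size_t j = 0; j < 4; ++j) C[i * ldC + j] += acc[i][j];
  }
}
```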
Jan Wassenberg 1617e1a33d SFP speedup: 1.14x f32, 1.19x bf16 dot = 1.02x prefill
12->9 ops by recognizing the upper/lower bytes are simply shifted.

PiperOrigin-RevId: 659609241
2024-08-05 10:59:13 -07:00
Jan Wassenberg 6ea4232b2e MatMul cleanup: Mat struct, simplify args.
Add large benchmark to test, use 4 threads, skip some targets.
Also use Traits::Name instead of typeid.

PiperOrigin-RevId: 657496185
2024-07-30 01:55:50 -07:00
Thomas Fischbacher d9f86f8e4d Add Python code for converting Griffin Orbax weights. Refs #301
PiperOrigin-RevId: 657296255
2024-07-29 12:53:30 -07:00
The gemma.cpp Authors c1f243c351 Fix setting scales in Py binding
PiperOrigin-RevId: 655284183
2024-07-23 13:32:50 -07:00
Daniel Keysers e87e65ca45 Add scale parameter to MatMul.
Add accessor to CompressedArray that asserts the scale is 1 and use it.

PiperOrigin-RevId: 653604840
2024-07-18 06:58:56 -07:00
Daniel Keysers ff34370aac Simplify FFW by using MatMul_4x4_Batch_Add.
Affects only the Griffin model, where prefill TPS improves by about 70%.

PiperOrigin-RevId: 652878176
2024-07-16 09:41:23 -07:00
Andrey Vlasov 3e92088595 Remove allocation from GEMM_4x4_Tile when decoding compressed weights by implementing
SfpCodec::Dec2F and CompressTraits<T>::Decompress2 for all supported types. This also allows removing one of the specializations of GEMM_4x4_Tile, handling compressed MatB with one function. As before, even when MatA is bf16, 32-bit registers are used for computations.

Measurements for a 2b-it sfp-encoded model on an AMD Ryzen Threadripper PRO 3945WX (12 cores):
baseline:
```
32.6254 prefill tokens / sec
8.91429 tokens / sec
115 milliseconds time to first token
```
this change:
```
54.3045 prefill tokens / sec
16.8191 tokens / sec
56 milliseconds time to first token
```
PiperOrigin-RevId: 651369694
2024-07-11 05:13:39 -07:00
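A scalar analogue of the change above (the decode formula is a placeholder, not real SFP, and the names are hypothetical): decode two values at a time on the fly instead of materializing a decompressed row in a temporary heap buffer.

```
#include <cstddef>
#include <cstdint>

// Stand-in two-at-a-time decode; the real SfpCodec reconstructs floats
// from an 8-bit format.
inline void Decode2(const uint8_t* packed, size_t i, float& f0, float& f1) {
  f0 = static_cast<float>(packed[i]) * (1.0f / 128.0f) - 1.0f;
  f1 = static_cast<float>(packed[i + 1]) * (1.0f / 128.0f) - 1.0f;
}

// Dot against compressed data without a per-call decompression buffer.
float DotCompressed(const uint8_t* packed, const float* a, size_t n) {
  float sum = 0.0f;
  for (size_t i = 0; i + 1 < n; i += 2) {
    float f0, f1;
    Decode2(packed, i, f0, f1);
    sum += a[i] * f0 + a[i + 1] * f1;
  }
  return sum;
}
```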
Kan Wu f519ab6693 Refactor configurables.
PiperOrigin-RevId: 651259154
2024-07-10 21:30:58 -07:00
Jan Wassenberg cbb67b4ee0 Move benchmark_helper to evals/, weights_raw to compression/.
PiperOrigin-RevId: 650155983
2024-07-08 01:13:23 -07:00
Jan Wassenberg f823371691 Cleanup: move util/compress and convert_weights to compression/
Also remove unused models/, lint convert_weights

PiperOrigin-RevId: 649613088
2024-07-05 04:16:52 -07:00
Jan Wassenberg 41efec4dba Add Py bindings for weight compression
TODO: this uses clif instead of pybind11, and depends on absl.

PiperOrigin-RevId: 649575815
2024-07-05 01:06:00 -07:00
Jan Wassenberg d3c6a45b59 Major duplicated code reduction in test/benchmarks
Helper functions to tokenize/wrap
Move LayersOutputFunc into RuntimeConfig
AcceptFunc passes the probability
Implement StringFromType using the parser, and verify results match

PiperOrigin-RevId: 643255119
2024-06-14 00:16:25 -07:00
Jan Wassenberg a0e808e341 Add compression/ comments, especially on SFP range
PiperOrigin-RevId: 642238720
2024-06-11 05:47:49 -07:00
Jan Wassenberg 5c3e5f7038 Remove no longer required stats.h - use Highway version instead
PiperOrigin-RevId: 640440379
2024-06-05 01:37:48 -07:00
Paul Chang 175e389c3c Revert to HWY_ASSERT for lane constraints, qualify hn::Add
PiperOrigin-RevId: 640193239
2024-06-04 10:10:18 -07:00
Jan Wassenberg 4f9155d8c6 Add bf16 matmul support, update naming+test
Avoid int32, which can easily overflow for large matrices.
Also fix IDE warning in sfp-inl.

PiperOrigin-RevId: 640149845
2024-06-04 07:41:46 -07:00
Zoltan Szabadka 36e4d8bbfe Add first version of backpropagation support.
This is still in progress / experimental: it is currently only
implemented for normal Gemma MQA attention layers, and no
parallelism has been added yet for the backward pass.

Since we need to remember all activations from all layers, the
forward pass was also reimplemented with a new activation data
structure.
2024-06-04 08:37:49 +00:00
Jan Wassenberg a44cbdadc2 Update to Highway 1.2 for topology/VQSelect
Also fix unused-warning in compress-inl.

PiperOrigin-RevId: 639116915
2024-05-31 12:29:10 -07:00
Paul Chang c0643577c3 Minor internal refactoring.
PiperOrigin-RevId: 635852078
2024-05-21 10:29:59 -07:00
Paul Chang cfce314715 Make BlobWriter::Add() accept const void*
PiperOrigin-RevId: 634780483
2024-05-17 08:11:06 -07:00
Jan Wassenberg 22fe9809ac Fix SVE build: add missing hn::
PiperOrigin-RevId: 632481097
2024-05-10 06:49:26 -07:00
Jan Wassenberg c5c9fc300c Enable even/odd for SFP. Refs #166
Disable it for float32 because there is not enough benefit.

PiperOrigin-RevId: 631788326
2024-05-08 07:09:06 -07:00
Jan Wassenberg f6d02b2870 Fix RecurrentGemma (refs #166) - one Dot was ignoring scale.
Remove extra Dot() overload
MatVecAdd always adds; use MatVecT<kAdd> if the add is conditional.
Remove unused MatVecAddLoop and MatVecLoop
No longer tsan-verify even_odd

PiperOrigin-RevId: 631377279
2024-05-07 04:40:42 -07:00
Jan Wassenberg b5a9ade75f 2x speedup of SFP decode (1.4x overall) on AVX3_DL+.
Thanks @nzmichaelh for suggesting table lookups!

PiperOrigin-RevId: 631337524
2024-05-07 01:46:43 -07:00
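A scalar illustration of the table-lookup idea above (the real speedup comes from AVX3_DL in-register byte shuffles, and this decode formula is a placeholder, not the actual SFP mapping): precompute all 256 decoded values once, so per-byte decode becomes a single load.

```
#include <array>
#include <cstddef>
#include <cstdint>

// Build a 256-entry table mapping each 8-bit code to its decoded float.
std::array<float, 256> MakeDecodeTable() {
  std::array<float, 256> table;
  for (int code = 0; code < 256; ++code) {
    table[code] = static_cast<float>(code) * (1.0f / 128.0f) - 1.0f;
  }
  return table;
}

void DecodeRow(const uint8_t* packed, size_t n,
               const std::array<float, 256>& lut, float* out) {
  for (size_t i = 0; i < n; ++i) out[i] = lut[packed[i]];
}
```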
Zoltan Szabadka 429eb78512 Remove unused vars. 2024-05-03 13:37:17 +00:00
Sam Kaufman f608337fef Remove Bf16ToF32EO and use PromoteEvenTo and PromoteOddTo. 2024-04-29 14:13:07 -07:00