gemma.cpp

Commit Graph

Author	SHA1	Message	Date
Jan Wassenberg	2308514e5a	Experiment with compensated dot product. ULP difference vs exact is 0..1, vs 200-5000 for previous. Runtime overhead is 2.5-4x for f32 input. PiperOrigin-RevId: 668084019	2024-08-27 12:05:35 -07:00
Jan Wassenberg	301dc8067a	Major MatMul update, 1.9-2.3x speedup on Zen4 via bf16 mul. Supports converting all weight/activation formats to native MulT (bf16/f32). Also: ConstMat/MutableMat for const correctness; move RowVectorBatch to allocator.h so it can be used from Matmul; add matmul.h so MatMulEnv can be used from Activations; remove kMaxThreads, detect from PerClusterPools; build fix: -inl.h files must be textual_hdrs, and highway.h should precede -inl.h. zen4 new: 64, 24576, 3072, add=0, MatTA=bf16, MatTB=sfp: 616.6 GFLOPS; 64, 3072, 24576, add=0, MatTA=bf16, MatTB=sfp: 460.7 GFLOPS; 64, 24576, 3072, add=0, MatTA=f32, MatTB=sfp: 598.6 GFLOPS; 64, 3072, 24576, add=0, MatTA=f32, MatTB=sfp: 435.6 GFLOPS. zen4 old: 64, 24576, 3072, add=0, MatTA=f32, MatTB=sfp: 257.5 GFLOPS; 64, 3072, 24576, add=0, MatTA=f32, MatTB=sfp: 231.9 GFLOPS. PiperOrigin-RevId: 663729812	2024-08-16 07:52:20 -07:00
Jan Wassenberg	1617e1a33d	SFP speedup: 1.14x f32, 1.19x bf16 dot = 1.02x prefill. 12->9 ops by recognizing the upper/lower bytes are simply shifted. PiperOrigin-RevId: 659609241	2024-08-05 10:59:13 -07:00
Andrey Vlasov	3e92088595	Remove allocation from GEMM_4x4_Tile when decoding compressed weights by implementing SfpCodec::Dec2F and ComressTraits<T>::Decompress2 for all supported types. It also allows removing one of the specializations of GEMM_4x4_Tile, handling compressed MatB with one function. As before, even when MatA is bf16 it uses 32-bit registers for computations. Measurements for a 2b-it sfp-encoded model on an AMD Ryzen Threadripper PRO 3945WX 12-Cores: baseline: 32.6254 prefill tokens / sec, 8.91429 tokens / sec, 115 milliseconds time to first token; this change: 54.3045 prefill tokens / sec, 16.8191 tokens / sec, 56 milliseconds time to first token. PiperOrigin-RevId: 651369694	2024-07-11 05:13:39 -07:00
Jan Wassenberg	a0e808e341	Add compression/ comments, especially on SFP range PiperOrigin-RevId: 642238720	2024-06-11 05:47:49 -07:00
Jan Wassenberg	4f9155d8c6	Add bf16 matmul support, update naming+test. Avoid int32, which can easily overflow for large matrices. Also fix IDE warning in sfp-inl. PiperOrigin-RevId: 640149845	2024-06-04 07:41:46 -07:00
Jan Wassenberg	a44cbdadc2	Update to Highway 1.2 for topology/VQSelect. Also fix unused-warning in compress-inl. PiperOrigin-RevId: 639116915	2024-05-31 12:29:10 -07:00
Jan Wassenberg	c5c9fc300c	Enable even/odd for SFP. Refs #166. Disable it for float32 because there is not enough benefit. PiperOrigin-RevId: 631788326	2024-05-08 07:09:06 -07:00
Jan Wassenberg	b5a9ade75f	2x speedup of SFP decode (1.4x overall) on AVX3_DL+. Thanks @nzmichaelh for suggesting table lookups! PiperOrigin-RevId: 631337524	2024-05-07 01:46:43 -07:00
Jan Wassenberg	a982ec1287	Move code to gemma/ so we can remove error-prone copybara: comments. Also fix includes and Lint warnings. PiperOrigin-RevId: 623127487	2024-04-09 04:45:42 -07:00
Jan Wassenberg	24add61dd9	Fix SFP/NUQ for bf16 rounding in Highway. SFP: Avoid rounding twice, and more robust TestDot. NUQ: also more robust SNR, minor touchups to header. PiperOrigin-RevId: 618030096	2024-03-21 19:06:19 -07:00
Austin Huang	e29cd566cf	initial commit	2024-02-21 03:31:22 +00:00

12 Commits