gemma.cpp

Commit Graph

Author	SHA1	Message	Date
Jan Wassenberg	2d14d796e3	1.09x decode speedup for topk=1/temp0: fuse softmax and sample PiperOrigin-RevId: 680589099	2024-09-30 08:37:41 -07:00
Jan Wassenberg	5e812f07f5	Use f64 Dot and sum in softmax - faster than Cascaded. Also let the kernel specify the Raw and State types, rename WeightT/VecT -> WT/VT. PiperOrigin-RevId: 680464427	2024-09-30 01:22:09 -07:00
Jan Wassenberg	35fdf848c7	Cascaded summation for Softmax. This can affect generation results after a few hundred tokens. Also remove profiler from DecompressAndCall, use Add instead of +=, use PackedSpan for args and remove alignment requirement. Changing accumulation order in AssimilateCascadedSums updates dot_test thresholds. PiperOrigin-RevId: 676891797	2024-09-20 10:31:23 -07:00
Daniel Keysers	03f0ee2323	Add tests for SampleTopK that highlight existing problems and fix those: - Sampling was not correct for k>1 and temperature=0. - Sampling was not correct for only negative logits. Also restructure the code a bit for better readability and add some asserts for things that shouldn't happen. PiperOrigin-RevId: 676043267	2024-09-18 10:32:01 -07:00
Daniel Keysers	892f3bbcbe	Implement scalar version of LayerNorm PiperOrigin-RevId: 675085495	2024-09-16 03:54:10 -07:00
Jan Wassenberg	8c0a8834c1	Major compression update, arbitrary-len unpack + new Dot Compression: * Implement {any packed} x {bf16, f32} 'Load2' and DecompressAndZeroPad * New compression test for all packed formats, add to GEMMA_TEST_FILES, remove from sfp/nuq_test * Decompress->DecompressAndZeroPad, use PackedSpan for args with bounds checking * NUQ: support arbitrary-length enc/dec * New compression/shared, remove sfp.h and nuq.h * Move Store2 into Traits and provide Compress2 wrapper * Remove unused Decompress()-with-pool overload * Simplify CompressedArrayLen, rename to CompressedArrayElements * Remove unused DistortionStats b_l1_ Misc: * Add compensated and Kahan dot, support any length * Use same Dot function everywhere * Move exact arithmetic functions into fp_arith * use FloatPtr and MatPtr typedefs in tests; less stack usage * Rename args to packed/raw * Remove Traits::Name, instead TypeName<T>() * Move kMaxSFP and kClusters/kGroupSize into Sfp/NuqStream PiperOrigin-RevId: 672868468	2024-09-10 02:22:19 -07:00
Jan Wassenberg	4033ed9e78	Avoid duplication of RMSNorm, support all activation/weight types. Add test for RMSNorm. Rename VectorizedRopeAndMulBy -> RopeAndMulBy. Move test_util to util/. PiperOrigin-RevId: 668332927	2024-08-28 01:26:55 -07:00
Jan Wassenberg	b6d0ca8a14	Minor followup: remainder handling is a single iteration. Also add profiler annotations. PiperOrigin-RevId: 667883774	2024-08-27 01:19:44 -07:00
Apoorv Reddy	48d0801fb0	Vectorize Rope for qkv dim not evenly divisible by number of lanes. PiperOrigin-RevId: 665776602	2024-08-21 02:22:22 -07:00
Apoorv Reddy	c6eb3b6f0d	VectorizedRopeAndMulBy. ~8x reduction (tested on few prompts) in Rope. ~3.8% prefill latency improvement. ~2.6% decode latency improvement. PiperOrigin-RevId: 664650108	2024-08-18 23:17:01 -07:00
Jan Wassenberg	2ebbe4076f	1.03-1.08x decode speedup: precompute Rope theta, fuse Split attention into functions, move into class. Fuse Rope and MulBy, allow non-in-place version to avoid copy from q to KV. Sink if() into MaybeLogitsSoftCap. PiperOrigin-RevId: 661168418	2024-08-09 01:23:24 -07:00
Jan Wassenberg	85cac13fb1	Split up ops.h into ops/ops-inl and matmul-inl PiperOrigin-RevId: 654068303	2024-07-19 11:21:48 -07:00


62 Commits
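
Several of the commits above concern summation accuracy in dot products and softmax ("compensated and Kahan dot", "Cascaded summation", "f64 Dot and sum"). As a rough illustration of the general idea only, the sketch below compares a plain float dot product with a textbook Kahan (compensated) one. It is not taken from gemma.cpp: the names NaiveDot and KahanDot are hypothetical, the code is scalar rather than vectorized, and it assumes strict IEEE float arithmetic (for example, no -ffast-math, which may optimize the compensation away).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper, not from gemma.cpp: plain float accumulation.
// Small terms can be lost entirely once the running sum is much larger.
float NaiveDot(const std::vector<float>& a, const std::vector<float>& b) {
  assert(a.size() == b.size());
  float sum = 0.0f;
  for (size_t i = 0; i < a.size(); ++i) {
    sum += a[i] * b[i];
  }
  return sum;
}

// Hypothetical helper, not from gemma.cpp: textbook Kahan summation applied
// to the products. `comp` carries the rounding error of the previous add so
// that it can be fed back into the next term.
float KahanDot(const std::vector<float>& a, const std::vector<float>& b) {
  assert(a.size() == b.size());
  float sum = 0.0f;
  float comp = 0.0f;  // running compensation for lost low-order bits
  for (size_t i = 0; i < a.size(); ++i) {
    const float term = a[i] * b[i] - comp;  // apply the earlier correction
    const float next = sum + term;          // low bits of term may be lost
    comp = (next - sum) - term;             // recover what was lost
    sum = next;
  }
  return sum;
}
```

With an input made of one large product followed by many much smaller ones, the naive loop can drop the small contributions entirely, while the compensated loop retains them. Accumulating in double, as the "f64 Dot" commit title suggests, is another way to trade a wider accumulator for accuracy.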