gemma.cpp

Commit Graph

Author	SHA1	Message	Date
Jan Wassenberg	c4398fc72d	Infra improvements: allocator: support mmap, fixed Bind, add padding bench_matmul: Add PreventElision BUILD: add ops_test build target matmul.h: move ConstMat here; dynamic alloc of MatMulEnv matmul_test: remove benchmarking replace fprintf with HWY_WARN threading.cc: support splitting large clusters (disabled); package_idx->pkg_idx, smaller IndexRangePartition PiperOrigin-RevId: 717512274	2025-01-20 06:22:49 -08:00
Ray Smith	b93231a47d	Moved the vit config fields to their own config struct PiperOrigin-RevId: 715692800	2025-01-15 01:09:49 -08:00
Daniel Keysers	aed17396be	Make prompt wrapping more consistent and fix duplicated tokens for multi-turn. Do not echo <end_of_turn> tokens to the user. Have verbosity=0 only show the dialog. PiperOrigin-RevId: 705021391	2024-12-11 01:52:00 -08:00
Daniel Keysers	719699f132	Make top_k a runtime argument (instead of a model argument). PiperOrigin-RevId: 696170691	2024-11-13 09:48:59 -08:00
Daniel Keysers	e54d9cbddd	Fix Griffin model: - use HalfRope position encodings - zero-initialize the caches for each Generate at position 0 The lack of the latter made the tests in gemma_test dependent on each other. PiperOrigin-RevId: 694509054	2024-11-08 08:30:53 -08:00
Jan Wassenberg	868b01601f	Simpler MatMul interface, vocab types, Tristate for use_spinning Add Extents2D, Range2D vocab types Matmul uses ConstMat for inputs and RowPtr for output Move RowVectorBatch to basics.h Separate threading.cc Fix topology string: report cores not LPs, and #HT Move QStride/IsMHA into LayerConfig ImageTokens does not require make_unique. matmul_test: no longer require template args PiperOrigin-RevId: 692963605	2024-11-04 07:48:29 -08:00
Jan Wassenberg	02ce1e344f	Use NestedPools, add NUMA infra Improved threading.h, fix thread counts for single package/cluster systems Temporarily forces to a single socket. Prefill 29.28 tps, decode 6.92. Also fix benchmarks.cc build, update tensor allocator to Allocator PiperOrigin-RevId: 687307167	2024-10-18 08:11:18 -07:00
Daniel Keysers	c6384574db	Fix PaliGemma's GenerateImageTokensT(). Move image related config values from LayerConfig to ModelConfig. Minor changes: Add a few comments, remove gcpp:: qualification where it wasn't needed in a few places, define local constants in VitAttention.DotSoftmaxWeightedSum() PiperOrigin-RevId: 687210519	2024-10-18 01:34:13 -07:00
Ray Smith	0d68555f87	Eliminated TConfig. Changed CompressedLayer and CompressedWeights to be constructed with an instance of a LayerConfig and WeightsConfig respectively. Added CompressedModel to remove ByteStorageT and get rid of most of the type casting, as well as allowing the default destructor to be used and work properly. Adjusted WeightsWrapper and ForwardLayer etc to match. The only remaining template arg is the weight type. This enables all the instantiations to be deleted, apart from one per type. It also enables (but not yet done) the config to be stored in the blob file instead of having to be specified separately. Reduces the size of the gemma_lib and weights shared libraries by a factor of 4.3 and 3.2 respectively. PiperOrigin-RevId: 686870060	2024-10-17 05:04:22 -07:00
Daniel Keysers	a4d6adbc43	Introduce QueryResult in GemmaEnv and add a shortcut for WrapAndTokenize. Remove max_tokens (and rely on only max_generated_tokens). PiperOrigin-RevId: 685662260	2024-10-14 04:45:21 -07:00
Jan Wassenberg	6ab3ff5bde	Minor cleanup, Windows+Bazel build fixes add app.h comment compress-inl: remove unused typedef gemma-inl: add missing HWY_ATTR and cast separate sum-inl.h and basics.h headers replace more hwy::bfloat16_t with BF16 update include pragmas update dot_test thresholds update Highway version in Bazel for HWY_RCAST_ALIGNED fix PiperOrigin-RevId: 684464326	2024-10-10 09:05:06 -07:00
Ray Smith	85958f5fd3	Added MatPtr/MatPtrT/MatStorageT/MatStorage as a dynamically-sized replacement for CompressedArray. Definition of array size is moved to the constructor. Allocation is separate and parallelized. All users of weights_raw.h migrated to CompressedWeights and weights_raw.h deleted. Replaced all previous ForEachTensor functions with a single unified function. PiperOrigin-RevId: 684451604	2024-10-10 08:22:30 -07:00
Jan Wassenberg	bd53b0f7c3	Fix MSAN issue for multiturn. Rewind the prior EOS token. Also move MaybeCheckInitialized to allocator.h PiperOrigin-RevId: 683187458	2024-10-07 08:07:54 -07:00
Jan Wassenberg	96d2ab7d31	Minor fix to profiler zone and add comment PiperOrigin-RevId: 681350546	2024-10-02 01:37:50 -07:00
Jan Wassenberg	7d9fcda0d8	-467ms startup: parallel Reshape Also split Softmax into Argmax helper, add comments; add profiler zones + fix IDE warning PiperOrigin-RevId: 680954573	2024-10-01 04:11:35 -07:00
Jan Wassenberg	2d14d796e3	1.09x decode speedup for topk=1/temp0: fuse softmax and sample PiperOrigin-RevId: 680589099	2024-09-30 08:37:41 -07:00
RangerUFO	d1010337c3	Fix prefix-LM mode assertion	2024-09-25 22:22:28 +08:00
Daniel Keysers	f8835fe4a4	Add support for PaliGemma Vision-LM (224x224) to gemma.cpp See https://arxiv.org/abs/2407.07726 for a description of the model. Because PaliGemma operates as a prefix-LM on the image+prompt, add support for that. PiperOrigin-RevId: 677841119	2024-09-23 10:09:38 -07:00
Jan Wassenberg	8c0a8834c1	Major compression update, arbitrary-len unpack + new Dot Compression: * Implement {any packed} x {bf16, f32} 'Load2' and DecompressAndZeroPad * New compression test for all packed formats, add to GEMMA_TEST_FILES, remove from sfp/nuq_test * Decompress->DecompressAndZeroPad, use PackedSpan for args with bounds checking * NUQ: support arbitrary-length enc/dec * New compression/shared, remove sfp.h and nuq.h * Move Store2 into Traits and provide Compress2 wrapper * Remove unused Decompress()-with-pool overload * Simplify CompressedArrayLen, rename to CompressedArrayElements * Remove unused DistortionStats b_l1_ Misc: * Add compensated and Kahan dot, support any length * Use same Dot function everywhere * Move exact arithmetic functions into fp_arith * use FloatPtr and MatPtr typedefs in tests; less stack usage * Rename args to packed/raw * Remove Traits::Name, instead TypeName<T>() * Move kMaxSFP and kClusters/kGroupSize into Sfp/NuqStream PiperOrigin-RevId: 672868468	2024-09-10 02:22:19 -07:00
Jan Wassenberg	c29e9752c7	Refactor/cleanup, remove even_odd * New compression/shared.h, remove sfp.h * Remove unused DistortionStats b_l1_ * Move exact arithmetic functions into fp_arith * Remove even_odd optimization for MatVec (mostly unused) * use BF16 typedef more widely * Add kMaxSFP constant PiperOrigin-RevId: 670996386	2024-09-04 09:25:13 -07:00
Zoltan Szabadka	f6abbab3a4	Fix asan failure in local attention computation. PiperOrigin-RevId: 670207380	2024-09-02 07:06:10 -07:00
Jan Wassenberg	4033ed9e78	Avoid duplication of RMSNorm, support all activation/weight types Add test for RMSNorm Rename VectorizedRopeAndMulBy -> RopeAndMulBy Move test_util to util/ PiperOrigin-RevId: 668332927	2024-08-28 01:26:55 -07:00
Daniel Keysers	18e6012872	Fix prefill for batched queries. This lets gemma_test/GeographyBatched pass now also for gemma2-27B. PiperOrigin-RevId: 664827485	2024-08-19 08:50:42 -07:00
Apoorv Reddy	c6eb3b6f0d	VectorizedRopeAndMulBy. ~8x reduction (tested on few prompts) in Rope. ~3.8% prefill latency improvement. ~2.6% decode latency improvement. PiperOrigin-RevId: 664650108	2024-08-18 23:17:01 -07:00
Jan Wassenberg	301dc8067a	Major MatMul update, 1.9-2.3x speedup on Zen4 via bf16 mul Supports converting all weight/activation formats to native MulT (bf16/f32) Also: - ConstMat/MutableMat for const correctness - Move RowVectorBatch to allocator.h so it can be used from Matmul - Add matmul.h so MatMulEnv can be used from Activations - Remove kMaxThreads, detect from PerClusterPools - Build fix: -inl.h files must be textual_hdrs, and highway.h should precede -inl.h ``` zen4 new 64, 24576, 3072, add=0, MatTA=bf16, MatTB=sfp: 616.6 GFLOPS. 64, 3072, 24576, add=0, MatTA=bf16, MatTB=sfp: 460.7 GFLOPS. 64, 24576, 3072, add=0, MatTA=f32, MatTB=sfp: 598.6 GFLOPS. 64, 3072, 24576, add=0, MatTA=f32, MatTB=sfp: 435.6 GFLOPS. zen4 old 64, 24576, 3072, add=0, MatTA=f32, MatTB=sfp: 257.5 GFLOPS. 64, 3072, 24576, add=0, MatTA=f32, MatTB=sfp: 231.9 GFLOPS. ``` PiperOrigin-RevId: 663729812	2024-08-16 07:52:20 -07:00
Paul Chang	b9ed12a325	Support directly observing activations, partially replacing LayersOutputFunc LayersOutputFunc is no longer invoked for "blocks" and "final_norm" outputs. Instead, we directly expose the Activations structure. PiperOrigin-RevId: 663409316	2024-08-15 12:39:07 -07:00
Jan Wassenberg	22995c699d	Simplify pos handling, auto-increment output arg - no longer multiply by num_queries - remove unused interleaved prompts - Rename to Queries* - Rename batch_start/interleaved_pos/pos to queries_pos PiperOrigin-RevId: 663331823	2024-08-15 09:25:26 -07:00
Copybara-Service	6763afcd1c	Merge pull request #348 from ufownl:feature/start_pos_per_query_reopen PiperOrigin-RevId: 662533529	2024-08-13 08:51:06 -07:00
RangerUFO	8c634f6486	Fix the position calculation issue in the generation phase	2024-08-12 18:50:23 +02:00
RangerUFO	730b6bfc94	Implement `start_pos` per query for batch interface	2024-08-12 18:50:23 +02:00
Jan Wassenberg	b831fa8482	1.3x prefill, 0.95x decode: matmul replacing last matvec Before 38.28, 9.17 (with profiler enabled, prompt = 330 tok) ``` Gen.FFW : 15414 x 4692352 = 24.166318 Gen.Attention.SumHeads : 15414 x 1394804 = 7.183451 !! Gen.Embedding : 361 x 49961894 = 6.026297 Gen.Attention.QKV : 15414 x 1005125 = 5.176546 Gen.Attention.DotSoftmax : 15414 x 885480 = 4.560357 RopeAndMulBy : 696528 x 11867 = 2.761818 ``` After 49.80, 8.68 ``` Gen.FFW : 14448 x 5312783 = 25.646868 Gen.Embedding : 338 x 63044815 = 7.119845 Gen.Attention.QKV : 14448 x 1115003 = 5.382557 Gen.Attention.DotSoftmax : 14448 x 897577 = 4.332957 RopeAndMulBy : 673344 x 11886 = 2.674156 Gen.Attention.SumHeads : 14448 x 518291 = 2.501993 !! ``` PiperOrigin-RevId: 662024085	2024-08-12 03:36:01 -07:00
Jan Wassenberg	2ebbe4076f	1.03-1.08x decode speedup: precompute Rope theta, fuse Split attention into functions, move into class. Fuse Rope and MulBy, allow non-in-place version to avoid copy from q to KV. Sink if() into MaybeLogitsSoftCap. PiperOrigin-RevId: 661168418	2024-08-09 01:23:24 -07:00
The gemma.cpp Authors	27258b03e6	Improve performance logging PiperOrigin-RevId: 660534330	2024-08-07 14:15:43 -07:00
Jan Wassenberg	5e433e774a	1.1x prefill speedup, revamp threading in preparation for hierarchical parallelism. Limit thread counts to detected. Add max_clusters arg. Update detection logic to check for smt0 - previously we pinned to some siblings. PiperOrigin-RevId: 659755311	2024-08-05 18:50:09 -07:00
Jan Wassenberg	a24eda8d02	Split matmul into matvec; add large matrix benchmark Rename var names to row/col for more clarity. Better estimate error tolerance via max abs col sum. PiperOrigin-RevId: 657601791	2024-07-30 08:29:11 -07:00
Paul Chang	d37c088e44	Extend LayersOutputFunc to take query index and auxillary int PiperOrigin-RevId: 657574814	2024-07-30 06:53:56 -07:00
Jan Wassenberg	8b4915f321	Fix Windows build - macro conflict with param name PiperOrigin-RevId: 657518587	2024-07-30 03:22:32 -07:00
Jan Wassenberg	6ea4232b2e	MatMul cleanup: Mat struct, simplify args. Add large benchmark to test, use 4 threads, skip some targets. Also use Traits::Name instead of typeid. PiperOrigin-RevId: 657496185	2024-07-30 01:55:50 -07:00
Jan Wassenberg	f27683152c	1.05x prefill speedup: matvec -> matmul for !MHA Also add C_stride and make shape normal non-template arguments. PiperOrigin-RevId: 657285945	2024-07-29 12:18:06 -07:00
Jan Wassenberg	2721f54446	Add offset arg to MatMul, rename, Matmul for logits = ~1.1x decode speedup PiperOrigin-RevId: 657167257	2024-07-29 05:34:26 -07:00
Jan Wassenberg	aaf51898b6	Major revamp #2 of Prefill: fix token order, parallel for multi-query - Allocate only the required KV caches and activation batch size - Add flags for batch sizes - Const-correct interface: Span of const int. - Also clean up the KVCache arg to a span. - Move kPrefillBatchSize into RuntimeConfig and remove related global constants. PiperOrigin-RevId: 655893197	2024-07-25 03:28:55 -07:00
Daniel Keysers	2346b5a434	Minor polishing: adding comments, renaming variables. PiperOrigin-RevId: 655235006	2024-07-23 11:17:44 -07:00
Daniel Keysers	33334ad454	Fix msan uninitialized scale in optimize_test PiperOrigin-RevId: 654817460	2024-07-22 10:50:25 -07:00
Jan Wassenberg	85cac13fb1	Split up ops.h into ops/ops-inl and matmul-inl PiperOrigin-RevId: 654068303	2024-07-19 11:21:48 -07:00
Jan Wassenberg	5844e6a1e5	Cleanup: add wrapper functions and rename vars to interleaved Simplifies the TransformerLayer function. Use interleaved* instead of _and_queries. PiperOrigin-RevId: 653929449	2024-07-19 02:04:11 -07:00
Jan Wassenberg	12016d31c3	Major Prefill/Generate cleanup, 1.3x Prefill speedup This fixes TTFT, which was not including prefill. PiperOrigin-RevId: 653690626	2024-07-18 11:16:46 -07:00
Daniel Keysers	e87e65ca45	Add scale parameter to MatMul. Add accessor to CompressedArray that asserts the scale is 1 and use it. PiperOrigin-RevId: 653604840	2024-07-18 06:58:56 -07:00
Jan Wassenberg	992a2cbbc0	De-templatize Activations, add RowVectorBatch class Also remove most kBatchSize args. PiperOrigin-RevId: 653185525	2024-07-17 04:38:15 -07:00
Daniel Keysers	ff34370aac	Simplify FFW by using MatMul_4x4_Batch_Add. Affects only the griffin model, where prefill TPS improves by about 70%. PiperOrigin-RevId: 652878176	2024-07-16 09:41:23 -07:00
Kan Wu	f519ab6693	Refactor configurables. PiperOrigin-RevId: 651259154	2024-07-10 21:30:58 -07:00

1 2

52 Commits