gemma.cpp

Commit Graph

Author	SHA1	Message	Date
Jan Wassenberg	7263ab8445	MatMul simplification, threading strategy improvements. remove MatMul f32 special case (smaller code), types: Add u32/u64 for use by Activations; move renamed ParallelismStrategy to threading_context so can pass ctx; ensure worker index is unique across clusters; matmul.h: const member functions for renamed policy classes (easier to call). PiperOrigin-RevId: 802848086	2025-09-03 21:45:07 -07:00
Marie White	74ffe079c4	Create separate MMStorage objects per cluster. PiperOrigin-RevId: 802588625	2025-09-03 09:35:48 -07:00
Jan Wassenberg	b7b3d353db	Simplify MatMul: remove F32 special case (build time). Also move kMaxM into separate kMaxBatchSize. PiperOrigin-RevId: 802086590	2025-09-02 04:29:21 -07:00
Marie White	973e284ed6	Refactor Matmul to use a policy class for parallelization. PiperOrigin-RevId: 800864489	2025-08-29 05:40:39 -07:00
Jan Wassenberg	7288891439	Remove F64 partial storage in matmul. Also remove no longer used kMaxN; row_ptrs only used for C. PiperOrigin-RevId: 800774757	2025-08-29 00:12:08 -07:00
Jan Wassenberg	31c09cca4c	f32 LoopKC: 1.37x(M=512), 1.19(M=128) single-K F32,BF16 matmul speedup on SKX. Add a special case for A=F32,B=BF16, used when there is no native bf16 dot product. dot-inl: ensure bf16,f32 and f32,bf16 both get promoted to float before f64 summation. matmul.cc: update autotuning to reflect actual A size. matmul_test: add all combinations of bf16/f32, report all results, not just first difference, check non-vector-aligned K. PiperOrigin-RevId: 800487817	2025-08-28 08:55:50 -07:00
Jan Wassenberg	701841897b	Default to disabling per-socket parallelization. weights: default to Read for small-batch (only look at qbatch, not the larger prefill tbatch). PiperOrigin-RevId: 790787643	2025-08-04 09:49:14 -07:00
Jan Wassenberg	0f70f285e0	1.1x prefill and decode speedup (attention/activations). Optimizations: Better load-balancing in attention threading (Previously, clusters were limited by #heads); Add MulByConstTo to avoid zero-init; Parallel activations. Cleanup: Prepare for RowPtr in A or B; Pass through thread_id to ops; Avoid warning in bench_matmul. PiperOrigin-RevId: 773723423	2025-06-20 08:59:53 -07:00
Jan Wassenberg	2c72ff2aa5	Fix MatMul issue caused by autotuning bucketing, refs #608, thanks @ufownl. PiperOrigin-RevId: 771077158	2025-06-13 06:58:42 -07:00
Jan Wassenberg	01cdefeda7	1.64x batch=1 prefill speedup: nested parallelization for Attention (DotSoftmaxWeightedSum). Also fix tsan error in matmul (atomic_flag instead of static). PiperOrigin-RevId: 770241705	2025-06-11 11:28:46 -07:00
Jan Wassenberg	6ee628ba38	Further cleanup: separate MatMulEnv arg; move row_ptrs into MatMulEnv. Consistent arg order: layer, activations, kv_cache, env. PiperOrigin-RevId: 767886386	2025-06-05 20:48:32 -07:00
Jan Wassenberg	839a642992	Fix paligemma_test, refs #588. Detect PaliGemma models from layer names. Remove unused allocator arg from CreateInvTimescale. matmul: only warn once about dim divisibility. Print config also in tests if --verbosity 2. PiperOrigin-RevId: 766605131	2025-06-03 04:45:22 -07:00
Jan Wassenberg	cb188d4a0e	Fix RowT issue and improve Griffin (currently still broken). Use type-safe MatPtrT via dynamic_cast, avoid/remove unsafe RowT. activations: Griffin tensors are now padded. Griffin: add batching support, fix conv1d_cache allocation. weights: bundle to TensorToRead, add kNoPad flag, fix SplitW1. const-correct fix for ForEachTensor. blob_store: move BlobIO2 to .cc and rename BlobIO. PiperOrigin-RevId: 760610094	2025-05-19 07:02:10 -07:00
Jan Wassenberg	e890d46f30	1.31x batch prefill, 1.24x batch decode speedup: NUMA binding. Only the weights; binding MatMul output worsens batch=1 prefill. Update gemma_batch_bench to use --decode_qbatch. Fix/remove prefill_activations in gemma-inl.h. Refactor: use BasePageBytes directly when binding. Move BindB/C to .cc by de-templatizing. Remove MatOwners::AllocateFor because it is weights-specific (binding or not). Disband MatOwners, replace with vector. PiperOrigin-RevId: 759610477	2025-05-16 07:42:13 -07:00
Jan Wassenberg	275135d7e8	Rename-only: remove Allocator2 etc suffixes now that refactoring is complete. PiperOrigin-RevId: 755397220	2025-05-06 09:12:43 -07:00
Jan Wassenberg	8532da47f7	Major refactor of allocator/args: use new ThreadingContext2 instead of monostate/init in each frontend. Add ThreadingArgs (replaces AppArgs). backprop: use Packed() accessor and MakePacked factory and row-based access to allow for stride. compress_weights: remove, moving to py-only exporter instead. Move MatPtr to mat.h and revise interface: Generic MatOwner; rename accessors to Packed*; support stride/row accessors, fix RowPtr stride. Add TypeBits(Type). Move GenerateMat to test_util-inl for sharing between matmul test/bench. Move internal init to gemma.cc to avoid duplication. Rename GemmaEnv model_ to gemma_ for disambiguating vs upcoming ModelStorage. Remove --compressed_weights, use --weights instead. tensor_index: add ExtentsFromInfo and TensorIndexLLM/Img. Allocator: use normal unique_ptr for AllocBytes so users can call directly. threading: use -> because AlignedPtr no longer assumes arrays. PiperOrigin-RevId: 745918637	2025-04-10 01:29:54 -07:00
Jan Wassenberg	e55734219d	Fix test threshold and improve warning output. PiperOrigin-RevId: 740738937	2025-03-26 06:11:27 -07:00
Jan Wassenberg	1b72c22345	Refactor Gemma ctor and improve pool NUMA support. Gemma receives a MatMulEnv arg, with comment on lifetime. Split threading into topology so the latter can be used in allocator. Add AllocClasses() for non-POD (ThreadPool). Support binding pool to NUMA node. Update threading_test with latency measurements. Also update Highway version. PiperOrigin-RevId: 736904748	2025-03-14 10:19:00 -07:00
Jan Wassenberg	2bdf26d81d	Support bf16 output of Matmul. Adds Stride to ConstMat, to support decompression of C output for test. matmul_test: add line numbers to output. Also ignore "N is not a multiple of nc" when N==nc. PiperOrigin-RevId: 731096662	2025-02-25 17:53:20 -08:00
Jan Wassenberg	f9d93e4a42	Matmul rewrite: fp64 sums, hierarchical parallelization, cache-blocking, autotuning. Remove empty matmul_unit_test. Up to 25 TFLOP/s on 2xZen4 for 512,3072,24576. PiperOrigin-RevId: 729123576	2025-02-20 08:33:46 -08:00

20 Commits