Commit Graph

23 Commits

Jan Wassenberg 275135d7e8 Rename-only: remove the 2 suffixes (Allocator2 etc.) now that the refactoring is complete
PiperOrigin-RevId: 755397220
2025-05-06 09:12:43 -07:00
Jan Wassenberg 8532da47f7 Major refactor of allocator/args:
use the new ThreadingContext2 instead of monostate/init in each frontend
Add ThreadingArgs (replaces AppArgs)

backprop: use the Packed() accessor and MakePacked factory, plus row-based access to allow for stride
compress_weights: remove, moving to py-only exporter instead

Move MatPtr to mat.h and revise interface:
- Generic MatOwner
- rename accessors to Packed*
- support stride/row accessors, fix RowPtr stride

Add TypeBits(Type)
Move GenerateMat to test_util-inl for sharing between matmul test/bench
Move internal init to gemma.cc to avoid duplication
Rename GemmaEnv model_ to gemma_ to disambiguate from the upcoming ModelStorage
Remove --compressed_weights, use --weights instead.
tensor_index: add ExtentsFromInfo and TensorIndexLLM/Img
Allocator: use normal unique_ptr for AllocBytes so users can call directly
threading: use -> because AlignedPtr no longer assumes arrays
PiperOrigin-RevId: 745918637
2025-04-10 01:29:54 -07:00
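
A hedged sketch of the stride-aware row access described by the MatPtr/RowPtr items above; the class name matches the commit, but the layout and method names are assumptions, not the actual mat.h interface.

```
#include <cstddef>

// Stride-aware view of a row-major matrix: stride_ may exceed cols_ when
// rows are padded for alignment.
template <typename T>
class RowPtr {
 public:
  RowPtr(T* row0, size_t cols, size_t stride)
      : row0_(row0), cols_(cols), stride_(stride) {}

  T* Row(size_t r) const { return row0_ + r * stride_; }
  size_t Cols() const { return cols_; }
  size_t Stride() const { return stride_; }

 private:
  T* row0_;
  size_t cols_;
  size_t stride_;  // elements between row starts; >= cols_
};
```
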
Phil Culliton 4ab601da10 Internal change.
PiperOrigin-RevId: 736015810
2025-03-11 23:20:20 -07:00
Jan Wassenberg a60b564b88 Infra improvements (2)
ops.h: move CreateInvTimescale to allow calling without depending on gemma
Pass around MatMulEnv instead of pools to avoid re-creating the env
profiler.h can now be used outside SIMD code
allocator: add StepBytes and QuantumSteps
rename worker threads to include the package/cluster in the name
threading: add Visit* to IndexRange
PiperOrigin-RevId: 718766704
2025-01-23 01:55:19 -08:00
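
A minimal sketch of what a Visit-style helper on IndexRange could look like; the Visit* name comes from the commit above, but this body is an assumption.

```
#include <cstddef>

struct IndexRange {
  size_t begin;
  size_t end;  // half-open: [begin, end)

  // Calls func(i) for each index in the range.
  template <class Func>
  void VisitEach(const Func& func) const {
    for (size_t i = begin; i < end; ++i) func(i);
  }
};
```
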
Daniel Keysers f37402da57 Add parameter for base_frequency to CreateInvTimeScale().
Extract a few local variables to make code easier to read (hopefully).

PiperOrigin-RevId: 718749053
2025-01-23 00:56:44 -08:00
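
For reference, the standard RoPE inverse-timescale computation with an explicit base frequency, which the commit above parameterizes; a sketch assuming the usual theta_i = base^(-2i/dim) definition, not the repo's exact signature.

```
#include <cmath>
#include <cstddef>
#include <vector>

std::vector<float> InvTimescales(size_t qkv_dim, double base_frequency) {
  const size_t half = qkv_dim / 2;
  std::vector<float> inv(half);
  for (size_t i = 0; i < half; ++i) {
    // RoPE: inv_timescale[i] = base_frequency^(-2i / qkv_dim).
    inv[i] = static_cast<float>(std::pow(
        base_frequency, -2.0 * static_cast<double>(i) /
                            static_cast<double>(qkv_dim)));
  }
  return inv;
}
```
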
Jan Wassenberg c4398fc72d Infra improvements:
allocator: support mmap, fixed Bind, add padding
bench_matmul: Add PreventElision
BUILD: add ops_test build target
matmul.h: move ConstMat here; dynamic alloc of MatMulEnv
matmul_test: remove benchmarking
replace fprintf with HWY_WARN
threading.cc: support splitting large clusters (disabled); package_idx->pkg_idx, smaller IndexRangePartition
PiperOrigin-RevId: 717512274
2025-01-20 06:22:49 -08:00
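
A small illustration of the row-padding idea mentioned above (rounding each row up to an aligned stride); kAlignBytes and the helper name are assumptions.

```
#include <cstddef>

constexpr size_t kAlignBytes = 64;  // assumed alignment (cache line)

// Returns the padded stride in elements for a row of `cols` elements of
// `elt_bytes` each. Assumes elt_bytes divides kAlignBytes (1/2/4/8 do).
constexpr size_t PaddedStride(size_t cols, size_t elt_bytes) {
  const size_t row_bytes = cols * elt_bytes;
  const size_t padded_bytes =
      (row_bytes + kAlignBytes - 1) / kAlignBytes * kAlignBytes;
  return padded_bytes / elt_bytes;
}

static_assert(PaddedStride(3072, 4) == 3072, "already aligned");
static_assert(PaddedStride(3070, 4) == 3072, "rounded up to 64 bytes");
```
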
Jan Wassenberg 868b01601f Simpler MatMul interface, vocab types, Tristate for use_spinning
Add Extents2D, Range2D vocab types
Matmul uses ConstMat for inputs and RowPtr for output
Move RowVectorBatch to basics.h
Separate threading.cc
Fix topology string: report cores not LPs, and #HT
Move QStride/IsMHA into LayerConfig
ImageTokens does not require make_unique.
matmul_test: no longer require template args
PiperOrigin-RevId: 692963605
2024-11-04 07:48:29 -08:00
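
A minimal sketch of an Extents2D vocab type as named above: it replaces loose (rows, cols) parameter pairs at MatMul call sites. Illustrative only; the repo's definition may differ.

```
#include <cstddef>

struct Extents2D {
  size_t rows, cols;
  size_t Area() const { return rows * cols; }
};
```
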
Jan Wassenberg 02ce1e344f Use NestedPools, add NUMA infra
Improved threading.h, fix thread counts for single package/cluster systems
Temporarily restricts execution to a single socket. Prefill 29.28 tps, decode 6.92 tps.

Also fix the benchmarks.cc build and update the tensor allocator to Allocator

PiperOrigin-RevId: 687307167
2024-10-18 08:11:18 -07:00
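
A conceptual sketch of the package -> cluster -> worker nesting behind NestedPools, purely for orientation; these are not the repo's types.

```
#include <cstddef>
#include <vector>

struct Cluster {
  size_t num_workers;  // cores sharing a cluster (e.g. an L3/CCX)
};

struct Package {
  std::vector<Cluster> clusters;  // one package = one socket
};

struct Topology {
  std::vector<Package> packages;

  size_t TotalWorkers() const {
    size_t n = 0;
    for (const Package& p : packages)
      for (const Cluster& c : p.clusters) n += c.num_workers;
    return n;
  }
};
```
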
Ray Smith 0d68555f87 Eliminated TConfig.
Changed CompressedLayer and CompressedWeights to be constructed with an instance of a LayerConfig and WeightsConfig respectively.
Added CompressedModel to remove ByteStorageT and get rid of most of the type casting, as well as allowing the default destructor to be used and work properly.
Adjusted WeightsWrapper, ForwardLayer, etc. to match.
The only remaining template arg is the weight type.
This enables all the instantiations to be deleted, apart from one per type.
It also enables (but not yet done) the config to be stored in the blob file instead of having to be specified separately.
Reduces the sizes of the gemma_lib and weights shared libraries by factors of 4.3 and 3.2, respectively.

PiperOrigin-RevId: 686870060
2024-10-17 05:04:22 -07:00
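
A hedged sketch of the direction described above: per-layer and per-model configuration carried as runtime values instead of a TConfig template parameter, which lets one destructor and one instantiation per weight type suffice. Field names are assumptions.

```
#include <cstddef>
#include <vector>

struct LayerConfig {
  size_t model_dim, ff_hidden_dim, heads, kv_heads, qkv_dim;
};

struct WeightsConfig {
  size_t num_layers;
  std::vector<LayerConfig> layers;  // one entry per layer
};
```
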
Jan Wassenberg 6ab3ff5bde Minor cleanup, Windows+Bazel build fixes
add app.h comment
compress-inl: remove unused typedef
gemma-inl: add missing HWY_ATTR and cast
separate sum-inl.h and basics.h headers
replace more hwy::bfloat16_t with BF16
update include pragmas
update dot_test thresholds
update Highway version in Bazel for HWY_RCAST_ALIGNED fix
PiperOrigin-RevId: 684464326
2024-10-10 09:05:06 -07:00
Daniel Keysers f8835fe4a4 Add support for PaliGemma Vision-LM (224x224) to gemma.cpp
See https://arxiv.org/abs/2407.07726 for a description of the model.
Because PaliGemma operates as a prefix-LM on the image+prompt, add support for prefix-LM attention.

PiperOrigin-RevId: 677841119
2024-09-23 10:09:38 -07:00
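
A hedged sketch of the prefix-LM attention rule referred to above: prefix positions (image + prompt) attend bidirectionally among themselves, while generated positions attend causally. The function is illustrative, not the repo's code.

```
#include <cstddef>

// Returns whether query_pos may attend to key_pos, given that
// [0, prefix_end) holds the image + prompt tokens.
bool CanAttend(size_t query_pos, size_t key_pos, size_t prefix_end) {
  if (query_pos < prefix_end) return key_pos < prefix_end;  // bidirectional
  return key_pos <= query_pos;                              // causal
}
```
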
Jan Wassenberg c29e9752c7 Refactor/cleanup, remove even_odd
* New compression/shared.h, remove sfp.h
* Remove unused DistortionStats b_l1_
* Move exact arithmetic functions into fp_arith
* Remove even_odd optimization for MatVec (mostly unused)
* use BF16 typedef more widely
* Add kMaxSFP constant

PiperOrigin-RevId: 670996386
2024-09-04 09:25:13 -07:00
Jan Wassenberg 301dc8067a Major MatMul update, 1.9-2.3x speedup on Zen4 via bf16 mul
Supports converting all weight/activation formats to native MulT (bf16/f32)

Also:
- ConstMat/MutableMat for const correctness
- Move RowVectorBatch to allocator.h so it can be used from Matmul
- Add matmul.h so MatMulEnv can be used from Activations
- Remove kMaxThreads, detect from PerClusterPools
- Build fix: -inl.h files must be textual_hdrs, and highway.h should precede -inl.h

```
zen4 new
64, 24576, 3072, add=0, MatTA=bf16, MatTB=sfp:   616.6 GFLOPS.
64, 3072, 24576, add=0, MatTA=bf16, MatTB=sfp:   460.7 GFLOPS.
64, 24576, 3072, add=0, MatTA=f32, MatTB=sfp:    598.6 GFLOPS.
64, 3072, 24576, add=0, MatTA=f32, MatTB=sfp:    435.6 GFLOPS.

zen4 old
64, 24576, 3072, add=0, MatTA=f32, MatTB=sfp:    257.5 GFLOPS.
64, 3072, 24576, add=0, MatTA=f32, MatTB=sfp:    231.9 GFLOPS.
```

PiperOrigin-RevId: 663729812
2024-08-16 07:52:20 -07:00
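
For context, a scalar sketch of the f32 -> bf16 conversion (round-to-nearest-even) needed to feed a native bf16 MulT; the production path would use Highway's vectorized conversions, and NaN handling is omitted here.

```
#include <cstdint>
#include <cstring>

// Truncates a float to bfloat16 bits with round-to-nearest-even.
// NaN handling omitted for brevity.
uint16_t F32ToBF16(float f) {
  uint32_t bits;
  std::memcpy(&bits, &f, sizeof(bits));
  const uint32_t rounding = 0x7FFFu + ((bits >> 16) & 1u);
  return static_cast<uint16_t>((bits + rounding) >> 16);
}
```
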
Jan Wassenberg b831fa8482 1.3x prefill, 0.95x decode: matmul replacing last matvec
Before: 38.28 prefill tps, 9.17 decode tps (profiler enabled, prompt = 330 tokens)
```
Gen.FFW                                 :      15414 x         4692352 = 24.166318
Gen.Attention.SumHeads                  :      15414 x         1394804 =  7.183451 !!
Gen.Embedding                           :        361 x        49961894 =  6.026297
Gen.Attention.QKV                       :      15414 x         1005125 =  5.176546
Gen.Attention.DotSoftmax                :      15414 x          885480 =  4.560357
RopeAndMulBy                            :     696528 x           11867 =  2.761818
```

After: 49.80 prefill tps, 8.68 decode tps
```
Gen.FFW                                 :      14448 x         5312783 = 25.646868
Gen.Embedding                           :        338 x        63044815 =  7.119845
Gen.Attention.QKV                       :      14448 x         1115003 =  5.382557
Gen.Attention.DotSoftmax                :      14448 x          897577 =  4.332957
RopeAndMulBy                            :     673344 x           11886 =  2.674156
Gen.Attention.SumHeads                  :      14448 x          518291 =  2.501993 !!
```
PiperOrigin-RevId: 662024085
2024-08-12 03:36:01 -07:00
Jan Wassenberg 2ebbe4076f 1.03-1.08x decode speedup: precompute Rope theta, fuse
Split attention into functions, move into class.
Fuse Rope and MulBy; allow a non-in-place version to avoid a copy from q to KV.
Sink if() into MaybeLogitsSoftCap.

PiperOrigin-RevId: 661168418
2024-08-09 01:23:24 -07:00
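
A hedged sketch of the fused RopeAndMulBy described above, writing to a separate output so q need not be copied into the KV cache. The half-split pairing of dimensions follows the usual RoPE convention; the signature is an assumption.

```
#include <cmath>
#include <cstddef>

void RopeAndMulBy(float mul, const float* x, size_t dim,
                  const float* inv_timescale, size_t pos, float* out) {
  const size_t half = dim / 2;
  for (size_t i = 0; i < half; ++i) {
    const float theta = static_cast<float>(pos) * inv_timescale[i];
    const float c = std::cos(theta), s = std::sin(theta);
    // Rotate the (x[i], x[i + half]) pair, then apply the fused scale.
    out[i] = mul * (x[i] * c - x[i + half] * s);
    out[i + half] = mul * (x[i] * s + x[i + half] * c);
  }
}
```
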
Jan Wassenberg 992a2cbbc0 De-templatize Activations, add RowVectorBatch class
Also remove most kBatchSize args.

PiperOrigin-RevId: 653185525
2024-07-17 04:38:15 -07:00
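
A minimal sketch of a RowVectorBatch-style container: the batch size becomes a runtime value instead of a kBatchSize template argument. Names and layout are assumptions.

```
#include <cstddef>
#include <vector>

template <typename T>
class RowVectorBatch {
 public:
  RowVectorBatch(size_t batch_size, size_t cols)
      : batch_size_(batch_size), cols_(cols), data_(batch_size * cols) {}

  // Returns a pointer to the i-th row vector in the batch.
  T* Batch(size_t i) { return data_.data() + i * cols_; }
  size_t BatchSize() const { return batch_size_; }
  size_t Cols() const { return cols_; }

 private:
  size_t batch_size_, cols_;
  std::vector<T> data_;
};
```
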
Daniel Keysers ff34370aac Simplify FFW by using MatMul_4x4_Batch_Add.
Affects only the Griffin model, where prefill TPS improves by about 70%.

PiperOrigin-RevId: 652878176
2024-07-16 09:41:23 -07:00
Jan Wassenberg c7c3daa624 7x compile time speedup: shard gemma.cc
Use overloaded functions defined in gemma/instantiations.
Also split out activations.h.

PiperOrigin-RevId: 649053122
2024-07-03 06:35:04 -07:00
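
The sharding pattern, sketched as a hypothetical multi-file layout (macro and file names are illustrative, not the repo's actual ones): each tiny translation unit pins one weight type and includes the shared implementation, so the compiler builds one instantiation per file instead of all of them in gemma.cc.

```
// Hypothetical layout; one weight type per translation unit.

// --- gemma/instantiations/f32.cc ---
#define GEMMA_WEIGHT_T float            // assumed macro name
#include "gemma/gemma-inl.h"            // emits overloads for GEMMA_WEIGHT_T

// --- gemma/instantiations/sfp.cc ---
#define GEMMA_WEIGHT_T SfpStream
#include "gemma/gemma-inl.h"
```
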
Jan Wassenberg 09a7e75ead Prep for sharding gemma.cc: split into kv_cache, tokenizer.
Move activations.h to backprop/ to make space for another activations.h.

PiperOrigin-RevId: 648744500
2024-07-02 09:31:06 -07:00
Zoltan Szabadka c004799cdc Add Adam optimizer.
Drive-by: Fix compilation errors and tests for backprop functions.
2024-06-06 18:41:36 +00:00
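
The standard Adam update, for reference; a self-contained scalar sketch whose hyperparameter names and defaults follow the original paper, not necessarily the repo's implementation.

```
#include <cmath>
#include <cstddef>

// One Adam step over n parameters; t is the 1-based step count,
// m and v are the running first and second moments.
void AdamStep(float* w, const float* grad, float* m, float* v, size_t n,
              size_t t, float lr = 1e-3f, float beta1 = 0.9f,
              float beta2 = 0.999f, float eps = 1e-8f) {
  const float bc1 = 1.0f - std::pow(beta1, static_cast<float>(t));
  const float bc2 = 1.0f - std::pow(beta2, static_cast<float>(t));
  for (size_t i = 0; i < n; ++i) {
    m[i] = beta1 * m[i] + (1.0f - beta1) * grad[i];
    v[i] = beta2 * v[i] + (1.0f - beta2) * grad[i] * grad[i];
    const float mhat = m[i] / bc1;  // bias-corrected first moment
    const float vhat = v[i] / bc2;  // bias-corrected second moment
    w[i] -= lr * mhat / (std::sqrt(vhat) + eps);
  }
}
```
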
Jan Wassenberg 57c2cd8b52 Simplifications: remove GemmaInterface and GemmaImpl
Split common and weights into separate lib
Remove common-inl (it does not have to be SIMD code) and activations.cc
Centralize switch(Model) to avoid duplication
Move CompressWeightsT to compress_weights.cc
Move LoadWeights to weights.cc

PiperOrigin-RevId: 640869202
2024-06-06 05:54:21 -07:00
Zoltan Szabadka 8567978541 Address review comments 2024-06-04 08:37:54 +00:00
Zoltan Szabadka 36e4d8bbfe Add first version of backpropagation support.
This is still in progress / experimental: it is currently implemented
only for normal Gemma MQA attention layers, and the backward pass is
not yet parallelized.

Since we need to remember all activations from all layers, the
forward pass was also reimplemented with a new activation data
structure.
2024-06-04 08:37:49 +00:00
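
A minimal sketch of the "remember all activations from all layers" structure the commit above describes for the backward pass; field names are illustrative assumptions.

```
#include <cstddef>
#include <vector>

struct LayerActivations {
  std::vector<float> pre_attention;  // input to the attention block
  std::vector<float> attention_out;
  std::vector<float> ffw_hidden;
};

struct ForwardPass {
  std::vector<LayerActivations> layers;  // one entry per layer
  std::vector<float> final_norm_out;
};
```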