Commit Graph

401 Commits

Author SHA1 Message Date
RangerUFO cc2e14e654 Improve `GemmaChatTemplate` to handle vision prompt wrapping 2025-03-29 11:31:40 +08:00
RangerUFO c39295f497 Inline the ctor of `GemmaChatTemplate` 2025-03-29 11:31:40 +08:00
RangerUFO d1615b56b2 Fix the prompt wrapping of gemma3-1b again
It seems that the previous fix was changed back due to a merge error.
2025-03-29 11:31:39 +08:00
RangerUFO ca4ee2b63f Refactor `WrapAndTokenize` to work properly with Gemma3 2025-03-29 11:31:39 +08:00
RangerUFO d42deaa27c Set the secondary EOS for Gemma2
So that we can remove the `<end_of_turn>` filter that was set up
specifically for Gemma2.
2025-03-22 01:32:22 +08:00
RangerUFO 2bad79f110 Fix the EOS checking
The secondary EOS is usually `<end_of_turn>`, which can legitimately appear in the
prompt, so we only check for it outside the prompt.
2025-03-22 01:32:22 +08:00
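
A minimal sketch of the check described in the commit above: the secondary EOS (typically `<end_of_turn>`) only terminates generation once the position is past the prompt. The function and parameter names here are illustrative, not the actual gemma.cpp API.

```cpp
#include <cstddef>

// Sketch only: treat the secondary EOS (e.g. <end_of_turn>) as a stop signal
// only after the prompt has been consumed, since it may appear in the prompt.
bool IsEos(int token, size_t pos, size_t prompt_size,
           int primary_eos, int secondary_eos) {
  if (token == primary_eos) return true;  // primary EOS always stops
  return pos >= prompt_size && token == secondary_eos;
}
```
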
Phil Culliton 05b1cce9f7 Add support for a secondary EOS token
PiperOrigin-RevId: 738898976
2025-03-20 12:28:31 -07:00
Jan Wassenberg 83219e3c68 Add note on attention length and SFP
PiperOrigin-RevId: 738698399
2025-03-20 00:39:06 -07:00
RangerUFO b16ce9a0b4 Fix the prompt wrapping of gemma3-1b 2025-03-18 16:52:38 +08:00
Jan Wassenberg 1b72c22345 Refactor Gemma ctor and improve pool NUMA support
Gemma receives a MatMulEnv arg, with comment on lifetime
Split threading into topology so the latter can be used in allocator
Add AllocClasses() for non-POD (ThreadPool)
Support binding pool to NUMA node
Update threading_test with latency measurements
Also update Highway version.

PiperOrigin-RevId: 736904748
2025-03-14 10:19:00 -07:00
Phil Culliton 1b1b63d560 Fix PaliGemma models.
PiperOrigin-RevId: 736483021
2025-03-13 06:28:29 -07:00
Phil Culliton 4ab601da10 Internal change.
PiperOrigin-RevId: 736015810
2025-03-11 23:20:20 -07:00
Phil Culliton 9d83ff202e Internal change.
PiperOrigin-RevId: 736014152
2025-03-11 23:10:48 -07:00
Jan Wassenberg 2bdf26d81d Support bf16 output of Matmul
Adds Stride to ConstMat, to support decompression of C output for test
matmul_test: add line numbers to output
Also ignore "N is not a multiple of nc" when N==nc
PiperOrigin-RevId: 731096662
2025-02-25 17:53:20 -08:00
Jan Wassenberg b3b4b9f92f With new matmul, much larger batch sizes are advantageous, default to 256.
Can still override via command line argument.

PiperOrigin-RevId: 730502653
2025-02-24 10:21:58 -08:00
Jan Wassenberg f9d93e4a42 Matmul rewrite: fp64 sums, hierarchical parallelization, cache-blocking, autotuning
Remove empty matmul_unit_test.
Up to 25 TFLOP/s on 2xZen4 for 512,3072,24576.

PiperOrigin-RevId: 729123576
2025-02-20 08:33:46 -08:00
Apoorv Reddy 0e5b59d24d Implements FusedSoftmaxAndSampleTopK.
This computes softmax on the top-K logits, instead of computing softmax first and then taking the top-K probabilities, which also avoids renormalizing. Additionally, softmax now applies temperature scaling when temp != 1.0.

PiperOrigin-RevId: 727702149
2025-02-16 21:30:06 -08:00
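
A minimal sketch of the fused top-K softmax-and-sample idea from the commit above: pick the K largest logits first, apply temperature and softmax to only those K values, then sample, so no full-vocabulary softmax or renormalization is needed. Names and signature are illustrative, not the gemma.cpp implementation.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <random>
#include <vector>

// Sketch only: softmax restricted to the top-K logits, with temperature.
int SampleTopK(const std::vector<float>& logits, size_t k, float temperature,
               std::mt19937& gen) {
  // Indices of the K largest logits.
  std::vector<int> idx(logits.size());
  for (size_t i = 0; i < idx.size(); ++i) idx[i] = static_cast<int>(i);
  std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(k),
                    idx.end(),
                    [&](int a, int b) { return logits[a] > logits[b]; });

  // Softmax over only the top K, with temperature scaling.
  std::vector<float> probs(k);
  const float max_logit = logits[idx[0]];
  float sum = 0.0f;
  for (size_t i = 0; i < k; ++i) {
    probs[i] = std::exp((logits[idx[i]] - max_logit) / temperature);
    sum += probs[i];
  }
  for (float& p : probs) p /= sum;

  std::discrete_distribution<int> dist(probs.begin(), probs.end());
  return idx[dist(gen)];  // token id of the sampled top-K entry
}
```
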
Copybara-Service c495b25995 Merge pull request #493 from ufownl:bugfix/compress_weights_le
PiperOrigin-RevId: 725585921
2025-02-11 05:10:13 -08:00
Apoorv Reddy 64cf6dfe0a Using TimingInfo methods and cleaning up args to DecodeStepT
PiperOrigin-RevId: 725580125
2025-02-11 04:49:14 -08:00
Apoorv Reddy 780e376023 Add KVCache.DeepCopy(). This will be useful for implementing sampling functionality such as beam sampling, parallel sampling, and CoT decoding (à la https://arxiv.org/abs/2402.10200)
PiperOrigin-RevId: 725156316
2025-02-10 04:10:29 -08:00
Apoorv Reddy 9b3e7ea8a2 Factor out DecodeStepT from GenerateT into a separate function.
This will be useful for adding sampling functionality like beam decoding, parallel sampling, and CoT decoding (as described in the [Chain-of-Thought Reasoning Without Prompting paper](https://arxiv.org/abs/2402.10200))

PiperOrigin-RevId: 725151530
2025-02-10 03:53:08 -08:00
RangerUFO 3a5a6dbcad Fix the link error when building `compress_weights` with Clang on macOS 2025-02-09 00:13:25 +08:00
Jan Wassenberg b18bd781f6 Windows build fixes: struct vs class, unused arg/var, avoid VLA, Deleter arg, casts
PiperOrigin-RevId: 724340518
2025-02-07 07:38:55 -08:00
Phil Culliton 7ccc6abe87 Allow conversion, loading and inference with NUQ.
PiperOrigin-RevId: 723507890
2025-02-05 07:45:54 -08:00
Daniel Keysers bcdb0d65bd Assorted small cleanups.
PiperOrigin-RevId: 720548132
2025-01-28 06:09:45 -08:00
Daniel Keysers e997468496 Apply PositionalEncodingQK always in-place.
PiperOrigin-RevId: 718851803
2025-01-23 07:09:30 -08:00
Apoorv Reddy ce807a31a1 internal change
PiperOrigin-RevId: 718824952
2025-01-23 05:31:11 -08:00
Jan Wassenberg a60b564b88 Infra improvements (2)
ops.h: move CreateInvTimescale to allow calling without depending on gemma
Pass around MatMulEnv instead of pools to avoid re-creating the env
profiler.h can now be used outside SIMD code
allocator: add StepBytes and QuantumSteps
rename worker thread with package/cluster in the name
threading: add Visit* to IndexRange
PiperOrigin-RevId: 718766704
2025-01-23 01:55:19 -08:00
Daniel Keysers f37402da57 Add parameter for base_frequency to CreateInvTimeScale().
Extract a few local variables to make code easier to read (hopefully).

PiperOrigin-RevId: 718749053
2025-01-23 00:56:44 -08:00
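
For reference, the standard rotary-embedding timescale formula that a base-frequency parameter feeds into is inv_timescale[i] = base^(-2i/d) for i in [0, d/2). A scalar sketch follows; the actual repository signature and layout are not verified here.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of RoPE inverse timescales with a configurable base frequency:
// inv_timescale[i] = base_frequency^(-2i / qkv_dim), for i in [0, qkv_dim/2).
std::vector<float> CreateInvTimeScale(size_t qkv_dim, double base_frequency) {
  const size_t half = qkv_dim / 2;
  std::vector<float> inv_timescale(half);
  for (size_t i = 0; i < half; ++i) {
    const double exponent = static_cast<double>(2 * i) / qkv_dim;
    inv_timescale[i] = static_cast<float>(std::pow(base_frequency, -exponent));
  }
  return inv_timescale;
}
```
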
Phil Culliton 9646edc908 Internal change
PiperOrigin-RevId: 717916568
2025-01-21 07:53:49 -08:00
Jan Wassenberg c4398fc72d Infra improvements:
allocator: support mmap, fixed Bind, add padding
bench_matmul: Add PreventElision
BUILD: add ops_test build target
matmul.h: move ConstMat here; dynamic alloc of MatMulEnv
matmul_test: remove benchmarking
replace fprintf with HWY_WARN
threading.cc: support splitting large clusters (disabled); package_idx->pkg_idx, smaller IndexRangePartition
PiperOrigin-RevId: 717512274
2025-01-20 06:22:49 -08:00
Daniel Keysers 493688f6f1 Allow interactive use with new single-file weight format.
Add section about new weights format to README.md.
Remove model_type_required parameter.
Update error handling for flags.

PiperOrigin-RevId: 715788822
2025-01-15 07:22:33 -08:00
Ray Smith b93231a47d Moved the vit config fields to their own config struct
PiperOrigin-RevId: 715692800
2025-01-15 01:09:49 -08:00
Ray Smith 9d40f0117e Added ability to load/save a complete model file, including tokenizer.
PiperOrigin-RevId: 707914366
2024-12-19 07:59:41 -08:00
Daniel Keysers 62c70d6715 Rename ModelTraining to PromptWrapping which is a more accurate name.
PiperOrigin-RevId: 705881500
2024-12-13 07:45:59 -08:00
Ray Smith 6254f2e5ca Removed duplicated tensor sizes from weights.h by changing the constructor used for MatPtrT
PiperOrigin-RevId: 705085054
2024-12-11 06:30:28 -08:00
Daniel Keysers aed17396be Make prompt wrapping more consistent and fix duplicated tokens for multi-turn.
Do not echo <end_of_turn> tokens to the user.
Have verbosity=0 only show the dialog.

PiperOrigin-RevId: 705021391
2024-12-11 01:52:00 -08:00
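
A sketch of the instruction-tuned Gemma turn structure that consistent prompt wrapping maintains across multi-turn dialogs; closing the previous model reply before opening the next user turn is what avoids duplicated tokens. This mirrors the publicly documented Gemma chat format, but the helper below is illustrative rather than the repository's WrapAndTokenize.

```cpp
#include <string>

// Sketch only: wrap one user turn in the Gemma chat-turn markers. For turns
// after the first, the prior model reply is closed with <end_of_turn> before
// the next user turn is opened.
std::string WrapUserTurn(const std::string& user_text, bool first_turn) {
  std::string wrapped;
  if (!first_turn) wrapped += "<end_of_turn>\n";  // close the prior model turn
  wrapped += "<start_of_turn>user\n" + user_text + "<end_of_turn>\n";
  wrapped += "<start_of_turn>model\n";            // model reply starts here
  return wrapped;
}
```
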
Ray Smith e69bc3bc1c Added the TensorInfo arg to the compressor so the shape and scale can be output correctly to the file in the future.
Corrected some errors in the TensorIndex.

PiperOrigin-RevId: 705014619
2024-12-11 01:26:35 -08:00
Copybara-Service d8135e836f Merge pull request #460 from ericcurtin:common
PiperOrigin-RevId: 704684454
2024-12-10 06:33:37 -08:00
Daniel Keysers 331d2ccc02 Add support for 448px resolution to PaliGemma and PaliGemma2.
PiperOrigin-RevId: 704361579
2024-12-09 11:38:10 -08:00
Eric Curtin a971088ac2 Refactor `gemma/common.cc` to improve readability and safety
Use `std::size` for array size calculations. Replace C-style
string manipulations with `std::string` methods. Simplify
`std::transform` usage for case conversion.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2024-12-09 16:36:25 +00:00
Phil Culliton 9dfe2a76be Internal change
PiperOrigin-RevId: 702961613
2024-12-04 20:41:47 -08:00
Ray Smith 3d1625d8c5 Improved consistency of compressor API, and added a universal method with a target type arg.
Moved configs pybind up to root level.

PiperOrigin-RevId: 698743417
2024-11-21 05:27:40 -08:00
Ray Smith 73640d2521 Added tensor_index as a single source of truth on tensor shapes/sources and transformations
PiperOrigin-RevId: 697903886
2024-11-19 00:25:39 -08:00
Ray Smith 7d685a267f Added pybind for configs.
Added ability to test configs for equality.

PiperOrigin-RevId: 697572671
2024-11-18 04:03:51 -08:00
Daniel Keysers 719699f132 Make top_k a runtime argument (instead of a model argument).
PiperOrigin-RevId: 696170691
2024-11-13 09:48:59 -08:00
Daniel Keysers e54d9cbddd Fix Griffin model:
- use HalfRope position encodings
- zero-initialize the caches for each Generate at position 0

The lack of the latter made the tests in gemma_test dependent on each other.

PiperOrigin-RevId: 694509054
2024-11-08 08:30:53 -08:00
Jan Wassenberg 868b01601f Simpler MatMul interface, vocab types, Tristate for use_spinning
Add Extents2D, Range2D vocab types
Matmul uses ConstMat for inputs and RowPtr for output
Move RowVectorBatch to basics.h
Separate threading.cc
Fix topology string: report cores not LPs, and #HT
Move QStride/IsMHA into LayerConfig
ImageTokens does not require make_unique.
matmul_test: no longer require template args
PiperOrigin-RevId: 692963605
2024-11-04 07:48:29 -08:00
Daniel Keysers 583bd93e9a Factor out addition of ViTConfig to a ModelConfig.
Use ModelConfig values for ImageTokens.
Output timing info for image token generation.
Add a method to copy image data into Image class directly.
Minor changes: pipe ModelTraining to more places.

PiperOrigin-RevId: 690572283
2024-10-28 05:29:33 -07:00
Jan Wassenberg 02ce1e344f Use NestedPools, add NUMA infra
Improved threading.h, fix thread counts for single package/cluster systems
Temporarily forces to a single socket. Prefill 29.28 tps, decode 6.92.

Also fix benchmarks.cc build, update tensor allocator to Allocator

PiperOrigin-RevId: 687307167
2024-10-18 08:11:18 -07:00
Daniel Keysers c6384574db Fix PaliGemma's GenerateImageTokensT().
Move image related config values from LayerConfig to ModelConfig.
Minor changes: Add a few comments, remove gcpp:: qualification where it wasn't needed in a few places, define local constants in VitAttention.DotSoftmaxWeightedSum()

PiperOrigin-RevId: 687210519
2024-10-18 01:34:13 -07:00
Ray Smith 0d68555f87 Eliminated TConfig.
Changed CompressedLayer and CompressedWeights to be constructed with an instance of a LayerConfig and WeightsConfig respectively.
Added CompressedModel to remove ByteStorageT and get rid of most of the type casting, as well as allowing the default destructor to be used and work properly.
Adjusted WeightsWrapper and ForwardLayer etc to match.
The only remaining template arg is the weight type.
This enables all the instantiations to be deleted, apart from one per type.
It also enables (though not yet done) storing the config in the blob file instead of having to specify it separately.
Reduces the size of the gemma_lib and weights shared libraries by a factor of 4.3 and 3.2 respectively.

PiperOrigin-RevId: 686870060
2024-10-17 05:04:22 -07:00
Daniel Keysers a4d6adbc43 Introduce QueryResult in GemmaEnv and add a shortcut for WrapAndTokenize.
Remove max_tokens (and rely on only max_generated_tokens).

PiperOrigin-RevId: 685662260
2024-10-14 04:45:21 -07:00
Daniel Keysers 5d0167904d Fix PaliGemma model loading.
PiperOrigin-RevId: 685591935
2024-10-13 23:48:55 -07:00
Jan Wassenberg 6ab3ff5bde Minor cleanup, Windows+Bazel build fixes
add app.h comment
compress-inl: remove unused typedef
gemma-inl: add missing HWY_ATTR and cast
separate sum-inl.h and basics.h headers
replace more hwy::bfloat16_t with BF16
update include pragmas
update dot_test thresholds
update Highway version in Bazel for HWY_RCAST_ALIGNED fix
PiperOrigin-RevId: 684464326
2024-10-10 09:05:06 -07:00
Ray Smith 85958f5fd3 Added MatPtr/MatPtrT/MatStorageT/MatStorage as a dynamically-sized replacement for CompressedArray.
Definition of array size is moved to the constructor.
Allocation is separate and parallelized.
All users of weights_raw.h migrated to CompressedWeights and weights_raw.h deleted.
Replaced all previous ForEachTensor functions with a single unified function.

PiperOrigin-RevId: 684451604
2024-10-10 08:22:30 -07:00
Jan Wassenberg 2c28b18eb0 Add NestedPools: one per socket/cluster
Use in dot_test
app.h: add new flags and rename num_threads to max_threads
matmul: Parallelize MatMulSlow and enable spinning, more large/fewer medium test cases
PiperOrigin-RevId: 683216386
2024-10-07 09:40:19 -07:00
Jan Wassenberg bd53b0f7c3 Fix MSAN issue for multiturn. Rewind the prior EOS token.
Also move MaybeCheckInitialized to allocator.h

PiperOrigin-RevId: 683187458
2024-10-07 08:07:54 -07:00
Ray Smith 895ee4c6ce Moved Internal code around to simplify
PiperOrigin-RevId: 681877329
2024-10-03 07:55:21 -07:00
Krzysztof Ostrowski 12291e1ac0 Internal change.
PiperOrigin-RevId: 681583569
2024-10-02 14:03:34 -07:00
Krzysztof Ostrowski b3239bf509 Internal change.
PiperOrigin-RevId: 681530185
2024-10-02 11:33:06 -07:00
Jan Wassenberg 96d2ab7d31 Minor fix to profiler zone and add comment
PiperOrigin-RevId: 681350546
2024-10-02 01:37:50 -07:00
Jan Wassenberg 7d9fcda0d8 -467ms startup: parallel Reshape
Also split Softmax into Argmax helper, add comments;
add profiler zones + fix IDE warning

PiperOrigin-RevId: 680954573
2024-10-01 04:11:35 -07:00
Jan Wassenberg 2d14d796e3 1.09x decode speedup for topk=1/temp0: fuse softmax and sample
PiperOrigin-RevId: 680589099
2024-09-30 08:37:41 -07:00
Jan Wassenberg 897f902d28 Fix include order, required to build with profiler enabled
PiperOrigin-RevId: 680574177
2024-09-30 07:52:50 -07:00
Daniel Keysers 606427022c Fix compiler errors when trying to generate (unused) code for the ConfigNoVit struct.
PiperOrigin-RevId: 679049377
2024-09-26 01:55:26 -07:00
RangerUFO d1010337c3 Fix prefix-LM mode assertion 2024-09-25 22:22:28 +08:00
Daniel Keysers f8835fe4a4 Add support for PaliGemma Vision-LM (224x224) to gemma.cpp
See https://arxiv.org/abs/2407.07726 for a description of the model.
Because PaliGemma operates as a prefix-LM on the image+prompt, add support for that.

PiperOrigin-RevId: 677841119
2024-09-23 10:09:38 -07:00
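
A sketch of the prefix-LM attention rule mentioned in the commit above: the image+prompt prefix attends bidirectionally within itself, while generated tokens attend causally to everything up to themselves. The helper name is hypothetical; gemma.cpp encodes this differently internally.

```cpp
#include <cstddef>

// Sketch only: returns true if query position `q` may attend to key position
// `k` under a prefix-LM mask with a bidirectional prefix of length prefix_end.
bool PrefixLMCanAttend(size_t q, size_t k, size_t prefix_end) {
  if (q < prefix_end) return k < prefix_end;  // prefix: full bidirectional
  return k <= q;                              // suffix: standard causal
}
```
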
Jan Wassenberg 8c0a8834c1 Major compression update, arbitrary-len unpack + new Dot
Compression:
* Implement {any packed} x {bf16, f32} 'Load2' and DecompressAndZeroPad
* New compression test for all packed formats, add to GEMMA_TEST_FILES, remove from sfp/nuq_test
* Decompress->DecompressAndZeroPad, use PackedSpan for args with bounds checking
* NUQ: support arbitrary-length enc/dec
* New compression/shared, remove sfp.h and nuq.h
* Move Store2 into Traits and provide Compress2 wrapper
* Remove unused Decompress()-with-pool overload
* Simplify CompressedArrayLen, rename to CompressedArrayElements
* Remove unused DistortionStats b_l1_

Misc:
* Add compensated and Kahan dot, support any length
* Use same Dot function everywhere
* Move exact arithmetic functions into fp_arith
* use FloatPtr and MatPtr typedefs in tests; less stack usage
* Rename args to packed/raw
* Remove Traits::Name, instead TypeName<T>()
* Move kMaxSFP and kClusters/kGroupSize into Sfp/NuqStream
PiperOrigin-RevId: 672868468
2024-09-10 02:22:19 -07:00
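
The "compensated and Kahan dot" bullet in the commit above refers to Kahan summation: a running correction term feeds the rounding error of each addition back into the next one. A minimal scalar sketch (the repository's version is vectorized via Highway):

```cpp
#include <cstddef>

// Sketch only: Kahan-compensated dot product of two float arrays.
float KahanDot(const float* a, const float* b, size_t n) {
  float sum = 0.0f;
  float comp = 0.0f;  // running compensation for lost low-order bits
  for (size_t i = 0; i < n; ++i) {
    const float prod = a[i] * b[i];
    const float y = prod - comp;
    const float t = sum + y;
    comp = (t - sum) - y;  // recovers what was lost when adding y to sum
    sum = t;
  }
  return sum;
}
```
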
Jan Wassenberg c29e9752c7 Refactor/cleanup, remove even_odd
* New compression/shared.h, remove sfp.h
* Remove unused DistortionStats b_l1_
* Move exact arithmetic functions into fp_arith
* Remove even_odd optimization for MatVec (mostly unused)
* use BF16 typedef more widely
* Add kMaxSFP constant

PiperOrigin-RevId: 670996386
2024-09-04 09:25:13 -07:00
Daniel Keysers a8e08778d4 Add an additional QueryModel() overload to GemmaEnv.
Use args only in GemmaEnv constructor, store everything else in RuntimeConfig.
Add runtime option to turn off thread spinning.

PiperOrigin-RevId: 670467320
2024-09-03 02:25:19 -07:00
Zoltan Szabadka f6abbab3a4 Fix asan failure in local attention computation.
PiperOrigin-RevId: 670207380
2024-09-02 07:06:10 -07:00
Jan Wassenberg 4033ed9e78 Avoid duplication of RMSNorm, support all activation/weight types
Add test for RMSNorm
Rename VectorizedRopeAndMulBy -> RopeAndMulBy

Move test_util to util/

PiperOrigin-RevId: 668332927
2024-08-28 01:26:55 -07:00
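
For context on the RMSNorm consolidated above, a scalar sketch following the common Gemma convention of scaling by (1 + weight); the epsilon value and layout are assumptions, and the repository's vectorized version may differ.

```cpp
#include <cmath>
#include <cstddef>

// Sketch only: y[i] = x[i] / rms(x) * (1 + weight[i]), rms(x) = sqrt(mean(x^2) + eps).
void RMSNorm(const float* x, const float* weight, float* out, size_t dim,
             float eps = 1e-6f) {
  float ss = 0.0f;
  for (size_t i = 0; i < dim; ++i) ss += x[i] * x[i];
  const float inv_rms = 1.0f / std::sqrt(ss / dim + eps);
  for (size_t i = 0; i < dim; ++i) {
    out[i] = x[i] * inv_rms * (1.0f + weight[i]);
  }
}
```
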
Daniel Keysers 18e6012872 Fix prefill for batched queries.
This lets gemma_test/GeographyBatched pass now also for gemma2-27B.

PiperOrigin-RevId: 664827485
2024-08-19 08:50:42 -07:00
Apoorv Reddy c6eb3b6f0d VectorizedRopeAndMulBy.
~8x reduction (tested on a few prompts) in Rope time.
~3.8% prefill latency improvement.
~2.6% decode latency improvement.

PiperOrigin-RevId: 664650108
2024-08-18 23:17:01 -07:00
Paul Chang 773333e5be Expose underlying model configuration: number of layers, heads, etc.
PiperOrigin-RevId: 663747853
2024-08-16 09:03:24 -07:00
Jan Wassenberg 301dc8067a Major MatMul update, 1.9-2.3x speedup on Zen4 via bf16 mul
Supports converting all weight/activation formats to native MulT (bf16/f32)

Also:
- ConstMat/MutableMat for const correctness
- Move RowVectorBatch to allocator.h so it can be used from Matmul
- Add matmul.h so MatMulEnv can be used from Activations
- Remove kMaxThreads, detect from PerClusterPools
- Build fix: -inl.h files must be textual_hdrs, and highway.h should precede -inl.h

```
zen4 new
64, 24576, 3072, add=0, MatTA=bf16, MatTB=sfp:   616.6 GFLOPS.
64, 3072, 24576, add=0, MatTA=bf16, MatTB=sfp:   460.7 GFLOPS.
64, 24576, 3072, add=0, MatTA=f32, MatTB=sfp:    598.6 GFLOPS.
64, 3072, 24576, add=0, MatTA=f32, MatTB=sfp:    435.6 GFLOPS.

zen4 old
64, 24576, 3072, add=0, MatTA=f32, MatTB=sfp:    257.5 GFLOPS.
64, 3072, 24576, add=0, MatTA=f32, MatTB=sfp:    231.9 GFLOPS.
```

PiperOrigin-RevId: 663729812
2024-08-16 07:52:20 -07:00
Paul Chang b9ed12a325 Support directly observing activations, partially replacing LayersOutputFunc
LayersOutputFunc is no longer invoked for "blocks" and "final_norm" outputs.
Instead, we directly expose the Activations structure.

PiperOrigin-RevId: 663409316
2024-08-15 12:39:07 -07:00
Jan Wassenberg 22995c699d Simplify pos handling, auto-increment output arg
- no longer multiply by num_queries
- remove unused interleaved prompts
- Rename to Queries*
- Rename batch_start/interleaved_pos/pos to queries_pos

PiperOrigin-RevId: 663331823
2024-08-15 09:25:26 -07:00
Copybara-Service 6763afcd1c Merge pull request #348 from ufownl:feature/start_pos_per_query_reopen
PiperOrigin-RevId: 662533529
2024-08-13 08:51:06 -07:00
RangerUFO 8c634f6486 Fix the position calculation issue in the generation phase 2024-08-12 18:50:23 +02:00
RangerUFO 730b6bfc94 Implement `start_pos` per query for batch interface 2024-08-12 18:50:23 +02:00
Jan Wassenberg b831fa8482 1.3x prefill, 0.95x decode: matmul replacing last matvec
Before 38.28, 9.17 (with profiler enabled, prompt = 330 tok)
```
Gen.FFW                                 :      15414 x         4692352 = 24.166318
Gen.Attention.SumHeads                  :      15414 x         1394804 =  7.183451 !!
Gen.Embedding                           :        361 x        49961894 =  6.026297
Gen.Attention.QKV                       :      15414 x         1005125 =  5.176546
Gen.Attention.DotSoftmax                :      15414 x          885480 =  4.560357
RopeAndMulBy                            :     696528 x           11867 =  2.761818
```

After 49.80, 8.68
```
Gen.FFW                                 :      14448 x         5312783 = 25.646868
Gen.Embedding                           :        338 x        63044815 =  7.119845
Gen.Attention.QKV                       :      14448 x         1115003 =  5.382557
Gen.Attention.DotSoftmax                :      14448 x          897577 =  4.332957
RopeAndMulBy                            :     673344 x           11886 =  2.674156
Gen.Attention.SumHeads                  :      14448 x          518291 =  2.501993 !!
```
PiperOrigin-RevId: 662024085
2024-08-12 03:36:01 -07:00
Jan Wassenberg 282f73ec2f Add pin flag to disable pinning. Refs #338
PiperOrigin-RevId: 661389171
2024-08-09 13:47:12 -07:00
Apoorv Reddy fd1b0743a7 Rename Gemma9B and Gemma27B to Gemma2_9B and Gemma2_27B.
This is to make it clear that these models are part of the Gemma2 family of models.

PiperOrigin-RevId: 661181682
2024-08-09 02:09:06 -07:00
Jan Wassenberg 2ebbe4076f 1.03-1.08x decode speedup: precompute Rope theta, fuse
Split attention into functions, move into class.
Fuse Rope and MulBy, allow non-in-place version to avoid copy from q to KV.
Sink if() into MaybeLogitsSoftCap.

PiperOrigin-RevId: 661168418
2024-08-09 01:23:24 -07:00
The gemma.cpp Authors 27258b03e6 Improve performance logging
PiperOrigin-RevId: 660534330
2024-08-07 14:15:43 -07:00
Jan Wassenberg 5e433e774a 1.1x prefill speedup, revamp threading in preparation for hierarchical parallelism.
Limit thread counts to detected. Add max_clusters arg.
Update detection logic to check for smt0 - previously we pinned to some siblings.

PiperOrigin-RevId: 659755311
2024-08-05 18:50:09 -07:00
Phil Culliton 1982a6ba00 Internal change
PiperOrigin-RevId: 657831926
2024-07-30 20:24:54 -07:00
Jan Wassenberg a24eda8d02 Split matmul into matvec; add large matrix benchmark
Rename var names to row/col for more clarity.
Better estimate error tolerance via max abs col sum.

PiperOrigin-RevId: 657601791
2024-07-30 08:29:11 -07:00
Paul Chang d37c088e44 Extend LayersOutputFunc to take query index and auxiliary int
PiperOrigin-RevId: 657574814
2024-07-30 06:53:56 -07:00
Jan Wassenberg 8b4915f321 Fix Windows build - macro conflict with param name
PiperOrigin-RevId: 657518587
2024-07-30 03:22:32 -07:00
Jan Wassenberg 6ea4232b2e MatMul cleanup: Mat struct, simplify args.
Add large benchmark to test, use 4 threads, skip some targets.
Also use Traits::Name instead of typeid.

PiperOrigin-RevId: 657496185
2024-07-30 01:55:50 -07:00
Jan Wassenberg f27683152c 1.05x prefill speedup: matvec -> matmul for !MHA
Also add C_stride and make shape normal non-template arguments.

PiperOrigin-RevId: 657285945
2024-07-29 12:18:06 -07:00
Jan Wassenberg 2721f54446 Add offset arg to MatMul, rename, Matmul for logits = ~1.1x decode speedup
PiperOrigin-RevId: 657167257
2024-07-29 05:34:26 -07:00
Jan Wassenberg aaf51898b6 Major revamp #2 of Prefill: fix token order, parallel for multi-query
- Allocate only the required KV caches and activation batch size
- Add flags for batch sizes
- Const-correct interface: Span of const int.
- Also clean up the KVCache arg to a span.
- Move kPrefillBatchSize into RuntimeConfig and remove related global constants.

PiperOrigin-RevId: 655893197
2024-07-25 03:28:55 -07:00
Daniel Keysers 2346b5a434 Minor polishing: adding comments, renaming variables.
PiperOrigin-RevId: 655235006
2024-07-23 11:17:44 -07:00
Daniel Keysers 33334ad454 Fix msan uninitialized scale in optimize_test
PiperOrigin-RevId: 654817460
2024-07-22 10:50:25 -07:00
Jan Wassenberg 85cac13fb1 Split up ops.h into ops/ops-inl and matmul-inl
PiperOrigin-RevId: 654068303
2024-07-19 11:21:48 -07:00
Jan Wassenberg 5844e6a1e5 Cleanup: add wrapper functions and rename vars to interleaved
Simplifies the TransformerLayer function.
Use interleaved* instead of _and_queries.

PiperOrigin-RevId: 653929449
2024-07-19 02:04:11 -07:00
Jan Wassenberg 12016d31c3 Major Prefill/Generate cleanup, 1.3x Prefill speedup
This fixes TTFT, which was not including prefill.

PiperOrigin-RevId: 653690626
2024-07-18 11:16:46 -07:00
Jan Wassenberg 3fe79b3876 Fix msan uninitialized scale
PiperOrigin-RevId: 653655471
2024-07-18 09:42:31 -07:00
Daniel Keysers e87e65ca45 Add scale parameter to MatMul.
Add accessor to CompressedArray that asserts the scale is 1 and use it.

PiperOrigin-RevId: 653604840
2024-07-18 06:58:56 -07:00
Daniel Keysers 5a751a9a44 Update gemma-27b to the correct query scaling.
PiperOrigin-RevId: 653201646
2024-07-17 05:43:52 -07:00
Jan Wassenberg 992a2cbbc0 De-templatize Activations, add RowVectorBatch class
Also remove most kBatchSize args.

PiperOrigin-RevId: 653185525
2024-07-17 04:38:15 -07:00
Daniel Keysers ff34370aac Simplify FFW by using MatMul_4x4_Batch_Add.
Affects only the griffin model, where prefill TPS improves by about 70%.

PiperOrigin-RevId: 652878176
2024-07-16 09:41:23 -07:00
Jan Wassenberg cd530374b3 Further 1.02x prefill speedup from batch 64->512
Measured on SKX. Larger speedup expected for Zen4/SPR.

PiperOrigin-RevId: 652472928
2024-07-15 07:26:10 -07:00
The gemma.cpp Authors c879133a5a Increase the prefill batch size to 64.
PiperOrigin-RevId: 651754772
2024-07-12 06:28:37 -07:00
The gemma.cpp Authors df3fb70802 Improve readability with RepeatedAttentionWindowSizes
PiperOrigin-RevId: 651431738
2024-07-11 09:11:46 -07:00
Jan Wassenberg edaf61b983 SVE build fix: avoid capturing vectors directly.
Also use more V typedef instead of auto.

PiperOrigin-RevId: 651423685
2024-07-11 08:43:56 -07:00
Jan Wassenberg be765afce2 Simplify matmul: only 2 overloads
Also add StoreHorizontalSumsMaybeAdd wrapper function,
move MatMulSlowBatch into test.

1.02-1.06x speedup.

PiperOrigin-RevId: 651394791
2024-07-11 06:58:42 -07:00
Andrey Vlasov 3e92088595 Remove allocation from GEMM_4x4_Tile when decoding compressed weights by implementing
SfpCodec::Dec2F and CompressTraits<T>::Decompress2 for all supported types. This also allows removing one of the specializations of GEMM_4x4_Tile, handling compressed MatB with one function. As before, even when MatA is bf16, 32-bit registers are used for computations.

Measurements for a 2b-it sfp-encoded model on an AMD Ryzen Threadripper PRO 3945WX 12-core CPU:
baseline:
```
32.6254 prefill tokens / sec
8.91429 tokens / sec
115 milliseconds time to first token
```
this change:
```
54.3045 prefill tokens / sec
16.8191 tokens / sec
56 milliseconds time to first token
```
PiperOrigin-RevId: 651369694
2024-07-11 05:13:39 -07:00
Kan Wu f519ab6693 Refactor configurables.
PiperOrigin-RevId: 651259154
2024-07-10 21:30:58 -07:00
Andrey Vlasov 960ff4b4ec Record time measurements in MatMul tests.
PiperOrigin-RevId: 651060711
2024-07-10 10:04:40 -07:00
Daniel Keysers 063bbaa683 Add more comments to attention computation (and some small restructuring).
PiperOrigin-RevId: 650929097
2024-07-10 02:39:07 -07:00
Jan Wassenberg 6a3f7cf3ea Lint fix - string append, remove stale TODO
PiperOrigin-RevId: 650197468
2024-07-08 04:11:21 -07:00
Jan Wassenberg cbb67b4ee0 Move benchmark_helper to evals/, weights_raw to compression/.
PiperOrigin-RevId: 650155983
2024-07-08 01:13:23 -07:00
Jan Wassenberg 438b1bace2 Fix handling of %c and %q if eot_string. Fixes #283, thanks @ljcucc
PiperOrigin-RevId: 649651535
2024-07-05 07:54:00 -07:00
Jan Wassenberg 118e802b00 Fix gemma_test - moved to evals/.
PiperOrigin-RevId: 649338633
2024-07-04 02:04:05 -07:00
Jan Wassenberg c7c3daa624 7x compile time speedup: shard gemma.cc
Use overloaded functions defined in gemma/instantiations.
Also split out activations.h.

PiperOrigin-RevId: 649053122
2024-07-03 06:35:04 -07:00
Daniel Keysers a40165dea2 Small cleanups. Fixes gemma_test build.
PiperOrigin-RevId: 649008524
2024-07-03 03:13:38 -07:00
Kan Wu 7e4b20455e Add sliding window attention for Gemma 2.
PiperOrigin-RevId: 648778253
2024-07-02 11:08:03 -07:00
Jan Wassenberg 09a7e75ead Prep for sharding gemma.cc: split into kv_cache, tokenizer.
Move activations.h to backprop/ to make space for another activations.h.

PiperOrigin-RevId: 648744500
2024-07-02 09:31:06 -07:00
Jan Wassenberg 85fcd3cd80 Cleanup: add ModelInfo struct, remove gcpp::
PiperOrigin-RevId: 648707763
2024-07-02 07:11:15 -07:00
Jan Wassenberg b1c1ec1d59 Use benchmark_helper in py bindings (adds BOS)
Also remove thread clamp (OK to be zero or large).

PiperOrigin-RevId: 648657155
2024-07-02 03:27:15 -07:00
Jan Wassenberg e527e7662e Remove unused kSystemPrompt
PiperOrigin-RevId: 648429567
2024-07-01 11:18:07 -07:00
Jan Wassenberg af8eb2fde3 Declutter gemma/ directory, move binaries to evals/ and util/.
PiperOrigin-RevId: 648400795
2024-07-01 09:51:04 -07:00
Jan Wassenberg e588a7f45d Add config for att/final cap, skip max-subtract. Fixes #278
Also update includes/deps for backprop/.

PiperOrigin-RevId: 648399222
2024-07-01 09:45:26 -07:00
The gemma.cpp Authors da7507e6f0 Add prompt batching to Gemma.cpp.
This CL adds a new function to Gemma that allows for batching of multiple prompts. The function takes a vector of prompts and returns a vector of responses. The prompts are processed in parallel, and the responses are returned in the same order as the prompts.

PiperOrigin-RevId: 648367559
2024-07-01 07:51:31 -07:00
Paul Chang 8ac5d66575 Introduce new Gemma 9B and 27B configs
PiperOrigin-RevId: 647299080
2024-06-27 06:45:24 -07:00
Paul Chang 78e96fdc70 Refactor model type / training tables, simplify reverse mapping
PiperOrigin-RevId: 647069372
2024-06-26 13:59:14 -07:00
The gemma.cpp Authors 7fc8ddf825 Fix a clang tidy warning
PiperOrigin-RevId: 646498062
2024-06-25 09:02:59 -07:00
The gemma.cpp Authors 12089417b5 Improve logging when running Gemma examples: fix the issue where max_tokens, max_generated_tokens and temperature were logged without any trailing space/newline.
PiperOrigin-RevId: 646014268
2024-06-24 02:00:34 -07:00
The gemma.cpp Authors 80b1347393 Skip the last RMSNormInplaceBatched in the Prefill phase.
That only modifies activations.x, but it is called with prefill_activations which are not used after the Prefill call.

PiperOrigin-RevId: 645391387
2024-06-21 08:04:22 -07:00
Copybara-Service 82f16087ba Merge pull request #266 from ufownl:bugfix/kvcache
PiperOrigin-RevId: 645329504
2024-06-21 03:06:52 -07:00
RangerUFO f7855251ea Fix compilation errors in clang
These occur on the `ubuntu-latest` runner of GitHub Actions.
2024-06-21 13:40:40 +08:00
RangerUFO d7787c8f6c Fix KV cache size calculation error 2024-06-21 13:06:26 +08:00
Daniel Keysers 0570972d43 Fixing two typos.
PiperOrigin-RevId: 645103198
2024-06-20 11:33:12 -07:00
The gemma.cpp Authors a85725614a Refactor kCachePosSize and kCacheLayerSize into separate functors.
PiperOrigin-RevId: 645048519
2024-06-20 08:52:08 -07:00
Jan Wassenberg 48ebba8b7a Code cleanup
- Simplify template arg list, enable deduction
- missing hn:: on " Lanes"
- 1.0f suffix
- move RMSNormBatched into ops.h
- static constexpr -> constexpr
- concrete type instead of LayerT, WeightArrayT
- inline GetWeights
- remove if (runtime_config.verbosity
- merge AllocatePrefill and AllocateDecode
- remove bf_ffw_hidden

PiperOrigin-RevId: 644931277
2024-06-20 01:10:24 -07:00
The gemma.cpp Authors 658fb3e506 Move test placeholder to a later pos.
PiperOrigin-RevId: 644808456
2024-06-19 13:24:10 -07:00
The gemma.cpp Authors 0e612d9a20 Split out common parts (embedder and transformer block) from Prefill() and Transformer() into separate functions.
PiperOrigin-RevId: 644455520
2024-06-18 11:24:56 -07:00
Paul Chang d7d9d14f0e Move kGriffinLayers into ConfigNoSSM, set kGemmaLayers directly
For regular (non-SSM) Gemma models, kGriffinLayers is by definition always zero
and kGemmaLayers is just the number of layers.

PiperOrigin-RevId: 644384531
2024-06-18 07:52:52 -07:00
Jan Wassenberg 70506b0a62 Fix debug_prompt and other binaries (internal init)
PiperOrigin-RevId: 644367683
2024-06-18 06:48:59 -07:00
Jan Wassenberg 15135f5b3d Simplify Attention.
Shared kMHA, reuse from Activations,
inline Attn lambda, use QDim as the stride between successive Q.

PiperOrigin-RevId: 644343854
2024-06-18 05:08:12 -07:00
Jan Wassenberg 2ac47e4a06 Fix Py binding/run_example: use GemmaEnv
PiperOrigin-RevId: 644318962
2024-06-18 03:20:22 -07:00
Jan Wassenberg a07f60c9a1 1.15x 7b sfp prefill speedup: Matmul in attention
2b bf16:
prefill 114.456 -> 115.222
decode  16.8847 -> 16.9987

7b sfp:
prefill 18.8575 -> 21.7325
decode 5.68428 -> 5.79791

PiperOrigin-RevId: 644283676
2024-06-18 01:00:51 -07:00
Jan Wassenberg 704d936764 Further simplification to ForEachTensor, thanks I.K.
PiperOrigin-RevId: 643996210
2024-06-17 07:12:26 -07:00
Jan Wassenberg 7d0720675f Move raw_weights into separate header, used mainly by compress_weights.
Fix warnings in backprop/* (include)

PiperOrigin-RevId: 643983136
2024-06-17 06:17:02 -07:00
Jan Wassenberg ad790d89d1 Fix DASSERT - TiledBatch requires at least 2 vectors.
Also use shorthand for weight types.

PiperOrigin-RevId: 643958371
2024-06-17 04:29:01 -07:00
The gemma.cpp Authors 7dbfa44794 Refactor CompressedWeights.
PiperOrigin-RevId: 643934198
2024-06-17 02:54:54 -07:00
Ray Smith e0afdfa8fb Added bias vector addition to MatMul
PiperOrigin-RevId: 643385381
2024-06-14 10:25:16 -07:00
The gemma.cpp Authors 2228055bb8 Internal change.
PiperOrigin-RevId: 643330703
2024-06-14 06:53:41 -07:00
Jan Wassenberg 29c0c574e6 Integrate matmul into FFW: 4.3x prefill speedup
```
before, bf16:
27.2929 prefill tokens / sec
17.2114 tokens / sec

after, bf16
116.496 prefill tokens / sec
17.5391 tokens / sec
```

PiperOrigin-RevId: 643328437
2024-06-14 06:32:26 -07:00
Ray Smith 198326a682 Removed now redundant non-batch matmul
PiperOrigin-RevId: 643317187
2024-06-14 05:13:36 -07:00
Andrey Vlasov b17631c95f Implement a missing (bf16, f32) tiled MatMul kernel.
PiperOrigin-RevId: 643313676
2024-06-14 04:54:40 -07:00
Jan Wassenberg d3c6a45b59 Major duplicated code reduction in test/benchmarks
Helper functions to tokenize/wrap
Move LayersOutputFunc into RuntimeConfig
AcceptFunc passes the probability
Implement StringFromType using the parser, and verify results match

PiperOrigin-RevId: 643255119
2024-06-14 00:16:25 -07:00
Jan Wassenberg c15ff9529c Reduce duplication in Config* by inheriting no-SSM
PiperOrigin-RevId: 643030629
2024-06-13 09:48:56 -07:00
Ray Smith ea525da967 Added MatMul_4x4_Batch which is MatMul_4x4, but with the first template arg moved to the first function arg, so the batch size (num A rows) can be variable at run-time.
PiperOrigin-RevId: 643017973
2024-06-13 09:05:40 -07:00
The gemma.cpp Authors 1b40619864 Increase parallelism in ops_test
PiperOrigin-RevId: 643013415
2024-06-13 08:50:41 -07:00
Andrey Vlasov 38eb452b94 Support mixed (bf16, sfp) tiled MatMul. Same sfp-decompress strategy as in (f32,
sfp) tiled MatMul.

PiperOrigin-RevId: 642901844
2024-06-13 02:07:21 -07:00
Daniel Keysers 6e67a6d8a9 Tiny cleanup: distinguish between "ids" and "pieces" in argument names when encoding.
PiperOrigin-RevId: 642614278
2024-06-12 07:52:13 -07:00
Daniel Keysers 1ac9857014 Extends Transformer() to prepare for batched processing.
PiperOrigin-RevId: 642603025
2024-06-12 07:01:03 -07:00
The gemma.cpp Authors 2a0e6ee976 Fix numerical issue in Softcap by subtracting max.
Also update test threshold.

PiperOrigin-RevId: 642587468
2024-06-12 05:42:16 -07:00
The gemma.cpp Authors f467670de7 Implement float * SfpStream matmul by decompressing 4 * kColsA_RowsB-sized chunks of the second matrix.
PiperOrigin-RevId: 642533996
2024-06-12 01:11:59 -07:00
Ray Smith bdf33c7008 Updated benchmarks.cc to recent changes to Gemma API.
PiperOrigin-RevId: 642285902
2024-06-11 08:55:40 -07:00
Phil Culliton b6565e3bf6 Update AssertClose for large matrices and add large matrix test
PiperOrigin-RevId: 642277221
2024-06-11 08:22:47 -07:00
Jan Wassenberg 3e2396f98c Use Loader/AppArgs to construct gemma_test model, simplify AcceptFunc
accept_token: allow default, check if empty when using
allow mixing sample_func and stream_func, call the latter after the former
Also fix missing includes/deps.
PiperOrigin-RevId: 642240012
2024-06-11 05:53:10 -07:00
Daniel Keysers c557ad23a8 Adds simple-loop versions of missing batched functions.
PiperOrigin-RevId: 642189741
2024-06-11 02:14:02 -07:00
Jan Wassenberg c7f5e93136 Update benchmark with internal init
PiperOrigin-RevId: 641929308
2024-06-10 09:35:16 -07:00
Copybara-Service 49d814b519 Merge pull request #224 from szabadka:cleanup
PiperOrigin-RevId: 641922102
2024-06-10 09:11:13 -07:00
Jan Wassenberg c1c6714ad4 Internal experiment
PiperOrigin-RevId: 641915024
2024-06-10 08:46:10 -07:00
Zoltan Szabadka a3a75b77f9 Use CompressedWeights<TConfig<float>> in backpropagation.
kWeightsAreCompressed is removed and LoadRawWeights is moved
to compress_weights.cc
2024-06-10 14:34:24 +00:00
Phil Culliton c5bcb5438c Fix for transpose matrix creation and additional tests
PiperOrigin-RevId: 641868053
2024-06-10 05:24:04 -07:00
Jan Wassenberg 36e6915e18 Add CPU output, error if not C++17, simplify tokenizer ctor
PiperOrigin-RevId: 641850879
2024-06-10 04:01:11 -07:00
Phil Culliton d985d8b867 Shifting large matrix init to heap in ops_test.cc
PiperOrigin-RevId: 641311100
2024-06-07 11:38:42 -07:00
Jan Wassenberg f9b390b134 Support all weight types in a single binary.
This changes the command line flags, but the default value retains the previous behavior.

Also add a CreateGemma helper to enable extra args without interface changes.

PiperOrigin-RevId: 641266411
2024-06-07 09:04:45 -07:00
Copybara-Service 24db2ff725 Merge pull request #217 from szabadka:cross-entropy
PiperOrigin-RevId: 641241133
2024-06-07 07:17:35 -07:00
Daniel Keysers 06f814fc8b Small code cleanup suggestions while reading the code.
PiperOrigin-RevId: 641220788
2024-06-07 05:33:17 -07:00
Zoltan Szabadka 465998d25a Add support for custom sampling function to runtime config.
With this addition the ComputeCrossEntropy function can be moved
to its own library, because now we can compute it using only the
public API functions from gemma.h
2024-06-07 11:45:07 +00:00
Copybara-Service f7ac7092d6 Merge pull request #212 from szabadka:adam2
PiperOrigin-RevId: 641182573
2024-06-07 02:25:18 -07:00
Zoltan Szabadka c004799cdc Add Adam optimizer.
Drive-by: Fix compilation errors and tests for backprop functions.
2024-06-06 18:41:36 +00:00
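
A textbook sketch of the Adam update added above, with hyperparameter defaults from the original paper; the repository's backprop code may organize optimizer state differently.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch only: per-parameter Adam update with bias-corrected moment estimates.
struct Adam {
  float lr = 1e-3f, beta1 = 0.9f, beta2 = 0.999f, eps = 1e-8f;
  std::vector<float> m, v;  // first and second moment estimates
  size_t t = 0;             // step count

  void Step(std::vector<float>& w, const std::vector<float>& grad) {
    if (m.empty()) { m.assign(w.size(), 0.0f); v.assign(w.size(), 0.0f); }
    ++t;
    const float bc1 = 1.0f - std::pow(beta1, static_cast<float>(t));
    const float bc2 = 1.0f - std::pow(beta2, static_cast<float>(t));
    for (size_t i = 0; i < w.size(); ++i) {
      m[i] = beta1 * m[i] + (1.0f - beta1) * grad[i];
      v[i] = beta2 * v[i] + (1.0f - beta2) * grad[i] * grad[i];
      const float m_hat = m[i] / bc1;  // bias-corrected first moment
      const float v_hat = v[i] / bc2;  // bias-corrected second moment
      w[i] -= lr * m_hat / (std::sqrt(v_hat) + eps);
    }
  }
};
```
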
Jan Wassenberg 12707ade80 Toward only using compressed weights:
CompressedLayer should all be f32 when weights are f32.

PiperOrigin-RevId: 640954519
2024-06-06 11:00:23 -07:00
Paul Chang 6c0be20fa6 Fix Softmax on SVE
PiperOrigin-RevId: 640947138
2024-06-06 10:39:30 -07:00
The gemma.cpp Authors 39d4115717 Implement mixed mode matmul: f32 * bf16
PiperOrigin-RevId: 640940962
2024-06-06 10:21:46 -07:00
Jan Wassenberg 57c2cd8b52 Simplifications: remove GemmaInterface and GemmaImpl
Split common and weights into separate lib
Remove common-inl (does not have to be SIMD code), activations.cc
Centralize switch(Model) to avoid duplication
Move CompressWeightsT to compress_weights.cc
Move LoadWeights to weights.cc

PiperOrigin-RevId: 640869202
2024-06-06 05:54:21 -07:00
Jan Wassenberg 5c3e5f7038 Remove no longer required stats.h - use Highway version instead
PiperOrigin-RevId: 640440379
2024-06-05 01:37:48 -07:00
Paul Chang 175e389c3c Revert to HWY_ASSERT for lane constraints, qualify hn::Add
PiperOrigin-RevId: 640193239
2024-06-04 10:10:18 -07:00
Phil Culliton e71d82ead9 Fix for GenerateZeroMat call in TestTiledMatMul
PiperOrigin-RevId: 640180868
2024-06-04 09:32:23 -07:00
Zelalem Aweke 9e213b3d96 Use system topology to pin threads across clusters.
PiperOrigin-RevId: 640151974
2024-06-04 07:50:32 -07:00
Jan Wassenberg 4f9155d8c6 Add bf16 matmul support, update naming+test
Avoid int32, which can easily overflow for large matrices.
Also fix IDE warning in sfp-inl.

PiperOrigin-RevId: 640149845
2024-06-04 07:41:46 -07:00
Zoltan Szabadka df01700b54 Move the backpropagation code to its own directory 2024-06-04 10:20:16 +00:00
Zoltan Szabadka 3b4fa4a0e3 Use HWY_EXPORT_AND_DYNAMIC_DISPATCH_T where possible. 2024-06-04 09:18:56 +00:00
Zoltan Szabadka 8567978541 Address review comments 2024-06-04 08:37:54 +00:00
Zoltan Szabadka 7e639856da Fix compilation and tests for gcc 2024-06-04 08:37:54 +00:00
Zoltan Szabadka 36e4d8bbfe Add first version of backpropagation support.
This is still in progress / experimental: currently it is only
implemented for normal Gemma MQA attention layers, and no
parallelism has been added yet for the backward pass.

Since we need to remember all activations from all layers, the
forward pass was also reimplemented with a new activation data
structure.
2024-06-04 08:37:49 +00:00
Paul Chang ed8f39c058 Refactor GemmaImpl dispatch to use Highway 1.2's HWY_DYNAMIC_DISPATCH_T
PiperOrigin-RevId: 639793810
2024-06-03 08:32:29 -07:00
Jan Wassenberg a44cbdadc2 Update to Highway 1.2 for topology/VQSelect
Also fix unused-warning in compress-inl.

PiperOrigin-RevId: 639116915
2024-05-31 12:29:10 -07:00
Paul Chang 5feacf120c static_assert shape constraints in MatMul 4x4
PiperOrigin-RevId: 639069345
2024-05-31 10:02:45 -07:00
Phil Culliton c616abe628 Unrolled / tiled 4x4 MatMul
PiperOrigin-RevId: 638384686
2024-05-29 13:02:35 -07:00