gemma.cpp

Commit Graph

Author	SHA1	Message	Date
Jan Wassenberg	87a658b1c6	Minor cleanup, on-demand NUQ buffer allocation. threading_context: add profiler; compress-inl: add constexpr, on-demand alloc NUQ buffer; gemma_py: model->gemma; Move ScaleWeights to compress.cc; Move PromptWrapping to configs.h PiperOrigin-RevId: 748347896	2025-04-16 10:49:43 -07:00
The gemma.cpp Authors	7164a5e844	Internal change. PiperOrigin-RevId: 746953110	2025-04-12 20:27:49 -07:00
Jan Wassenberg	2e722f14f1	Add mmap support (not yet used). Also: const-correct ArgsBase, add assert to mat.h checking element_bytes_. BUILD deps update (:shared provides shared.h, not :sfp) PiperOrigin-RevId: 746073312	2025-04-10 10:03:40 -07:00
Jan Wassenberg	8532da47f7	Major refactor of allocator/args: use new ThreadingContext2 instead of monostate/init in each frontend. Add ThreadingArgs (replaces AppArgs); backprop: use Packed() accessor and MakePacked factory and row-based access to allow for stride; compress_weights: remove, moving to py-only exporter instead; Move MatPtr to mat.h and revise interface (Generic MatOwner; rename accessors to Packed*; support stride/row accessors, fix RowPtr stride); Add TypeBits(Type); Move GenerateMat to test_util-inl for sharing between matmul test/bench; Move internal init to gemma.cc to avoid duplication; Rename GemmaEnv model_ to gemma_ for disambiguating vs upcoming ModelStorage; Remove --compressed_weights, use --weights instead; tensor_index: add ExtentsFromInfo and TensorIndexLLM/Img; Allocator: use normal unique_ptr for AllocBytes so users can call directly; threading: use -> because AlignedPtr no longer assumes arrays PiperOrigin-RevId: 745918637	2025-04-10 01:29:54 -07:00
Copybara-Service	bef91a3f03	Merge pull request #529 from ufownl:refactor/wrap_and_tokenize PiperOrigin-RevId: 745174371	2025-04-08 09:22:26 -07:00
Jan Wassenberg	5d4f7e0f7e	Add new singleton Allocator2 instead of monostate. Not yet used. Also fix format-string warning in topology.cc. PiperOrigin-RevId: 745166210	2025-04-08 09:00:59 -07:00
Jan Wassenberg	4e6aa36e9b	Minor cleanup: enable 0,0 Extents2D, add SerializedSpan typedef, include fixes PiperOrigin-RevId: 745068776	2025-04-08 03:35:55 -07:00
RangerUFO	cc2e14e654	Improve `GemmaChatTemplate` to handle vision prompt wrapping	2025-03-29 11:31:40 +08:00
RangerUFO	c39295f497	Inline the ctor of `GemmaChatTemplate`	2025-03-29 11:31:40 +08:00
RangerUFO	d1615b56b2	Fix the prompt wrapping of gemma3-1b again. It seems that the previous fix was changed back due to a merge error.	2025-03-29 11:31:39 +08:00
RangerUFO	ca4ee2b63f	Refactor `WrapAndTokenize` to work properly with Gemma3	2025-03-29 11:31:39 +08:00
Jan Wassenberg	76a81ac2d6	Fix unaligned buffer causing crash on GCC. Thanks @ufownl, fixes #508 PiperOrigin-RevId: 741590339	2025-03-28 11:25:33 -07:00
Jan Wassenberg	e55734219d	Fix test threshold and improve warning output PiperOrigin-RevId: 740738937	2025-03-26 06:11:27 -07:00
Copybara-Service	4a924f1794	Merge pull request #527 from ufownl:feature/gemma2_secondary_eos PiperOrigin-RevId: 740327973	2025-03-25 06:44:41 -07:00
RangerUFO	d42deaa27c	Set the secondary EOS for Gemma2, so that we can remove the `<end_of_turn>` filter that was set up specifically for Gemma2.	2025-03-22 01:32:22 +08:00
RangerUFO	2bad79f110	Fix the EOS checking. The secondary EOS is usually `<end_of_turn>`, which can appear in the prompt, so it can only be checked outside the prompt.	2025-03-22 01:32:22 +08:00
Jan Wassenberg	6300c123ee	Update app argument documentation PiperOrigin-RevId: 739159864	2025-03-21 06:33:30 -07:00
Phil Culliton	05b1cce9f7	Add support for a secondary EOS token PiperOrigin-RevId: 738898976	2025-03-20 12:28:31 -07:00
Jan Wassenberg	83219e3c68	Add note on attention length and SFP PiperOrigin-RevId: 738698399	2025-03-20 00:39:06 -07:00
pculliton	3d419ec173	Merge pull request #523 from ufownl/bugfix/gemma3_1b_wrapping: Fix the prompt wrapping of gemma3-1b	2025-03-19 10:30:27 -04:00
RangerUFO	b16ce9a0b4	Fix the prompt wrapping of gemma3-1b	2025-03-18 16:52:38 +08:00
Jan Wassenberg	1b72c22345	Refactor Gemma ctor and improve pool NUMA support. Gemma receives a MatMulEnv arg, with comment on lifetime; Split threading into topology so the latter can be used in allocator; Add AllocClasses() for non-POD (ThreadPool); Support binding pool to NUMA node; Update threading_test with latency measurements. Also update Highway version. PiperOrigin-RevId: 736904748	2025-03-14 10:19:00 -07:00
Phil Culliton	1b1b63d560	Fix PaliGemma models. PiperOrigin-RevId: 736483021	2025-03-13 06:28:29 -07:00
Quirin Niedernhuber	0ff6b3123a	Point out Gemma 3 support in README.md PiperOrigin-RevId: 736125794	2025-03-12 07:33:30 -07:00
Jan Wassenberg	5898fa5eb0	Update github actions/cache version PiperOrigin-RevId: 736120661	2025-03-12 07:12:55 -07:00
Phil Culliton	4ab601da10	Internal change. PiperOrigin-RevId: 736015810	2025-03-11 23:20:20 -07:00
Phil Culliton	9d83ff202e	Internal change. PiperOrigin-RevId: 736014152	2025-03-11 23:10:48 -07:00
Jan Wassenberg	2bdf26d81d	Support bf16 output of Matmul. Adds Stride to ConstMat, to support decompression of C output for test; matmul_test: add line numbers to output. Also ignore "N is not a multiple of nc" when N==nc PiperOrigin-RevId: 731096662	2025-02-25 17:53:20 -08:00
Jan Wassenberg	b3b4b9f92f	With new matmul, much larger batch sizes are advantageous, default to 256. Can still override via command line argument. PiperOrigin-RevId: 730502653	2025-02-24 10:21:58 -08:00
Jan Wassenberg	9a2360d719	Move batch_bench into test section, add GTest dep. Fixes #501 PiperOrigin-RevId: 729494223	2025-02-21 05:33:52 -08:00
Jan Wassenberg	f9d93e4a42	Matmul rewrite: fp64 sums, hierarchical parallelization, cache-blocking, autotuning. Remove empty matmul_unit_test. Up to 25 TFLOP/s on 2xZen4 for 512,3072,24576. PiperOrigin-RevId: 729123576	2025-02-20 08:33:46 -08:00
Apoorv Reddy	d854471ae2	Use vectorized TopK using highway VQSelect PiperOrigin-RevId: 728159153	2025-02-18 05:01:39 -08:00
Apoorv Reddy	0e5b59d24d	Implements FusedSoftmaxAndSampleTopK. This computes softmax on the top-K logits, instead of computing softmax first and then getting top-K probs. So we end up avoiding renormalizing too. Additionally, modify softmax to do temperature scaling, if temp != 1.0 PiperOrigin-RevId: 727702149	2025-02-16 21:30:06 -08:00
Jan Wassenberg	bdf5d25e97	Only temporarily enable spinning in threading benchmark PiperOrigin-RevId: 727114863	2025-02-14 17:15:38 -08:00
Jan Wassenberg	06c70dccd9	Less verbose threading_test output, improve formatting. PiperOrigin-RevId: 726364085	2025-02-13 00:56:34 -08:00
Daniel Keysers	f173aa776e	Add conversion tool for HF safetensors to gemma.cpp for PaliGemma. PiperOrigin-RevId: 725990158	2025-02-12 03:47:43 -08:00
Copybara-Service	c495b25995	Merge pull request #493 from ufownl:bugfix/compress_weights_le PiperOrigin-RevId: 725585921	2025-02-11 05:10:13 -08:00
Apoorv Reddy	64cf6dfe0a	Using TimingInfo methods and cleaning up args to DecodeStepT PiperOrigin-RevId: 725580125	2025-02-11 04:49:14 -08:00
Jan Wassenberg	953c877658	Fix nuq Enc() to handle groups < kGroupSize. Also remove no longer required dynamic allocation. PiperOrigin-RevId: 725203824	2025-02-10 07:17:59 -08:00
Jan Wassenberg	5563d94811	Add fork/join latency benchmark PiperOrigin-RevId: 725174042	2025-02-10 05:23:44 -08:00
Apoorv Reddy	780e376023	Add KVCache.DeepCopy(). Will be useful for implementing sampling functionality like beam sampling, parallel sampling, CoT Decoding (à la https://arxiv.org/abs/2402.10200) PiperOrigin-RevId: 725156316	2025-02-10 04:10:29 -08:00
Apoorv Reddy	9b3e7ea8a2	Factor out DecodeStepT from GenerateT into a separate function. This will be useful for adding sampling functionality like beam decoding, parallel sampling, cot decoding (as described in the [Chain-of-Thought Reasoning Without Prompting paper](https://arxiv.org/abs/2402.10200)) PiperOrigin-RevId: 725151530	2025-02-10 03:53:08 -08:00
Jan Wassenberg	b0fe9a43e6	Further speed up blob_compare: single alloc, use dual sockets PiperOrigin-RevId: 724947361	2025-02-09 10:53:49 -08:00
RangerUFO	3a5a6dbcad	Fix the link error when building `compress_weights` with Clang on macOS	2025-02-09 00:13:25 +08:00
Jan Wassenberg	b18bd781f6	Windows build fixes: struct vs class, unused arg/var, avoid VLA, Deleter arg, casts PiperOrigin-RevId: 724340518	2025-02-07 07:38:55 -08:00
Oleh Prypin	82ca526c0c	Remove `srcs_version` and `python_version` attributes, as they already default to `"PY3"` PiperOrigin-RevId: 724122259	2025-02-06 16:51:11 -08:00
Jan Wassenberg	f31e12e63b	Improved blob diff: parallel, tolerance for float PiperOrigin-RevId: 724060325	2025-02-06 13:46:28 -08:00
Jan Wassenberg	9f5159ff68	Public visibility for compression/ PiperOrigin-RevId: 723529541	2025-02-05 08:53:51 -08:00
Phil Culliton	7ccc6abe87	Allow conversion, loading and inference with NUQ. PiperOrigin-RevId: 723507890	2025-02-05 07:45:54 -08:00
Phil Culliton	8a6edff319	Base interleaved handling for 4.5-bit NUQ, specifically Enc, DecompressAndZeroPad, and Dec2. Includes tests. PiperOrigin-RevId: 721821577	2025-01-31 10:35:32 -08:00

598 Commits